5,827 Matching Annotations
  1. Oct 2025
    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      This study presents cryoEM-derived structures of the Trypanosome aquaporin AQP2, in complex with its natural ligand, glycerol, as well as two trypanocidal drugs, pentamidine and melarsoprol, which use AQP2 as an uptake route. The structures are high quality, and the density for the drug molecules is convincing, showing a binding site in the centre of the AQP2 pore. 

      The authors then continue to study this system using molecular dynamics simulations. Their simulations indicate that the drugs can pass through the pore and identify a weak binding site in the centre of the pore, which corresponds with that identified through cryoEM analysis. They also simulate the effect of drug resistance mutations, which suggests that the mutations reduce the affinity for drugs and therefore might reduce the likelihood that the drugs enter into the centre of the pore, reducing the likelihood that they progress through into the cell. 

      While the cryoEM and MD studies are well conducted, it is a shame that the drug transport hypothesis was not tested experimentally. For example, did they do cryoEM with AQP2 with drug resistance mutations and see if they could see the drugs in these maps? They might not bind, but another possibility is that the binding site shifts, as seen in Chen et al. 

      TbAQP2 from the drug-resistant mutants does not transport either melarsoprol or pentamidine and there was thus no evidence to suggest that the mutant TbAQP2 channels could bind either drug. Moreover, there is not a single mutation that is characteristic for drug resistance in TbAQP2: references 12–15 show a plethora of chimeric AQP2/3 constructs in addition to various point mutations in laboratory strains and field isolates. In reference 17 we describe a substantial number of SNPs that reduced pentamidine and melarsoprol efficacy to levels that would constitute clinical resistance to acceptable dosage regimen. It thus appears that there are many and diverse mutations that are able to modify the protein sufficiently to induce resistance, and likely in multiple different ways, including the narrowing of the pore, changes to interacting amino acids, access to the pore etc. We therefore did not attempt to determine the structures of the mutant channels because we did not think that in most cases we would see any density for the drugs in the channel, and we would be unable to define ‘the’ resistance mechanism if we did in the case of one individual mutant TbAQP2. Our MD data suggests that pentamidine binding affinity is in the range of 50-300 µM for the mutant TbAQP2s selected for that test (I110W and L258Y/L264R), i.e. >1000-fold higher than TbAQP2WT. Thus these structures will be exceedingly challenging to determine with pentamidine in the pore but, of course, until the experiment has been tried we will not know for sure.

      Do they have an assay for measuring drug binding? 

      We tried many years ago to develop a <sup>3</sup>H-pentamidine binding assay to purified wild type TbAQP2 but we never got satisfactory results even though the binding should be in the doubledigit nanomolar range. This may be for any number of technical reasons and could also be partly because flexible di-benzamidines bind non-specifically to proteins at µM concentrations giving rise to high background. Measuring binding to the mutants was not tested given that they would be binding pentamidine in the µM range. If we were to pursue this further, then isothermal titration calorimetry (ITC) may be one way forward as this can measure µM affinity binding using unlabelled compounds, although it uses a lot of protein and background binding would need to be carefully assessed; see for example our work on measuring tetracycline binding to the tetracycline antiporter TetAB (https://doi.org/10.1016/j.bbamem.2015.06.026 ). Membrane proteins are also particularly tricky for this technique as the chemical activity of the protein solution must be identical to the chemical activity of the substrate solution which titrates in the molecule binding to the protein; this can be exceedingly problematic if any free detergent remains in the purified membrane protein. Another possibility may be fluorescence polarisation spectroscopy, although this would require fluorescently labelling the drugs which would very likely affect their affinity for TbAQP2 and how they interact with the wild type and mutant proteins – see the detailed SAR analysis in Alghamdi et al. 2020 (ref. 17). As you will appreciate, it would take considerable time and effort to set up an assay for measuring drug binding to mutants and is beyond the current scope of the current work.

      I think that some experimental validation of the drug binding hypothesis would strengthen this paper. Without this, I would recommend the authors to soften the statement of their hypothesis (i.e, lines 65-68) as this has not been experimentally validated.

      We agree with the referee that direct binding of drugs to the mutants would be very nice to have, but we have neither the time nor resources to do this. We have therefore softened the statement on lines 65-68 to read ‘Drug-resistant TbAQP2 mutants are still predicted to bind pentamidine, but the much weaker binding in the centre of the channel observed in the MD simulations would be insufficient to compensate for the high energy processes of ingress and egress, hence impairing transport at pharmacologically relevant concentrations.’ 

      Reviewer #2 (Public review): 

      Summary: 

      The authors present 3.2-3.7 Å cryo-EM structures of Trypanosoma brucei aquaglyceroporin-2 (TbAQP2) bound to glycerol, pentamidine, or melarsoprol and combine them with extensive allatom MD simulations to explain drug recognition and resistance mutations. The work provides a persuasive structural rationale for (i) why positively selected pore substitutions enable diamidine uptake, and (ii) how clinical resistance mutations weaken the high-affinity energy minimum that drives permeation. These insights are valuable for chemotherapeutic re-engineering of diamidines and aquaglyceroporin-mediated drug delivery. 

      My comments are on the MD part. 

      Strengths: 

      The study 

      (1) Integrates complementary cryo-EM, equilibrium, applied voltage MD simulations, and umbrella-sampling PMFs, yielding a coherent molecular-level picture of drug permeation. 

      (2) Offers direct structural rationalisation of long-standing resistance mutations in trypanosomes, addressing an important medical problem. 

      Weaknesses: 

      Unphysiological membrane potential. A field of 0.1 V nm ¹ (~1 V across the bilayer) was applied to accelerate translocation. From the traces (Figure 1c), it can be seen that the translocation occurred really quickly through the channel, suggesting that the field might have introduced some large changes in the protein. The authors state that they checked visually for this, but some additional analysis, especially of the residues next to the drug, would be welcome. 

      This is a good point from the referee, and we thank them for raising it. It is common to use membrane potentials in simulations that are higher than the physiological value, although these are typically lower than used here. The reason we used the higher value was to speed sampling and it still took 1,400 ns for transport in the physiologically correct direction, and even then, only in 1/3 repeats. Hence this choice of voltage was probably necessary to see the effect. The exceedingly slow rate of pentamidine permeation seen in the MD simulation was consistent with the experimental observations, as discussed in Alghamdi et al (2020) [ref. 17] where we estimated that TbAQP2-mediated pentamidine uptake in T. brucei bloodstream forms proceeds at just 9.5×10<sup>5</sup> molecules/cell/h; the number of functional TbAQP2 units in the plasma membrane is not known but their location is limited to the small flagellar pocket (Quintana et al. PLoS Negl Trop Dis 14, e0008458 (2020)). 

      The referee is correct that it is important to make sure that the applied voltage is not causing issues for the protein, especially for residues in contact with the drug. We have carried out RMSF analysis to better test this. The data show that comparing our simulations with the voltage applied to the monomeric MD simulations + PNTM with no voltage reveals little difference in the dynamics of the drug-contacting residues. 

      We have added these new data as Supplementary Fig12b with a new legend (lines1134-1138) 

      ‘b, RMSF calculations were run on monomeric TbAQP2 with either no membrane voltage or a 0.1V nm<sup>-1</sup> voltage applied (in the physiological direction). Shown are residues in contact with the pentamidine molecule, coloured by RMSF value. RMSF values are shown for residues Leu122, Phe226, Ile241, and Leu264. The data suggest the voltage has little impact on the flexibility or stability of the pore lining residues.’

      We have also added the following text to the manuscript (lines 524-530):

      ‘Membrane potential simulations were run using the computational electrophysiology protocol. An electric field of 0.1 V/nm was applied in the z-axis dimension only, to create a membrane potential of about 1 V (see Fig. S10a). Note that this is higher than the physiological value of 87.1 ± 2.1 mV at pH 7.3 in bloodstream T. brucei, and was chosen to improve the sampling efficiency of the simulations. The protein and lipid molecules were visually confirmed to be unaffected by this voltage, which we quantify using RMSF analysis on pentamidine-contacting residues (Fig. S12b).’ 

      Based on applied voltage simulations, the authors argue that the membrane potential would help get the drug into the cell, and that a high value of the potential was applied merely to speed up the simulation. At the same time, the barrier for translocation from PMF calculations is ~40 kJ/mol for WT. Is the physiological membrane voltage enough to overcome this barrier in a realistic time? In this context, I do not see how much value the applied voltage simulations have, as one can estimate the work needed to translocate the substrate on PMF profiles alone. The authors might want to tone down their conclusions about the role of membrane voltage in the drug translocation.

      We agree that the PMF barriers are considerable, however we highlight that other studies have seen similar landscapes, e.g. PMID 38734677 which saw a barrier of ca. 10-15 kcal/mol (ca. 4060 kJ/mol) for PNTM transversing the channel. This was reduced by ca. 4 kcal/mol when a 0.4 V nm ¹ membrane potential was applied, so we expect a similar effect to be seen here. 

      We have updated the Results to more clearly highlight this point and added the following text (lines 274-275):

      We note that previous studies using these approaches saw energy barriers of a similar size, and that these are reduced in the presence of a membrane voltage[17,31].’ 

      Pentamidine charge state and protonation. The ligand was modeled as +2, yet pKa values might change with the micro-environment. Some justification of this choice would be welcome. 

      Pentamidine contains two diamidine groups and each are expected to have a pKa above 10 in solution (PMID: 20368397), suggesting that the molecule will carry a +2 charge. Using the +2 charge is also in line with previous MD studies (PMID: 32762841). We have added the following text to the Methods (lines 506-509):

      ‘The pentamidine molecule used existing parameters available in the CHARMM36 database under the name PNTM with a charge state of +2 to reflect the predicted pKas of >10 for these groups [73] and in line with previous MD studies[17].’

      We note that accounting for the impact of the microenvironment is an excellent point – future studies might employ constant pH calculations to address this.

      The authors state that this RMSD is small for the substrate and show plots in Figure S7a, with the bottom plot being presumably done for the substrate (the legends are misleading, though), levelling off at ~0.15 nm RMSD. However, in Figure S7a, we see one trace (light blue) deviating from the initial position by more than 0.2 nm - that would surely result in an RMSD larger than 0.15, but this is somewhat not reflected in the RMSD plots. 

      The bottom plot of Fig. S9a (previously Fig. S7a) is indeed the RMSD of the drug (in relation to the protein). We have clarified the legend with the following text (lines 1037-1038): ‘… or for the pentamidine molecule itself, i.e. in relation to the Cα of the channel (bottom).’ 

      With regards the second comment, we assume the referee is referring to the light blue trace from Fig S9c. These data are actually for the monomeric channel rather than the tetramer. We apologise for not making this clearer in the legend. We have added the word ‘monomeric’ (line 1041).

      Reviewer #3 (Public review): 

      Summary: 

      Recent studies have established that trypanocidal drugs, including pentamidine and melarsoprol, enter the trypanosomes via the glyceroaquaporin AQP2 (TbAQP2). Interestingly, drug resistance in trypanosomes is, at least in part, caused by recombination with the neighbouring gene, AQP3, which is unable to permeate pentamidine or melarsoprol. The effect of the drugs on cells expressing chimeric proteins is significantly reduced. In addition, controversy exists regarding whether TbAQP2 permeates drugs like an ion channel, or whether it serves as a receptor that triggers downstream processes upon drug binding. In this study the authors set out to achieve three objectives: 

      (1) to determine if TbAQP2 acts as a channel or a receptor,

      We should clarify here that this was not an objective of the current manuscript as the transport activity has already been extensively characterised in the literature, as described in the introduction.

      (2) to understand the molecular interactions between TbAQP2 and glycerol, pentamidine, and melarsoprol, and 

      (3) to determine the mechanism by which mutations that arise from recombination with TbAQP3 result in reduced drug permeation. 

      Indeed, all three objectives are achieved in this paper. Using MD simulations and cryo-EM, the authors determine that TbAQP2 likely permeates drugs like an ion channel. The cryo-EM structures provide details of glycerol and drug binding, and show that glycerol and the drugs occupy the same space within the pore. Finally, MD simulations and lysis assays are employed to determine how mutations in TbAQP2 result in reduced permeation of drugs by making entry and exit of the drug relatively more energy-expensive. Overall, the strength of evidence used to support the author's claims is solid. 

      Strengths: 

      The cryo-EM portion of the study is strong, and while the overall resolution of the structures is in the 3.5Å range, the local resolution within the core of the protein and the drug binding sites is considerably higher (~2.5Å). 

      I also appreciated the MD simulations on the TbAQP2 mutants and the mechanistic insights that resulted from this data. 

      Weaknesses: 

      (1) The authors do not provide any empirical validation of the drug binding sites in TbAQP2. While the discussion mentions that the binding site should not be thought of as a classical fixed site, the MD simulations show that there's an energetically preferred slot (i.e., high occupancy interactions) within the pore for the drugs. For example, mutagenesis and a lysis assay could provide us with some idea of the contribution/importance of the various residues identified in the structures to drug permeation. This data would also likely be very valuable in learning about selectivity for drugs in different AQP proteins.

      On a philosophical level, we disagree with the requirement for ‘validation’ of a structure by mutagenesis. It is unclear what such mutagenesis would tell us beyond what was already shown experimentally through <sup>3</sup>H-pentamidine transport, drug sensitivity and lysis assays i.e. a given mutation will impact permeation to a certain extent. But on the structural level, what does mutagenesis tell us? If a bulky aromatic residue that makes many van der Waals interactions with the substrate is changed to an alanine residue and transport is reduced, what does this mean? It would confirm that the phenylalanine residue is very likely indeed making van der Waals contacts to the substrate, but we knew that already from the WT structure. And if it doesn’t have any effect? Well, it could mean that the van der Waals interactions with that particular residue are not that important or it could be that the substrate has changed its positions slightly in the channel and the new pose has similar energy of interactions to that observed in the wild type channel. Regardless of the result, any data from mutagenesis would be open to interpretation and therefore would not impact on the conclusions drawn in this manuscript. We might not learn anything new unless all residues interacting with the substrate are mutated, the structure of each mutant was determined and MD simulations were performed for all, which is beyond the scope of this work. Even then, the value for understanding clinical drug resistance would be limited, as this phenomenon has been linked to various chimeric rearrangements with adjacent TbAQP3 (references 12–15), each with a structure distinct from TbAQP2 with a single SNP. We also note that the recent paper by Chen et al. did not include any mutagenesis of the drug binding sites in TbAQP2 in their analysis of TbAQP2, presumably for similar reasons as discussed above.

      (2) Given the importance of AQP3 in the shaping of AQP2-mediated drug resistance, I think a figure showing a comparison between the two protein structures/AlphaFold structures would be beneficial and appropriate

      We agree that the comparison is of considerably interest and would contribute further to our understanding of the unique permeation capacities of TbAQP2. As such, we followed the reviewer’s suggestion and made an AlphaFold model of TbAQP3 and compared it to our structures of TbAQP2. The RMSD is 0.6 Å to the pentamidine-bound TbAQP2, suggesting that the fold of TbAQP3 has been predicted well, although the side chain rotamers cannot be assessed for their accuracy. Previous work has defined the selectivity filter of TbAQP3 to be formed by W102, R256, Y250. The superposition of the TbAQP3 model and the TbAQP2 pentamidine-bound structure shows that one of the amine groups is level with R256 and that there is a clash with Y250 and the backbone carbonyl of Y250, which deviates in position from the backbone of TbAQP2 in this region. There is also a clash with Ile252. 

      Although these observations are indeed interesting, on their own they are highly preliminary and extensive further work would be necessary to draw any convincing conclusions regarding these residues in preventing uptake of pentamidine and melarsoprol. The TbAQP3 AlphaFold model would need to be verified by MD simulations and then we would want to look at how pentamidine would interact with the channel under different experimental conditions like we have done with TbAQP2. We would then want to mutate to Ala each of the residues singly and in combination and assess them in uptake assays to verify data from the MD simulations. This is a whole new study and, given the uncertainties surrounding the observations of just superimposing TbAQP2 structure and the TbAQP3 model, we feel that, regrettably, this is just too speculative to add to our manuscript. 

      (3) A few additional figures showing cryo-EM density, from both full maps and half maps, would help validate the data. 

      Two new Supplementary Figures have been made, on showing the densities for each of the secondary structure elements (the new Figure S5) and one for the half maps showing the ligands (the new Figure S6). All the remaining supplementary figures have been renamed accordingly.

      (4) Finally, this paper might benefit from including more comparisons with and analysis of data published in Chen et al (doi.org/10.1038/s41467-024-48445-4), which focus on similar objectives. Looking at all the data in aggregate might reveal insights that are not obvious from either paper on their own. For example, melarsoprol binds differently in structures reported in the two respective papers, and this may tell us something about the energy of drug-protein interactions within the pore. 

      We already made the comparisons that we felt were most pertinent and included a figure (Fig. 5) to show the difference in orientation of melarsoprol in the two structures. We do not feel that any additional comparison is sufficiently interesting to be included. As we point out, the structures are virtually identical (RMSD 0.6 Å) and therefore there are no further mechanistic insights we would like to make beyond the thorough discussion in the Chen et al paper.

      Reviewer #1 (Recommendations for the authors): 

      (1) Line 65 - I don't think that the authors have tested binding experimentally, and so rather than 'still bind', I think that 'are still predicted to bind' is more appropriate. 

      Changed as suggested

      (2) Line 69 - remove 'and' 

      Changed as suggested

      (3) Line 111 - clarify that it is the protein chain which is 'identical'. Ligands not. 

      Changed to read ‘The cryo-EM structures of TbAQP2 (excluding the drugs/substrates) were virtually identical…

      (4) Line 186 - make the heading of this section more descriptive of the conclusion than the technique? 

      We have changed the heading to read: ‘Molecular dynamics simulations show impaired pentamidine transport in mutants’

      Reviewer #2 (Recommendations for the authors): 

      (1) Methods - a rate of 1 nm per ns is mentioned for pulling simulations, is that right? 

      Yes, for the generation of the initial frames for the umbrella sampling a pull rate of 1 nm/ns was used in either an upwards or downwards z-dimension

      (2) Figure S9 and S10 have their captions swapped. 

      The captions have been swapped to their proper positions.

      (3) Methods state "40 ns per window" yet also that "the first 50 ns of each window was discarded as equilibration". 

      Well spotted - this line should have read “the first 5 ns of each window was discarded as equilibration”. This has been corrected (line 541).

      Reviewer #3 (Recommendations for the authors): 

      (1) Abstract, line 68-70: incomplete sentence.

      The sentence has been re-written: ‘The structures of drug-bound TbAQP2 represent a novel paradigm for drug-transporter interactions and are a new mechanism for targeting drugs in pathogens and human cells.

      (2) Line 312-313: The paper you mention here came out in May 2024 - a year ago. I appreciate that they reported similar structural data, but for the benefit of the readers and the field, I would recommend a more thorough account of the points by which the two pieces of work differ. Is there some knowledge that can be gleaned by looking at all the data in the two papers together? For example, you report a glycerol-bound structure while the other group provides an apo one. Are there any mechanistic insights that can be gained from a comparison?

      We already made the comparisons that we felt were most pertinent and included a figure (Fig. 5) to show the difference in orientation of melarsoprol in the two structures. We do not feel that any additional comparison is sufficiently interesting to be included. As we point out, the structures are virtually identical (RMSD 0.6 Å) and therefore there are no further mechanistic insights we would like to make beyond the thorough discussion in the Chen et al paper.

      (3) Similarly, you can highlight the findings from your MD simulations on the TbAQP2 drug resistance mutants, which are unique to your study. How can this data help with solving the drug resistance problem?

      New drugs will need to be developed that can be transported by the mutant chimera AQP2s and the models from the MD simulations will provide a starting point for molecular docking studies. Further work will then be required in transport assays to optimise transport rather than merely binding. However, the fact that drug resistance can also arise through deletion of the AQP2 gene highlights the need for developing new drugs that target other proteins.

      (4) A glaring question that one has as a reader is why you have not attempted to solve the structures of the drug resistance mutants, either in complex with the two compounds or in their apo/glycerol-bound form? To be clear, I am not requesting this data, but it might be a good idea to bring this up in the discussion.

      TbAQP2 containing the drug-resistant mutants does not transport either melarsoprol or pentamidine (Munday et al., 2014; Alghamdi et al., 2020); there was thus no evidence to suggest that the mutant TbAQP2 channels could bind either drug. We therefore did not attempt to determine the structures of the mutant channels because we did not think that we would see any density for the drugs in the channel. Our MD data suggests that pentamidine binding affinity is in the range of 50-300 µM for the mutant TbAQP2, supporting the view that getting these structures would be highly challenging, but of course until the experiment is tried we will not know for sure.

      We also do not think we would learn anything new about doing structures of the drug-free structures of the transport-negative mutants of TbAQP2. The MD simulations have given novel insights into why the drugs are not transported and we would rather expand effort in this direction and look at other mutants rather than expend further effort in determining new structures.

      (5) Line 152-156: Is there a molecular explanation for why the TbAQP2 has 2 glycerol molecules captured in the selectivity filter while the PfAQP2 and the human AQP7 and AQP10 have 3?

      The presence of glycerol molecules represents local energy minima for binding, which will depend on the local disposition of appropriate hydrogen bonding atoms and hydrophobic regions, in conjunction with the narrowness of the channel to effectively bind glycerol from all sides. It is noticeable that the extracellular region of the channel is wider in TbAQP2 than in AQP7 and AQP10, so this may be one reason why additional ordered glycerol molecules are absent, and only two are observed. Note also that the other structures were determined by X-ray crystallography, and the environment of the crystal lattice may have significantly decreased the rate of diffusion of glycerol, increasing the likelihood of observing their electron densities.

      (6) I would also think about including the 8JY7 (TbAQP2 apo) structure in your analysis.

      We included 8JY7 in our original analyses, but the results were identical to 8JY6 and 8JY8 in terms of the protein structure, and, in the absence of any modelled substrates in 8JY7 (the interesting part for our manuscript), we therefore have not included the comparison.

      (7) I also think, given the importance of AQP3 in this context, it would be really useful to have a comparison with the AQP3 AlphaFold structure in order to examine why it does not permeate drugs.

      We made an AlphaFold model of TbAQP3 and compared it to our structures of TbAQP2. The RMSD is 0.6 Å to the pentamidine-bound TbAQP2, suggesting that the fold of TbAQP3 has been predicted well, although the side chain rotamers cannot be assessed for their accuracy. Previous work has defined the selectivity filter of TbAQP3 to be formed by W102, R256, Y250. The superposition of the TbAQP3 model and the TbAQP2 pentamidine-bound structure shows that one of the amine groups is level with R256 and that there is a clash with Y250 and the backbone carbonyl of Y250, which deviates in position from the backbone of TbAQP2 in this region. There is also a clash with Ile252. 

      Although these observations are interesting, on their own they are preliminary in the extreme and extensive further work will be necessary to draw any convincing conclusions regarding these residues in preventing uptake of pentamidine and melarsoprol. The TbAQP3 AlphaFold model would need to be verified by MD simulations and then we would want to look at how pentamidine would interact with the channel under different experimental conditions like we have done with TbAQP2. We would then want to mutate to Ala each of the residues singly and in combination and assess them in uptake assays to verify data from the MD simulations. This is a whole new study and, given the uncertainties surrounding the observations of just superimposing TbAQP2 structure and the TbAQP3 model, we feel this is just too speculative to add to our manuscript. 

      (8) To validate the densities representing glycerol and the compounds, you should show halfmap densities for these. 

      A new figure, Fig S6 has been made to show the half-map densities for the glycerol and drugs.

      (9) I would also like to see the density coverage of the individual helices/structural elements. 

      A new figure, Fig S5 has been made to show the densities for the structural elements.

      (10) While the LigPlot figure is nice, I think showing the data (including the cryo-EM density) is necessary validation.

      The LigPlot figure is a diagram (an interpretation of data) and does not need the densities as these have already been shown in Fig. 1c (the data).

      (11) I would recommend including a figure that illustrates the points described in lines 123-134.

      All of the points raised in this section are already shown in Fig. 2a, which was referred to twice in this section. We have added another reference to Fig.2a on lines 134-135 for completeness.

      (12) Line 202: I would suggest using "membrane potential/voltage" to avoid confusion with mitochondrial membrane potential. 

      We have changed this to ‘plasma membrane potential’ to differentiate it from mitochondrial membrane potential.

      (13) Figure 4: Label C.O.M. in the panels so that the figure corresponds to the legend. 

      We have altered the figure and added and explanation in the figure legend (lines 716-717):

      ‘Cyan mesh shows the density of the molecule across the MD simulation. and the asterisk shows the position of the centre of mass (COM).’

      (14) Figure S2: Panels d and e appear too similar, and it is difficult to see the stick representation of the compound. I would recommend either using different colours or showing a close-up of the site.

      We have clarified the figure by including two close-up views of the hot-spot region, one with melarsoprol overlaid and one with pentamidine overlaid

      (15) Figure S2: Typo in legend: 8YJ7 should be 8JY7.

      Changed as suggested  

      (16) Figure S3 and Figure S4: Please clarify which parts of the process were performed in cryoSPARC and which in Relion. 

      Figure S3 gives an overview of the processing and has been simplified to give the overall picture of the procedures. All of the details were included in the Methods section as other programmes are used, not just cryoSPARC and Relion. Given the complexities of the processing, we have referred the readers to the Methods section rather than giving confusing information in Fig. S3.

      We have updated the figure legend to Fig. S4 as requested.

      (17) Figure S9 and Figure S10: The legends are swapped in these two figures.

      The captions have been swapped to their proper positions.

      (18) For ease of orientation and viewing, I would recommend showing a vertical HOLE plot aligned with an image of the AQP2 pore. 

      The HOLE plot has been re-drawn as suggest (Fig. S2)

    1. Author response:

      We are very pleased to hear the overall positive views and constructive criticisms of eLife Editors and Reviewers on our work. In particular, we appreciate their global assessment that the work is important for understanding how body plan cues shape sensorimotor behavioural patterns, that the strength of evidence is solid, and their views that our experimental toolkit will be useful to others. We also very much appreciate eLife’s assessment that our findings will be of broad interest to researchers studying neural circuits, developmental genetics, and the evolution of behaviour.

      Regarding Reviewer 1, we thank them for their positive comments on the value of our study, highlighting that our paper addresses an important question using an elegant and innovative combination of methods, which leads to clear insights into the sensory biology of self-righting, which they consider shall be useful for others in the field. We are also very pleased to hear that they consider that our study makes a substantial contribution to understanding how animals correct their body position and that the manuscript is very clearly written and couched in interesting biology. In a revised version of the manuscript, we will consider some of the interesting points raised by Rev1, including the possibility of conducting new experiments using neuronal subset-specific Gal4s, to establish whether daIV sensory neurons are also acting in a regionally specific manner along the A-P axis.

      Turning to the comments by Rev2, we are grateful to them for considering that our experimental design is elegant, and that it introduces innovative methods that will likely benefit the fly behavior community, and the results are robustly supported. In connection to other comments, in a revised manuscript we will consider addressing the question of whether normal levels of expression of the Hox gene Antennapedia within the daIV domain are essential for self-righting. We will also seek to add technical replicates to our Hox expression molecular analysis, amend typos and incorporate several of the constructive corrections mentioned.

    1. Author response:

      Reviewer #1:

      Indicated the paper provided a strong analysis of RNAseq databases to provide a biological context and resource for the massive amounts of data in the field on RNA editing. The reviewer noted that future studies will be important to define the functional consequences of the individual edits and why the RNA editing rules we identified exist. We address these comments below.

      (1) The reviewer wondered about the role of noncanonical editing to neuronal protein expression.

      Indeed, the role of noncanonical editing has been poorly studied compared to the more common A-to-I ADAR-dependent editing. Most non-canonical coding edits we found actually caused silent changes at the amino acid level, suggesting evolutionary selection against this mechanism as a pathway for generating protein diversity. As such, we suspect that most of these edits are not altering neuronal function in significant ways. Two potential exceptions to this were non-canonical edits that altered conserved residues in the synaptic proteins Arc1 and Frequenin 1. The C-to-T coding edit in the activity-regulated Arc1 mRNA that encodes a retroviral-like Gag protein involved in synaptic plasticity resulted in a P124L amino acid change (see Author response image 1 panel A below). ~50% of total Arc1 mRNA was edited at this site in both Ib and Is neurons, suggesting a potentially important role if the P124L change alters Arc1 structure or function. Given Arc1 assembles into higher order viral-like capsids, this change could alter capsid formation or structure. Indeed, P124 lies in the hinge region separating the N- and C-terminal capsid assembly regions (panel B) and we hypothesize this change will alter the ability of Arc1 capsids to assemble properly. We plan to experimentally test this by rescuing Arc1 null mutants with edited versus unedited transgenes to see how the previously reported synaptic phenotypes are modified. We also plan to examine the ability of the change to alter Arc1 capsid assembly in a collaboration using CyroEM.

      Author response image 1.

      A. AlphaFold predictions of Drosophila Arc1 and Frq1 with edit site noted. B. Structure of the Drosophila Arc1 capsid. Monomeric Arc1 conformation within the capsid is shown on the right with the location of the edit site indicated.

      The other non-canonical edit (G-to-A) that stood out was in Frequenin 1 (Frq1), a multi-EF hand containing Ca<sup>2+</sup> binding protein that regulates synaptic transmission, that resulted in a G2E amino acid substitution (location within Frq1shown in panel A above). This glycine residue is conserved in all Frq homologs and is the site of N-myristoylation, a co-translational lipid modification to the glycine after removal of the initiator methionine by an aminopeptidase. Myristoylation tethers Frq proteins to the plasma membrane, with a Ca<sup>2+</sup>-myristoyl switch allowing some family members to cycle on and off membranes when the lipid domain is sequestered in the absence of Ca<sup>2+</sup>. Although the G2E edit is found at lower levels (20% in Ib MNs and 18% in Is MNs), it could create a pool of soluble Frq1 that alters it’s signaling. We plan to functionally assay the significance of this non-canonical edit as well. Compared to edits that alter amino acid sequence, determining how non canonical editing of UTRs might regulate mRNA dynamics is a harder question at this stage and will require more experimental follow-up.

      (2) The reviewer noted the last section of the results might be better split into multiple parts as it reads as a long combination of two thoughts.

      We agree with the reviewer that the last section is important, but it was disconnected a bit from the main story and was difficult for us to know exactly where to put it. All the data to that point in the paper was collected from our own PatchSeq analysis from individual larval motoneurons. We wanted to compare these results to other large RNAseq datasets obtained from pooled neuronal populations and felt it was best to include this at the end of the results section, as it no longer related to the rules of RNA editing within single neurons. We used these datasets to confirm many of our edits, as well as find evidence for some developmental and neuron-specific cell type edits. We also took advantage of RNAseq from neuronal datasets with altered activity to explore how activity might alter the editing machinery. We felt it better to include that data in this final section given it was not collected from our original PatchSeq approach.

      Reviewer #2:

      Noted the study provided a unique opportunity to identify RNA editing sites and rates specific to individual motoneuron subtypes, highlighting the RNAseq data was robustly analyzed and high-confidence hits were identified and compared to other RNAseq datasets. The reviewer provided some suggestions for future experiments and requested a few clarifications.

      (1) The reviewer asked about Figure 1F and the average editing rate per site described later in the paper.

      Indeed, Figure 1F shows the average editing rate for each individual gene for all the Ib and Is cells, so we primarily use that to highlight the variability we find in overall editing rate from around 20% for some sites to 100% for others. The actual editing rate for each site for individual neurons is shown in Figure 4D that plots the rate for every edit site and the overall sum rate for that neuron in particular.

      (2) The reviewer also noted that it was unclear where in the VNC the individual motoneurons were located and how that might affect editing.

      The precise segment of the larvae for every individual neuron that was sampled by Patch-seq was recorded and that data is accessible in the original Jetti et al 2023 paper if the reader wants to explore any potential anterior to posterior differences in RNA editing. Due to the technical difficulty of the Patch-seq approach, we pooled all the Ib and Is neurons from each segment together to get more statistical power to identify edit sites. We don’t believe segmental identify would be a major regulator of RNA editing, but cannot rule it out.

      (3) The reviewer also wondered if including RNAs located both in the nucleus and cytoplasm would influence editing rate.

      Given our Patch-seq approach requires us to extract both the cytoplasm and nucleus, we would be sampling both nuclear and cytoplasmic mRNAs. However, as shown in Figure 8 – figure supplement 3 D-F, the vast majority of our edits are found in both polyA mRNA samples and nascent nuclear mRNA samples from other datasets, indicating the editing is occurring co-transcriptionally and within the nucleus. As such, we don't think the inclusion of cytoplasmic mRNA is altering our measured editing rates for most sites. This may not be true for all non-canonical edits, as we did see some differences there, indicating some non-canonical editing may be happening in the cytoplasm as well.

      Reviewer #3:

      indicated the work provided a valuable resource to access RNA editing in single neurons. The reviewer suggested the value of future experiments to demonstrate the effects of editing events on neuronal function. This will be a major effort for us going forwards, as we indeed have already begun to test the role of editing in mRNAs encoding several presynaptic proteins that regulate synaptic transmission. The reviewer also had several other comments as discussed below.

      (1) The reviewer noted that silent mutations could alter codon usage that would result in translational stalling and altered protein production.

      This is an excellent point, as silent mutations in the coding region could have a more significant impact if they generate non-preferred rare codons. This is not something we have analyzed, but it certainly is worth considering in future experiments. Our initial efforts are on testing the edits that cause predictive changes in presynaptic proteins based on the amino acid change and their locale in important functional domains, but it is worth considering the silent edits as well as we think about the larger picture of how RNA editing is likely to impact not only protein function but also protein levels.

      (2) The reviewer noted future studies could be done using tools like Alphafold to test if the amino acid changes are predicted to alter the structure of proteins with coding edits.

      This is an interesting approach, though we don’t have much expertise in protein modeling at that level. We could consider adding this to future studies in collaboration with other modeling labs.

      (3) The reviewer wondered if the negative correlation between edits and transcript abundance could indicate edits might be destabilizing the transcripts.

      This is an interesting idea, but would need to be experimentally tested. For the few edits we have generated already to begin functionally testing, including our published work with editing in the C-terminus of Complexin, we haven’t seen a change in mRNA levels causes by these edits. However, it would not be surprising to see some edits reducing transcript levels. A set of 5’UTR edits we have generated in Syx1A seem to be reducing protein production and may be acting in such a manner.

      (4) The reviewer wondered if the proportion of edits we report in many of the figures is normalized to the length of the transcript, as longer transcripts might have more edits by chance.

      The figures referenced by the reviewer (1, 2 and 7) show the number of high-confidence editing sites that fall into the 5’ UTR, 3’ UTR, or CDS categories. Our intention here was to highlight that the majority of the high confidence edits that made it through the stringent filtering process were in the coding region. This would still be true if we normalized to the length of the given gene region. However, it would be interesting to know if these proportions match the expected proportions of edits in these gene regions given a random editing rate per gene region length across the Drosophila genome, although we did not do this analysis.    

      (5) The reviewer noted that future studies could expand on the work to examine miRNA or other known RBP binding sites that might be altered by the edits.

      This is another avenue we could pursue in the future. We did do this analysis for a few of the important genes encoding presynaptic proteins (these are the most interesting to us given the lab’s interest in the synaptic vesicle fusion machinery), but did not find anything obvious for this smaller subset of targets.

      (6) The reviewer suggested sequence context for Adar could also be investigated for the hits we identified.

      We haven’t pursued this avenue yet, but it would be of interest to do in the future. In a similar vein, it would be informative to identify intron-exon base pairing that could generate the dsDNA template on which ADAR acts.

      (7) The reviewer noted the disconnect between Adar mRNA levels and overall editing levels reported in Figure 4A/B.

      Indeed, the lack of correlation between overall editing levels and Adar mRNA abundance has been noted previously in many studies. For the type of single cell Patch-seq approach we took to generate our RNAseq libraries, the absolute amount of less abundant transcripts obtained from a single neuron can be very noisy. As such, the few neurons with no detectable Adar mRNA are likely to represent that single neuron noise in the sampling. Per the reviewer’s question, these figure panels only show A-to-I edits, so they are specific to ADAR.

      (8) The reviewer notes the scale in Figure 5D can make it hard to visualize the actual impact of the changes.

      The intention of Figure 5D was to address the question of whether sites with high Ib/Is editing differences were simply due to higher Ib or Is mRNA expression levels. If this was the case, then we would expect to see highly edited sites have large Ib/Is TPM differences. Instead, as the figure shows, the vast majority of highly-edited sites were in mRNAs that were NOT significantly different between Ib and Is (red dots in graph) and are therefore clustered together near “0 Difference in TPMs”. TPMs and editing levels for all edit sites can be found in Table 1, and a visualization of these data for selected sites is shown in Figure 5E.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review):

      In this manuscript, Hoon Cho et al. present a novel investigation into the role of PexRAP, an intermediary in ether lipid biosynthesis, in B cell function, particularly during the Germinal Center (GC) reaction. The authors profile lipid composition in activated B cells both in vitro and in vivo, revealing the significance of PexRAP. Using a combination of animal models and imaging mass spectrometry, they demonstrate that PexRAP is specifically required in B cells. They further establish that its activity is critical upon antigen encounter, shaping B cell survival during the GC reaction. Mechanistically, they show that ether lipid synthesis is necessary to modulate reactive oxygen species (ROS) levels and prevent membrane peroxidation.

      Highlights of the Manuscript:

      The authors perform exhaustive imaging mass spectrometry (IMS) analyses of B cells, including GC B cells, to explore ether lipid metabolism during the humoral response. This approach is particularly noteworthy given the challenge of limited cell availability in GC reactions, which often hampers metabolomic studies. IMS proves to be a valuable tool in overcoming this limitation, allowing detailed exploration of GC metabolism.

      The data presented is highly relevant, especially in light of recent studies suggesting a pivotal role for lipid metabolism in GC B cells. While these studies primarily focus on mitochondrial function, this manuscript uniquely investigates peroxisomes, which are linked to mitochondria and contribute to fatty acid oxidation (FAO). By extending the study of lipid metabolism beyond mitochondria to include peroxisomes, the authors add a critical dimension to our understanding of B cell biology.

      Additionally, the metabolic plasticity of B cells poses challenges for studying metabolism, as genetic deletions from the beginning of B cell development often result in compensatory adaptations. To address this, the authors employ an acute loss-of-function approach using two conditional, cell-type-specific gene inactivation mouse models: one targeting B cells after the establishment of a pre-immune B cell population (Dhrs7b^f/f, huCD20-CreERT2) and the other during the GC reaction (Dhrs7b^f/f; S1pr2-CreERT2). This strategy is elegant and well-suited to studying the role of metabolism in B cell activation.

      Overall, this manuscript is a significant contribution to the field, providing robust evidence for the fundamental role of lipid metabolism during the GC reaction and unveiling a novel function for peroxisomes in B cells. 

      Comments on revisions:

      There are still some discrepancies in gating strategies. In Fig. 7B legend (lines 1082-1083), they show representative flow plots of GL7+ CD95+ GC B cells among viable B cells, so it is not clear if they are IgDneg, as the rest of the GC B cells aforementioned in the text.

      We apologize for missing this item in need of correction in the revision and sincerely thank the reviewer for the stamina and care in picking this up. The data shown in Fig. 7B represented cells (events) in the IgD<sup>neg</sup> Dump<sup>neg</sup> viable lymphoid gate. We will correct this omission/blemish in the final revision that becomes the version of record.

      Western blot confirmation: We understand the limitations the authors enumerate. Perhaps an RT-qPCR analysis of the Dhrs7b gene in sorted GC B cells from the S1PR2-CreERT2 model could be feasible, as it requires a smaller number of cells. In any case, we agree with the authors that the results obtained using the huCD20-CreERT2 model are consistent with those from the S1PR2-CreERT2 model, which adds credibility to the findings and supports the conclusion that GC B cells in the S1PR2-CreERT2 model are indeed deficient in PexRAP.

      We will make efforts to go back through the manuscript and highlight this limitation to readers, i.e., that we were unable to get genetic evidence to assess what degree of "counter-selection" applied to GC B cells in our experiments.

      We agree with the referee that optimally to support the Imaging Mass Spectrometry (IMS) data showing perturbations of various ether lipids within GC after depletion of PexRAP, it would have been best if we could have had a qRT2-PCR that allowed quantitation of the Dhrs7b-encoded mRNA in flow-purified GC B cells, or the extent to which the genomic DNA of these cells was in deleted rather than 'floxed' configuration.

      While the short half-life of ether lipid species leads us to infer that the enzymatic function remains reduced/absent, it definitely is unsatisfying that the money for experiments ran out in June and the lab members had to move to new jobs.

      Lines 222-226: We believe the correct figure is 4B, whereas the text refers to 4C.

      As for the 1st item, we apologize and will correct this error.

      Supplementary Figure 1 (line 1147): The figure title suggests that the data on T-cell numbers are from mice in a steady state. However, the legend indicates that the mice were immunized, which means the data are not from steady-state conditions. 

      We will change the wording both on line 1147 and 1152.

      Reviewer #2 (Public review):

      Summary:

      In this study, Cho et al. investigate the role of ether lipid biosynthesis in B cell biology, particularly focusing on GC B cell, by inducible deletion of PexRAP, an enzyme responsible for the synthesis of ether lipids.

      Strengths:

      Overall, the data are well-presented, the paper is well-written and provides valuable mechanistic insights into the importance of PexRAP enzyme in GC B cell proliferation.

      Weaknesses:

      More detailed mechanisms of the impaired GC B cell proliferation by PexRAP deficiency remain to be further investigated. In minor part, there are issues for the interpretation of the data which might cause confusions by readers.

      Comments on revisions:

      The authors improved the manuscript appropriately according to my comments.

      To re-summarize, we very much appreciate the diligence of the referees and Editors in re-reviewing this work at each cycle and helping via constructive peer review, along with their favorable comments and overall assessments. The final points will be addressed with minor edits since there no longer is any money for further work and the lab people have moved on.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In this manuscript, Sung Hoon Cho et al. presents a novel investigation into the role of PexRAP, an intermediary in ether lipid biosynthesis, in B cell function, particularly during the Germinal Center (GC) reaction. The authors profile lipid composition in activated B cells both in vitro and in vivo, revealing the significance of PexRAP. Using a combination of animal models and imaging mass spectrometry, they demonstrate that PexRAP is specifically required in B cells. They further establish that its activity is critical upon antigen encounter, shaping B cell survival during the GC reaction. 

      Mechanistically, they show that ether lipid synthesis is necessary to modulate reactive oxygen species (ROS) levels and prevent membrane peroxidation.

      Highlights of the Manuscript:

      The authors perform exhaustive imaging mass spectrometry (IMS) analyses of B cells, including GC B cells, to explore ether lipid metabolism during the humoral response. This approach is particularly noteworthy given the challenge of limited cell availability in GC reactions, which often hampers metabolomic studies. IMS proves to be a valuable tool in overcoming this limitation, allowing detailed exploration of GC metabolism.

      The data presented is highly relevant, especially in light of recent studies suggesting a pivotal role for lipid metabolism in GC B cells. While these studies primarily focus on mitochondrial function, this manuscript uniquely investigates peroxisomes, which are linked to mitochondria and contribute to fatty acid oxidation (FAO). By extending the study of lipid metabolism beyond mitochondria to include peroxisomes, the authors add a critical dimension to our understanding of B cell biology.

      Additionally, the metabolic plasticity of B cells poses challenges for studying metabolism, as genetic deletions from the beginning of B cell development often result in compensatory adaptations. To address this, the authors employ an acute loss-of-function approach using two conditional, cell-type-specific gene inactivation mouse models: one targeting B cells after the establishment of a pre-immune B cell population (Dhrs7b^f/f, huCD20-CreERT2) and the other during the GC reaction (Dhrs7b^f/f; S1pr2-CreERT2). This strategy is elegant and well-suited to studying the role of metabolism in B cell activation.

      Overall, this manuscript is a significant contribution to the field, providing robust evidence for the fundamental role of lipid metabolism during the GC reaction and unveiling a novel function for peroxisomes in B cells.

      We appreciate these positive reactions and response, and agree with the overview and summary of the paper's approaches and strengths.

      However, several major points need to be addressed:

      Major Comments:

      Figures 1 and 2

      The authors conclude, based on the results from these two figures, that PexRAP promotes the homeostatic maintenance and proliferation of B cells. In this section, the authors first use a tamoxifen-inducible full Dhrs7b knockout (KO) and afterwards Dhrs7bΔ/Δ-B model to specifically characterize the role of this molecule in B cells. They characterize the B and T cell compartments using flow cytometry (FACS) and examine the establishment of the GC reaction using FACS and immunofluorescence. They conclude that B cell numbers are reduced, and the GC reaction is defective upon stimulation, showing a reduction in the total percentage of GC cells, particularly in the light zone (LZ).

      The analysis of the steady-state B cell compartment should also be improved. This includes a  more detailed characterization of MZ and B1 populations, given the role of lipid metabolism and lipid peroxidation in these subtypes.

      Suggestions for Improvement:

      B Cell compartment characterization: A deeper characterization of the B cell compartment in non-immunized mice is needed, including analysis of Marginal Zone (MZ) maturation and a more detailed examination of the B1 compartment. This is especially important given the role of specific lipid metabolism in these cell types. The phenotyping of the B cell compartment should also include an analysis of immunoglobulin levels on the membrane, considering the impact of lipids on membrane composition.

      Although the manuscript is focused on post-ontogenic B cell regulation in Ab responses, we believe we will be able to polish a revised manuscript through addition of results of analyses suggested by this point in the review: measurement of surface IgM on and phenotyping of various B cell subsets, including MZB and B1 B cells, to extend the data in Supplemental Fig 1H and I. Depending on the level of support, new immunization experiments to score Tfh and analyze a few of their functional molecules as part of a B cell paper may be feasible.   

      Addendum / update of Sept 2025: We added new data with more on MZB and B1 B cells, surface IgM, and on Tfh populations. 

      GC Response Analysis Upon Immunization: The GC response characterization should include additional data on the T cell compartment, specifically the presence and function of Tfh cells. In Fig. 1H, the distribution of the LZ appears strikingly different. However, the authors have not addressed this in the text. A more thorough characterization of centroblasts and centrocytes using CXCR4 and CD86 markers is needed.

      The gating strategy used to characterize GC cells (GL7+CD95+ in IgD− cells) is suboptimal. A more robust analysis of GC cells should be performed in total B220+CD138− cells.

      We first want to apologize the mislabeling of LZ and DZ in Fig 1H. The greenish-yellow colored region (GL7<sup>+</sup> CD35<sup>+</sup>) indicate the DZ and the cyan-colored region (GL7<sup>+</sup> CD35<sup>+</sup>) indicates the LZ.    Addendum / update of Sept 2025: We corrected the mistake, and added new experimental data using the CD138 marker to exclude preplasmablasts.  

      As a technical note, we experienced high background noise with GL7 staining uniquely with PexRAP deficient (Dhrs7b<sup>f/f</sup>; Rosa26-CreER<sup>T2</sup>) mice (i.e., not WT control mice). The high background noise of GL7 staining was not observed in B cell specific KO of PexRAP (Dhrs7b<sup>f/f</sup>; huCD20-CreER<sup>T2</sup>). Two formal possibilities to account for this staining issue would be if either the expression of the GL7 epitope were repressed by PexRAP or the proper positioning of GL7<sup>+</sup> cells in germinal center region were defective in PexRAPdeficient mice (e.g., due to an effect on positioning cues from cell types other than B cells). In a revised manuscript, we will fix the labeling error and further discuss the GL7 issue, while taking care not to be thought to conclude that there is a positioning problem or derepression of GL7 (an activation antigen on T cells as well as B cells).

      While the gating strategy for an overall population of GC B cells is fairly standard even in the current literature, the question about using CD138 staining to exclude early plasmablasts (i.e., analyze B220<sup>+</sup> CD138<sup>neg</sup> vs B220<sup>+</sup> CD138<sup>+</sup>) is interesting. In addition, some papers like to use GL7<sup>+</sup> CD38<sup>neg</sup> for GC B cells instead of GL7<sup>+</sup> Fas (CD95)<sup>+</sup>, and we thank the reviewer for suggesting the analysis of centroblasts and centrocytes. For the revision, we will try to secure resources to revisit the immunizations and analyze them for these other facets of GC B cells (including CXCR4/CD86) and for their GL7<sup>+</sup> CD38<sup>neg</sup>. B220<sup>+</sup> CD138<sup>-</sup> and B220<sup>+</sup> CD138<sup>+</sup> cell populations. 

      We agree that comparison of the Rosa26-CreERT2 results to those with B cell-specific lossof-function raise a tantalizing possibility that Tfh cells also are influenced by PexRAP. Although the manuscript is focused on post-ontogenic B cell regulation in Ab responses, we hope to add a new immunization experiments that scores Tfh and analyzes a few of their functional molecules could be added to this B cell paper, depending on the ability to wheedle enough support / fiscal resources.  

      Addendum / update of Sept 2025: Within the tight time until lab closure, and limited $$, we were able to do experiments that further reinforced the GC B cell data - including stains for DZ vs LZ sub-subsetting - and analyzed Tfh cells. We were not able to explore changes in functional antigenic markers on the GC B or Tfh cells. 

      The authors claim that Dhrs7b supports the homeostatic maintenance of quiescent B cells in vivo and promotes effective proliferation. This conclusion is primarily based on experiments where CTV-labeled PexRAP-deficient B cells were adoptively transferred into μMT mice (Fig. 2D-F). However, we recommend reviewing the flow plots of CTV in Fig. 2E, as they appear out of scale. More importantly, the low recovery of PexRAP-deficient B cells post-adoptive transfer weakens the robustness of the results and is insufficient to conclusively support the role of PexRAP in B cell proliferation in vivo.

      In the revision, we will edit the text and try to adjust the digitized cytometry data to allow more dynamic range to the right side of the upper panels in Fig. 2E, and otherwise to improve the presentation of the in vivo CTV result. However, we feel impelled to push back respectfully on some of the concern raised here. First, it seems to gloss over the presentation of multiple facets of evidence. The conclusion about maintenance derives primarily from Fig. 2C, which shows a rapid, statistically significant decrease in B cell numbers (extending the finding of Fig. 1D, a more substantial decrease after a bit longer a period). As noted in the text, the rate of de novo B cell production does not suffice to explain the magnitude of the decrease. 

      In terms of proliferation, we will improve presentation of the Methods but the bottom line is that the recovery efficiency is not bad (comparing to prior published work) inasmuch as transferred B cells do not uniformly home to spleen. In a setting where BAFF is in ample supply in vivo, we transferred equal numbers of cells that were equally labeled with CTV and counted B cells. The CTV result might be affected by lower recovered B cell with PexRAP deficiency, generally, the frequencies of CTV<sup>low</sup> divided population are not changed very much. However, it is precisely because of the pitfalls of in vivo analyses that we included complementary data with survival and proliferation in vitro. The proliferation was attenuated in PexRAP-deficient B cells in vitro; this evidence supports the conclusion that proliferation of PexRAP knockout B cells is reduced. It is likely that PexRAP deficient B cells also have defect in viability in vivo as we observed the reduced B cell number in PexRAP-deficient mice. As the reviewer noticed, the presence of a defect in cycling does, in the transfer experiments, limit the ability to interpret a lower yield of B cell population after adoptive transfer into µMT recipient mice as evidence pertaining to death rates. We will edit the text of the revision with these points in mind. 

      In vitro stimulation experiments: These experiments need improvement. The authors have used anti-CD40 and BAFF for B cell stimulation; however, it would be beneficial to also include antiIgM in the stimulation cocktail. In Fig. 2G, CTV plots do not show clear defects in proliferation, yet the authors quantify the percentage of cells with more than three divisions. These plots should clearly display the gating strategy. Additionally, details about histogram normalization and potential defects in cell numbers are missing. A more in-depth analysis of apoptosis is also required to determine whether the observed defects are due to impaired proliferation or reduced survival. 

      As suggested by reviewer, testing additional forms of B cell activation can help explore the generality (or lack thereof) of findings. We plan to test anti-IgM stimulation together with anti-CD40 + BAFF as well as anti-IgM + TLR7/8, and add the data to a revised and final manuscript. 

      Addendum / update of Sept 2025: The revision includes results of new experiments in which anti-IgM was included in the stimulation cocktail, as well as further data on apoptosis and distinguishing impaired cycling / divisions from reduced survival .

      With regards to Fig. 2G (and 2H), in the revised manuscript we will refine the presentation (add a demonstration of the gating, and explicate histogram normalization of FlowJo). 

      It is an interesting issue in bioscience, but in our presentation 'representative data' really are pretty representative, so a senior author is reminded of a comment Tak Mak made about a reduction (of proliferation, if memory serves) to 0.7 x control. [His point in a comment to referees at a symposium related that to a salary reduction by 30% :) A mathematical alternative is to point out that across four rounds of division for WT cells, a reduction to  0.7x efficiency at each cycle means about 1/4 as many progeny.] 

      We will try to edit the revision (Methods, Legends, Results, Discussion] to address better the points of the last two sentences of the comment, and improve the details that could assist in replication or comparisons (e.g., if someone develops a PexRAP inhibitor as potential therapeutic). 

      For the present, please note that the cell numbers at the end of the cultures are currently shown in Fig 2, panel I. Analogous culture results are shown in Fig 8, panels I, J, albeit with harvesting at day 5 instead of day 4. So, a difference of ≥ 3x needs to be explained. As noted above, a division efficiency reduced to 0.7x normal might account for such a decrease, but in practice the data of Fig. 2I show that the number of PexRAP-deficient B cells at day 4 is similar to the number plated before activation, and yet there has been a reasonable amount of divisions. So cell numbers in the culture of mutant B cells are constant because cycling is active but decreased and insufficient to allow increased numbers ("proliferation" in the true sense) as programmed death is increased. In line with this evidence, Fig 8G-H document higher death rates [i.e., frequencies of cleaved caspase3<sup>+</sup> cell and Annexin V<sup>+</sup> cells] of PexRAP-deficient B cells compared to controls. Thus, the in vitro data lead to the conclusion that both decreased division rates and increased death operate after this form of stimulation. 

      An inference is that this is the case in vivo as well - note that recoveries differed by ~3x (Fig. 2D), and the decrease in divisions (presentation of which will be improved) was meaningful but of lesser magnitude (Fig. 2E, F). 

      Reviewer #2 (Public review):

      Summary:

      In this study, Cho et al. investigate the role of ether lipid biosynthesis in B cell biology, particularly focusing on GC B cell, by inducible deletion of PexRAP, an enzyme responsible for the synthesis of ether lipids.

      Strengths:

      Overall, the data are well-presented, the paper is well-written and provides valuable mechanistic insights into the importance of PexRAP enzyme in GC B cell proliferation.

      We appreciate this positive response and agree with the overview and summary of the paper's approaches and strengths. 

      Weaknesses:

      More detailed mechanisms of the impaired GC B cell proliferation by PexRAP deficiency remain to be further investigated. In the minor part, there are issues with the interpretation of the data which might cause confusion for the readers.

      Issues about contributions of cell cycling and divisions on the one hand, and susceptibility to death on the other, were discussed above, amplifying on the current manuscript text. The aggregate data support a model in which both processes are impacted for mature B cells in general, and mechanistically the evidence and work focus on the increased ROS and modes of death. Although the data in Fig. 7 do provide evidence that GC B cells themselves are affected, we agree that resource limitations had militated against developing further evidence about cycling specifically for GC B cells. We will hope to be able to obtain sufficient data from some specific analysis of proliferation in vivo (e.g., Ki67 or BrdU) as well as ROS and death ex vivo when harvesting new samples from mice immunized to analyze GC B cells for CXCR4/CD86, CD38, CD138 as indicated by Reviewer 1. As suggested by Reviewer 2, we will further discuss the possible mechanism(s) by which proliferation of PexRAP-deficient B cells is impaired. We also will edit the text of a revision where to enhance clarity of data interpretation - at a minimum, to be very clear that caution is warranted in assuming that GC B cells will exhibit the same mechanisms as cultures in vitro-stimulated B cells. 

      Addendum / update of Sept 2025: We were able to obtain results of intravital BrdU incorporation into GC B cells to measure cell cycling rates. The revised manuscript includes these results as well as other new data on apoptosis / survival, while deleting the data about CD138 populations whose interpretation was reasonably questioned by the referees.  

      Reviewer #1 (Recommendations for the authors):

      We believe the evidence presented to support the role of PexRAP in protecting B cells from cell death and promoting B cell proliferation is not sufficiently robust and requires further validation in vivo. While the study demonstrates an increase in ether lipid content within the GC compartment, it also highlights a reduction in mature B cells in PexRAP-deficient mice under steady-state conditions. However, the IMS results (Fig. 3A) indicate that there are no significant differences in ether lipid content in the naïve B cell population. This discrepancy raises an intriguing point for discussion: why is PexRAP critical for B cell survival under steady-state conditions?

      We thank the referee for all their care and input, and we agree that further intravital analyses could strengthen the work by providing more direct evidence of impairment of GC B cells in vivo. To revise and improve this manuscript before creation of a contribution of record, we performed new experiments to the limit of available funds and have both (i) added these new data and (ii) sharpened the presentation to correct what we believe to be one inaccurate point raised in the review. 

      (A) Specifically, we immunized mice with a B cell-specific depletion of PexRAP (Dhrs7b<sup>D/D-B</sup> mice) and measured a variety of readouts of the GC B cells' physiology in vivo: proliferation by intravital incorporation of BrdU, ROS in the viable GC B cell gate, and their cell death by annexin V staining directly ex vivo. Consistent with the data with in vitro activated B cells, these analyses showed increased ROS (new - Fig. 7D) and higher frequencies of Annexin V<sup>+</sup> 7AAD<sup>+</sup> in GC B cells (GL7<sup>+</sup> CD38<sup>-</sup> B cell-gate) of immunized Dhrs7b<sup>D/D-B</sup> mice compared with WT controls (huCD20-CreERT2<sup>+/-</sup>, Dhrs7b<sup>+/+</sup>)  (new - Fig. 7E). Collectively, these results indicate that PexRAP aids (directly or indirectly) in controlling ROS in GC B cells and reduces B cell death, likely contributing to the substantially decreased overall GC B cell population. These new data are added to the revised manuscript in Figure 7.  

      Moreover, in each of two independent experiments (each comprising 3 vs 3 immunized mice), BrdU<sup>+</sup> events among GL7<sup>+</sup> CD38<sup>-</sup> (GC B cell)-gated cells were reduced in the B cell-specific PexRAP knockouts compared with WT controls (new, Fig. 7F and Supplemental Fig 6E). This result on cell cycle rates in vivo is presented with caution in the revised manuscript text because the absolute labeling fractions were somewhat different in Expt 1 vs Expt 2. This situation affords a useful opportunity to comment on the culture of "P values" and statistical methods. It is intriguing to consider how many successful drugs are based on research published back when the standard was to interpret a result of this sort more definitively despite a merged "P value" that was not a full 2 SD different from the mean. In the optimistic spirit of the eLife model, it can be for the attentive reader to decide from the data (new, Fig. 7F and Supplemental Fig 6E) whether to interpret the BrdU results more strongly that what we state in the revised text.  

      (B) On the issue of whether or not the loss of PexRAP led to perturbations of the lipidome of B cells prior to activation, we have edited the manuscript to do a better job making this point more clear.  

      We point out to readers that in the resting, pre-activation state abnormalities were detected in naive B cells, not just in activated and GC B cells. In brief, the IMS analysis and LC-MS-MS analysis detected statistically significant differences in some, but not all, the ether phospholipids species in PexRAP deficient cells (some of which was in Supplemental Figure 2 of the original version). 

      With this appropriate and helpful concern having been raised, we realize that this important point merited inclusion in the main figures. We point specifically to a set of phosphatidyl choline ions shown in Fig. 3 (revised - panels A, B, D) of the revised manuscript (PC O-36:5; PC O-38:5; PC O-40:6 and -40:7). 

      For this ancillary record (because a discourse on the limitations of each analysis), we will note issues such as the presence of many non-B cells in each pixel of the IMS analyses (so that some or many "true positives" will fail to achieve a "significant difference") and for the naive B cells, differential rates of synthesis, turnover, and conversion (e.g., addition of another 2-carbon unit or saturation / desaturation of one side-chain). To the extent the concern reflects some surprise and perhaps skepticism that what seem relatively limited differences (many species appear unaffected, etc), we share in the sentiment. But the basic observation is that there are differences, and a reasonable connection between the altered lipid profile and evidence of effects on survival or proliferation (i.e., integration of survival and cell cycling / division). 

      Additionally, it would be valuable to evaluate the humoral response in a T-independent setting. This would clarify whether the role of PexRAP is restricted to GC B cells or extends to activated B cells in general. 

      We agree that this additional set of experiments would be nice and would extend work incrementally by testing the generality of the findings about Ab responses. The practical problem is that money and time ran out while testing important items that strengthen the evidence about GC B cells. 

      Finally, the manuscript would benefit from a thorough revision to improve its readability and clarity. Including more detailed descriptions of technical aspects, such as the specific stimuli and time points used in analyses, would greatly enhance the flow and comprehension of the study. Furthermore, the authors should review figure labeling to ensure consistency throughout the manuscript, and carefully cite the relevant references. For instance, S1PR2 CreERT2 mouse is established by Okada and Kurosaki (Shinnakasu et al ,Nat. Immunol, 2016)

      We appreciate this feedback and comment, inasmuch as both the clarity and scholarship matter greatly to us for a final item of record. For the revision, we have given our best shot to editing the text in the hopes of improved clarity, reduction of discrepancies (helpfully noted in the Minor Comments), and further detail-rich descriptions of procedures. We also edited the figure labeling to give a better consistency. While we note that the appropriate citation of Shinnakasu et al (2016) was ref. #69 of the original and remains as a citation, we have rechecked other referencing and try to use citations with the best relevant references.  

      Minor Comments: The labeling of plots in Fig. 2 should be standardized. For example, in Fig. 2C, D, and G, the same mouse strain is used, yet the Cre+ mouse is labeled differently in each plot. 

      We agree and have tried to tighten up these features in the panels noted as well as more generally (e.g., Fig. 4, 5, 6, 7, 9; consistency of huCD20-CreERT2 / hCD20CreERT2).

      According to the text, the results shown in Fig. 1G and H correspond to a full KO  (Dhrs7b^f/f; Rosa26-CreERT2 mice). However, Fig. 1H indicates that the bottom image corresponds to Dhrs7b^f/f, huCD20-CreERT2 mice (Dhrs7bΔ/Δ -B). 

      We have corrected Fig. 1H to be labeled as Dhrs7b<sup>Δ/Δ</sup> (with the data on Dhrs7b<sup>Δ/Δ-B</sup> presented in Supplemental Figure 4A, which is correctly labeled). Thank you for picking up this error that crept in while using copy/paste in preparation of figure panels and failing to edit out the "-B"!  

      Similarly, the gating strategy for GC cells in the text mentions IgD− cells, while the figure legend refers to total viable B cells. These discrepancies need clarification.

      We believe we located and have corrected this issue in the revised manuscript.   

      Figures 3 and 4. The authors claim that B cell expression of PexRAP is required to  achieve normal concentrations of ether phospholipids. 

      Suggestions for Improvement: 

      Lipid Metabolism Analysis: The analysis in Fig. 3 is generally convincing but could be strengthened by including an additional stimulation condition such as anti-IgM plus antiCD40. In Fig. 4C, the authors display results from the full KO model. It would be helpful to include quantitative graphs summarizing the parameters displayed in the images.

      We have performed new experiments (anti-IgM + anti-CD40) and added the data to the revised manuscript (new - Supplemental Fig. 2H and Supplemental Fig 6, D & F). Conclusions based on the effects are not changed from the original. 

      As a semantic comment and point of scientific process, any interpretation ("claim") can - by definition - only be taken to apply to the conditions of the experiment. Nonetheless, it is inescapable that at least for some ether P-lipids of naive, resting B cells, and for substantially more in B cells activated under the conditions that we outline, B cell expression of PexRAP is required. 

      With regards to the constructive suggestion about a new series of lipidomic analyses, we agree that for activated B cells it would be nice and increase insight into the spectrum of conditions under which the PexRAP-deficient B cells had altered content of ether phospholipids. However, in light of the costs of metabolomic analyses and the lack of funds to support further experiments, and the accuracy of the point as stated, we prioritized the experiments that could fit within the severely limited budget. 

      [One can add that our results provide a premise for later work to analyze a time course after activation, and to perform isotopomer (SIRM) analyses with [13] C-labeled acetate or glucose, so as to understand activation-induced increases in the overall   To revise the manuscript, we did however extrapolate from the point about adding BCR cross-linking to anti-CD40 as a variant form of activating the B cells for measurements of ROS, population growth, and rates of division (CTV partitioning). The results of these analyses, which align with and thereby strengthen the conclusions about these functional features from experiments with anti-CD40 but no anti-IgM, are added to Supplemental Fig 2H and Supplemental Fig 6D, F. 

      Figures 5, 6, and 7

      The authors claim that Dhrs7b in B cells shapes antibody affinity and quantity. They use two mouse models for this analysis: huCD20-CreERT2 and Dhrs7b f/f; S1pr2-CreERT2 mice. 

      Suggestions for Improvement:

      Adaptive immune response characterization: A more comprehensive characterization of the adaptive immune response is needed, ideally using the Dhrs7b f/f; S1pr2-CreERT2 model. This should include: Analysis of the GC response in B220+CD138− cells. Class switch recombination analysis. A detailed characterization of centroblasts, centrocytes, and Tfh populations. Characterization of effector cells (plasma cells and memory cells).

      Within the limits of time and money, we have performed new experiments prompted by this constructive set of suggestions. 

      Specifically, we analyzed the suggested read-outs in the huCD20-CreERT2, Dhrs7b<sup>f/f</sup> model after immunization, recognizing that it trades greater signal-noise for the fact that effects are due to a mix of the impact on B cells during clonal expansion before GC recruitment and activities within the GC. In brief, the results showed that 

      (a) the GC B cell population - defined as CD138<sup>neg</sup> GL7<sup>+</sup> CD38<sup>lo/neg</sup> IgD<sup>neg</sup> B cells - was about half as large for PexRAP-deficient B cells net of any early- or preplasmablasts (CD138<sup>+</sup> events) (new - Fig 5G); 

      (b) the frequencies of pre- / early plasmablasts (CD138<sup>+</sup> GL7<sup>+</sup> CD38<sup>neg</sup>) events (see new - Fig. 6H, I; also, new Supplemental Fig 5D) were so low as to make it unlikely that our data with the S1pr2-CreERT2 model (in Fig 7B, C) would be affected meaningfully by analysis of the CD138 levels;

      (c) There was a modest decrease in centrocytes (LZ) but not centroblasts (DZ) (new - Fig 5H, I) - consistent with the immunohistochemical data of Supplemental Fig. 5A-C). 

      Because of time limitations (the "shelf life" of funds and the lab) and insufficient stock of the S1pr2-CreERT2, Dhrs7b<sup>f/f</sup> mice as well as those that would be needed as adoptive transfer recipients because of S1PR2 expression in (GC-)Tfh, the experiments were performed instead with the huCD20-CreERT2, Dhrs7b<sup>f/f</sup> model. We would also note that using this Cre transgene better harmonizes the centrocyte/centroblast and Tfh data with the existing data on these points in Supplemental Fig. 4. 

      (d) Of note, the analyses of Tfh and GC-Tfh phenotype cells using the huCD20-CreERT2 B cell type-specific inducible Cre system to inactivate Dhrs7b (new - Supplemental Fig 1G-I; which, along with new - Supplemental Fig 5E) provide evidence of an abnormality that must stem from a function or functions of PexRAP in B cells, most likely GC B cells. Specifically, it is known that the GC-Tfh population proliferates and is supported by the GC B cells, and the results of B cell-specific deletion show substantial reductions in Tfh cells (both the GC-Tfh gating and the wider gate for plots of CXCR5/PD-1/ fluorescence of CD4 T cells 

      Timepoint Consistency: The NP response (Fig. 5) is analyzed four weeks postimmunization, whereas SRBC (Supp. Fig. 4) and Fig. 7 are analyzed one week or nine days post-immunization. The NP system analysis should be repeated at shorter timepoints to match the peak GC reaction.

      This comment may stem from a misunderstanding. As diagrammed in Fig. 5A, the experiments involving the NP system were in fact measured at 7 d after a secondary (booster) immunization. That timing is approximately the peak period and harmonizes with the 7 d used for harvesting SRBC-immunized mice. So in fact the data with each system were obtained at a similar time point. Of course the NP experiments involved a second immunization so that many plasma cell and Ab responses derived from memory B cells generated by the primary immunization. However, the field at present is dominated by the view that the vast majority of the GC B cells after this second immunization (which historically we perform with alum adjuvant) are recruited from the naive rather than the memory B cell pool. For the revised manuscript, we have taken care that the Methods, Legend, and Figure provide the information to readers, and expanded the statement of a rationale. 

      It may seem a technicality but under NIH regulations we are legally obligated to try to minimize mouse usage. It also behooves researchers to use funds wisely. In line with those imperatives, we used systems that would simultaneously allow analyses of GC B cells, identification of affinity maturation (which is minimal in our hands at a 7 d time point after primary NP-carrier immunization), and a switched repertoire (also minimal), and where with each immunogen the GC were scored at 7-9 d after immunization (9 d refers to the S1pr2-CreERT2 experiments). Apart from the end of funding, we feel that what little might be learned from performing a series of experiments that involve harvests 7 d after a primary immunization with NP-ovalbumin cannot well be justified. 

      In vitro plasma cell differentiation: Quantification is missing for plasma cell differentiation in vitro (Supp. Fig. 4). The stimulus used should also be specified in the figure legend. Given the use of anti-CD40, differentiation towards IgG1 plasma cells could provide additional insights.

      As suggested by reviewer, we have added the results of quantifying the in vitro plasma cell differentiation in Supplemental Fig 6B. Also, we edited the Methods and Supplemental Figure Legend to give detailed information of in vitro stimulation. 

      Proliferation and apoptosis analysis: The observed defects in the humoral response should be correlated with proliferation and apoptosis analyses, including Ki67 and Caspase markers.

      As suggested by the review, we have performed new experiment and analyzed the frequencies of cell death by annexin V staining, and elected to use intravital uptake of BrdU as a more direct measurement of S phase / cell cycling component of net proliferation. The new results are now displayed in Figure 5 and Supplemental Fig. 5. 

      Western blot confirmation: While the authors have demonstrated the absence of PexRAP protein in the huCD20-CreERT2 model, this has not been shown in GC B cells from the Dhrs7b f/f; S1pr2-CreERT2 model. This confirmation is necessary to validate the efficiency of Dhrs7b deletion.

      We were unable to do this for technical reasons expanded on below. For the revision, we have edited in a bit of text more explicitly to alert readers to the potential impact of counter-selection on interpretation of the findings with GC B cells. Before entering the GC, B cells have undergone many divisions, so if there were major pre-GC counterselection, in all likelihood the GC B cells would PexRAP-sufficient. To recap from the original manuscript and the new data we have added, IMS shows altered lipid profiles in the GC B cells and the literature indicates that the lipids are short-lived, requiring de novo resynthesis. The BrdU, ROS, and annexin V data show that GC B cells are abnormal. Accordingly, abnormal GC B cells represent the parsimonious or straightforward interpretation of the new results with GC-Tfh cell prevalence. 

      While we take these findings together to suggest that counterselection (i.e., a Western result showing normal levels of PexRAP in the GC B cells) seems unlikely, it is formally possible and would mean that the in situ defects of GC B cells arose due to environmental influences of the PexRAP-deficient B cells during the developmental history of the WT B cells observed in the GC. 

      Having noted all that, we understand that concerns about counter-selection are an issue if a reader accepts the data showing that mutant (PexRAP-deficient) B cells tend to proliferate less and die more readily. Indeed, one can speculate that were we also to perform competition experiments in which the Ighb, Cd45.2 B cells (WT or Dhrs7b D/D) are mixed with equal numbers of Igha, Cd45.1 competitors, the differences would become much greater. With this in mind, Western blotting of flow-purified GC B cells might give a sense of how much counter-selection has occurred. 

      That said, the Westerns need at least 2.5 x 10<sup>6</sup> B cells (those in the manuscript used five million, 5  x 10<sup>6</sup>) and would need replication. Taken together with the observation that ~200,000 GC B cells (on average) were measured in each B cell-specific knockout mouse after immunization (Fig. 1, Fig 5) and taking into account yields from sorting, each Western would require some 20-25 tamoxifen-injected ___-CreERT2, Dhrs7b f/f mice, and about half again that number as controls. The expiry of funds prohibited the time and costs of generating that many mice (>70) and flow-purified GC B cells. 

      Figure 8

      The authors claim that Dhrs7b contributes to the modulation of ROS, impacting B cell proliferation.

      Suggestions for Improvement:

      GC ROS Analysis: The in vitro ROS analysis should be complemented by characterizing ROS and lipid peroxidation in the GC response using the Dhrs7b f/f; S1pr2-CreERT2 model. Flow cytometry staining with H2DCFDA, MitoSOX, Caspase-3, and Annexin V would allow assessment of ROS levels and cell death in GC B cells. 

      While subject to some of the same practical limits noted above, we have performed new experiments in line with this helpful input of the reviewer, and added the helpful new data to the revised manuscript. Specifically, in addition to the BrdU and phenotyping analyses after immunization of huCD20-CreER<sup>T2</sup>, Dhrs7b<sup>f/f</sup> mice, DCFDA (ROS), MitoSox, and annexin V signals were measured for GC B cells. Although the mitoSox signals did not significantly differ for PexRAP-deficient GCB, the ROS and annexin V signals were substantially increased. We added the new data to Figure 5 and Supplemental Figure 5. Together with the decreased in vivo BrdU incorporation in GC B cells from Dhrs7b<sup>D/D-B</sup> mice, these results are consistent with and support our hypothesis that PexRAP regulates B cell population growth and GC physiology in part by regulating ROS detoxification, survival and proliferation of B cells.  

      Quantification is missing in Fig. 8E, and Fig. 8F should use clearer symbols for better readability. 

      We added quantification for Fig 8E in Supplemental Fig 6E, and edited the symbols in Fig 8F for better readability.

      Figure 9

      The authors claim that Dhrs7b in B cells affects oxidative metabolism and ER mass. The  results in this section are well-performed and convincing.

      Suggestion for Improvement:

      Based on the results, the discussion should elaborate on the potential role of lipids in antigen presentation, considering their impact on mitochondria and ER function.

      We very much appreciate the praise of the tantalizing findings about oxidative metabolism and ER mass, and will accept the encouragement that we add (prudently) to the Discussion section to make note of the points mentioned by the Reviewer, particularly now that (with their encouragement) we have the evidence that B cell-specific loss of PexRAP (with the huCD20-CreERT2 deletion prior to immunization) resulted in decreased (GC-)Tfh and somewhat lower GC B cell proliferation.  

      Reviewer #2 (Recommendations for the authors):

      The authors should investigate whether PexRAP-deficient GC B cells exhibit increased mitochondrial ROS and cell death ex vivo, as observed in in vitro cultured B cells.

      We very much appreciate the work of the referee and their input. We addressed this helpful recommendation, in essence aligned with points from Reviewer 1, via new experiments (until the money ran out) and addition of data to the manuscript. To recap briefly, we found increased ROS in GC B cells along with higher fractions of annexin V positive cells; intriguingly, increased mtROS (MitoSox signal) was not detected, which contrasts with the results in activated B cells in vitro in a small way. To keep the text focused and not stray too far outside the foundation supported by data, this point may align with papers that provide evidence of differences between pre-GC and GC B cells (for instance with lack of Tfam or LDHA in B cells).    

      It remains unclear whether the impaired proliferation of PexRAP-deficient B cells is primarily due to increased cell death. Although NAC treatment partially rescued the phenotype of reduced PexRAP-deficient B cell number, it did not restore them to control levels. Analysis of the proliferation capacity of PexRAP-deficient B cells following NAC treatment could provide more insight into the cause of impaired proliferation.

      To add to the data permitting an assessment of this issue, we performed new experiments in which B cells were activated (BCR and CD40 cross-linking), cultured, and both the change in population and the CTV partitioning were measured in the presence or absence of NAC. The results, added to the revision as Supplemental Fig 6FH, show that although NAC improved cell numbers for PexRAP-deficient cells relative to controls, this compound did not increase divisions at all. We infer that the more powerful effect of this lipid synthesis enzyme is to promote survival rather than division  capacity. 

      Primary antibody responses were assessed at only one time point (day 20). It would be valuable to examine the kinetics of antibody response at multiple time points (0, 1w, 2w, 3w, for example) to better understand the temporal impact of PexRAP on antibody production.

      We thank the reviewer for this suggestion. While it may be that the kinetic measurement of Ag-specific antibody level across multiple time points would provide an additional mechanistic clue into the of impact PexRAP on antibody production, the end of sponsored funding and imminent lab closure precluded performing such experiments.   

      CD138+ cell population includes both GC-experienced and GC-independent plasma cells (Fig. 7). Enumeration of plasmablasts, which likely consists of both PexRAP-deleted and undeleted cells (Fig. 7D and E), may mislead the readers such that PexRAP is dispensable for plasmablast generation. I would suggest removing these data and instead examining the number of plasmablasts in the experimental setting of Fig. 4A (huCD20-CreERT2-mediated deletion) to address whether PexRAP-deficiency affects plasmablast generation. 

      We have eliminated the figure panels in question, since it is accurate that in the absence of a time-stamping or marking approach we have a limited ability to distinguish plasma cells that arose prior to inactivation of the Dhrs7b gene in B cells. In addition, we performed new experiments that were used to analyze the "early plasmablast" phenotype and added those data to the revision (Supplemental Fig 5D).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary:

      The authors use the theory of planned behavior to understand whether or not intentions to use sex as a biological variable (SABV), as well as attitude (value), subjective norm (social pressure), and behavioral control (ability to conduct behavior), across scientists at a pharmacological conference. They also used an intervention (workshop) to determine the value of this workshop in changing perceptions and misconceptions. Attempts to understand the knowledge gaps were made.

      Strengths:

      The use of SABV is limited in terms of researchers using sex in the analysis as a variable of interest in the models (and not a variable to control). To understand how we can improve on the number of researchers examining the data with sex in the analyses, it is vital we understand the pressure points that researchers consider in their work. The authors identify likely culprits in their analyses. The authors also test an intervention (workshop) to address the main bias or impediments for researchers' use of sex in their analyses. 

      Weaknesses:

      There are a number of assumptions the authors make that could be revisited: 

      (1) that all studies should contain across sex analyses or investigations. It is important to acknowledge that part of the impetus for SABV is to gain more scientific knowledge on females. This will require within sex analyses and dedicated research to uncover how unique characteristics for females can influence physiology and health outcomes. This will only be achieved with the use of female-only studies. The overemphasis on investigations of sex influences limits the work done for women's health, for example, as within-sex analyses are equally important.

      The Sex and Gender Equity in Research (SAGER) guidelines (1) provide guidance that “Where the subjects of research comprise organisms capable of differentiation by sex, the research should be designed and conducted in a way that can reveal sex-related differences in the results, even if these were not initially expected.”.  This is a default position of inclusion where the sex can be determined and analysis assessing for sex related variability in response. This position underpins many of the funding bodies new policies on inclusion.   

      However, we need to place this in the context of the driver of inclusion. The most common reason for including male and female samples is for those studies that are exploring the effect of a treatment and then the goal of inclusion is to assess the generalisability of the treatment effect (exploratory sex inclusion)(2). The second scenario is where sex is included because sex is one of the variables of interest and this situation will arise because there is a hypothesized sex difference of interest (confirmatory sex inclusion).  

      We would argue that the SABV concept was introduced to address the systematic bias of only studying one sex when assessing treatment effect to improve the generalisability of the research.  Therefore, it isn’t directly to gain more scientific knowledge on females.  However, this strategy will highlight when the effect is very different between male and female subjects which will potentially generate sex specific hypotheses.  

      Where research has a hypothesis that is specific to a sex (e.g. it is related to oestrogen levels) it would be appropriate to study only the sex of interest, in this case females. The recently published Sex Inclusive Research Framework gives some guidance here and allows an exemption for such a scenario classifying such proposals “Single sex study justified” (3).

      We have added an additional paragraph to the introduction to clarify the objectives behind inclusion and how this assists the research process. 

      (2) It should be acknowledged that although the variability within each sex is not different on a number of characteristics (as indicated by meta-analyses in rats and mice), this was not done on all variables, and behavioral variables were not included. In addition, across-sex variability may very well be different, which, in turn, would result in statistical sex significance. In addition, on some measures, there are sex differences in variability, as human males have more variability in grey matter volume than females. PMID: 33044802. 

      The manuscript was highlighting the common argument used to exclude the use of females, which is that females are inherently more variable as an absolute truth. We agree there might be situations, where the variance is higher in one sex or another depending on the biology.  We have extended the discussion here to reflect this, and we also linked to the Sex Inclusive Research Framework (3) which highlights that in these situations researchers can utlise this argument provided it is supported with data for the biology of interest. 

      (3) The authors need to acknowledge that it can be important that the sample size is increased when examining more than one sex. If the sample size is too low for biological research, it will not be possible to determine whether or not a difference exists. Using statistical modelling, researchers have found that depending on the effect size, the sample size does need to increase. It is important to bare this in mind as exploratory analyses with small sample size will be extremely limiting and may also discourage further study in this area (or indeed as seen the literature - an exploratory first study with the use of males and females with limited sample size, only to show there is no "significance" and to justify this as an reason to only use males for the further studies in the work. 

      The reviewer raises a common problem: where researchers have frequently argued that if they find no sex differences in a pilot then they can proceed to study only one sex. The SAGER guidelines (1), and now funder guidelines (4, 5), challenge that position. Instead, the expectation is for inclusion as the default in all experiments (exploratory inclusion strategy) to allow generalisable results to be obtained. When the results are very different between the male and female samples, then this can be determined. This perspective shift (2) requires a change in mindset and understanding that the driver behind inclusion is of generalisability not exploration of sex differences. This has been added to the introduction as an additional paragraph exploring the drivers behind inclusion.  

      We agree with the reviewer that if the researcher is interested in sex differences in an effect (confirmatory inclusion strategy, aka sex as a primary variable) then the N will need to be higher.  However, in this situation, one, of course, must have male and female samples in the same experiment to allow the simultaneous exploration to assess the dependency on sex. 

      Reviewer #2 (Public review): 

      Summary:

      The investigators tested a workshop intervention to improve knowledge and decrease misconceptions about sex inclusive research. There were important findings that demonstrate the difficulty in changing opinions and knowledge about the importance of studying both males and females. While interventions can improve knowledge and decrease perceived barriers, the impact was small. 

      Strengths:

      The investigators included control groups and replicated the study in a second population of scientists. The results appear to be well substantiated. These are valuable findings that have practical implications for fields where sex is included as a biological variable to improve rigor and reproducibility. 

      Thank you for assessment and highlighting these strengths.  We appreciate your recognition of the value and practical implications of this work. 

      Weaknesses:

      I found the figures difficult to understand and would have appreciated more explanation of what is depicted, as well as greater space between the bars representing different categories. 

      We have improved the figures and figure legends to improve clarity. 

      Reviewer #3 (Public review):

      Summary:

      This manuscript aims to determine cultural biases and misconceptions in inclusive sex research and evaluate the efficacy of interventions to improve knowledge and shift perceptions to decrease perceived barriers for including both sexes in basic research. 

      Overall, this study demonstrates that despite the intention to include both sexes and a general belief in the importance of doing so, relatively few people routinely include both sexes. Further, the perceptions of barriers to doing so are high, including misconceptions surrounding sample size, disaggregation, and variability of females. There was also a substantial number of individuals without the statistical knowledge to appropriately analyze data in studies inclusive of sex. Interventions increased knowledge and decreased perception of barriers. 

      Strengths:

      (1) This manuscript provides evidence for the efficacy of interventions for changing attitudes and perceptions of research.

      (2) This manuscript also provides a training manual for expanding this intervention to broader groups of researchers.

      Thank you for highlighting these strengths. We appreciate your recognition that the intervention was effect in changing attitudes and perception. We deliberately chose to share the material to provide the resources to allow a wider engagement.  

      Weaknesses:

      The major weakness here is that the post-workshop assessment is a single time point, soon after the intervention. As this paper shows, intention for these individuals is already high, so does decreasing perception of barriers and increasing knowledge change behavior, and increase the number of studies that include both sexes? Similarly, does the intervention start to shift cultural factors? Do these contribute to a change in behavior? 

      Measuring change in behaviour following an intervention is challenging and hence we had implemented an intention score as a proxy for behaviour. We appreciate the benefit of a long-term analysis, but it was beyond the scope of this study and would need a larger dataset size to allow for attrition. We agree that the strategy implemented has weaknesses. We have extended the limitation section in the discussion to include these. 

      Reviewer #1 (Recommendations for the authors):  

      I would ask them to think about alternative explanations and ask for free-form responses, and to revise with the caveats written above - sample size does need to be increased depending on effect size, and that within sex studies are also important. Not all studies should focus on sex influences.  

      The inclusion of the additional paragraph in the introduction to clarify the objective of inclusion and the resulting impact on experimental design should address these recommendations.   

      We have also added the free-form responses as an additional supplementary file.  

      Reviewer #2 (Recommendations for the authors):  

      This is an important set of studies. My only recommendation to improve the data presentation so that it is clear what is depicted and how the analyses were conducted. I know it is in the methods, but reminding the reader would be helpful.  

      We have revisited the figures and included more information in the legends to explain the analysis and improve clarity.   

      Reviewer #3 (Recommendations for the authors):  

      There are parts in the introduction which read as contradictory and as such are confusing - for example, in the 3rd paragraph it states that little progress on sex inclusive research has been made, and in the following sentences it states that the proportion of published studies across sex has improved. The references in these two statements are from the same time range, so has this improved? Or not?  

      The introduction does include a summation statement on the position: “Whilst a positive step forward, this proportion still represents a minority of studies, and notably this inclusion was not associated with an increase in the proportion of studies that included data analysed by sex.” We have reworded the text to ensure it is internally consistent with this summary statement and this should increase clarity.

      In discussing the results, it is sometimes confusing what the percentages mean. For example, "the researchers reported only conducting sex inclusive research in <=55% of their studies over the past 5 years (55% in study 1 general population and 35% study 2 pre-assessment)." Does that mean 55% of people are conducting sex inclusive research, or does this mean only half of their studies? These two options have very different implications.

      We agree that the sentence is confusing and it has been reworded.  

      Addressing long-term assessments in attitude and action (ie, performing sex inclusive research) is a crucial addition, with data if possible, but at least substantive discussion.  

      We have add this to the limitation section in the discussion

      One minor but confusing point is the analogy comparing sex inclusive studies with attending the gym. The point is well taken - knowledge is not enough for behavior change. However, the argument here is that to increase sex inclusive research requires cultural change. To go to the gym, requires motivation.This seems like an oranges-to-lemons comparison (same family, different outcome when you bite into it).

      At the core, both scenarios involve the challenge of changing established habits and cultural norms in action based on knowledge (the right thing to do). The exercise scenario is a primary example provided by the original authors to describe how aspects of the theory of planned behaviour (perceived behavioural control, attitude, and social norms) may influence behavioural change. Understanding which of these aspects may drive or influence change is why we used this framework to understand our study population.  We disagree that is an oranges-to-lemons comparison.

      References

      (1) Heidari S, Babor TF, De Castro P, Tort S, Curno M. Sex and Gender Equity in Research: rationale for the SAGER guidelines and recommended use. Res Integr Peer Rev. 2016;1:2.

      (2) Karp NA. Navigating the paradigm shift of sex inclusive preclinical research and lessons learnt. Commun Biol. 2025;8(1):681.

      (3) Karp NA, Berdoy M, Gray K, Hunt L, Jennings M, Kerton A, et al. The Sex Inclusive Research Framework to address sex bias in preclinical research proposals. Nat Commun. 2025;16(1):3763.

      (4) MRC. Sex in experimental design - Guidance on new requirements https://www.ukri.org/councils/mrc/guidance-for-applicants/policies-and-guidance-forresearchers/sex-in-experimental-design/: UK Research and Innovation; 2022 [

      (5) Clayton JA, Collins FS. Policy: NIH to balance sex in cell and animal studies. Nature. 2014;509(7500):282-3.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      Asthenospermia, characterized by reduced sperm motility, is one of the major causes of male infertility. The "9 + 2" arranged MTs and over 200 associated proteins constitute the axoneme, the molecular machine for flagellar and ciliary motility. Understanding the physiological functions of axonemal proteins, particularly their links to male infertility, could help uncover the genetic causes of asthenospermia and improve its clinical diagnosis and management. In this study, the authors generated Ankrd5 null mice and found that ANKRD5-/- males exhibited reduced sperm motility and infertility. Using FLAG-tagged ANKRD5 mice, mass spectrometry, and immunoprecipitation (IP) analyses, they confirmed that ANKRD5 is localized within the N-DRC, a critical protein complex for normal flagellar motility. However, transmission electron microscopy (TEM) and cryo-electron tomography (cryo-ET) of sperm from Ankrd5 null mice did not reveal significant structural abnormalities.

      Strengths:

      The phenotypes observed in ANKRD5-/- mice, including reduced sperm motility and male infertility, are conversing. The authors demonstrated that ANKRD5 is an N-DRC protein that interacts with TCTE1 and DRC4. Most of the experiments are well designed and executed.

      Weaknesses:

      The last section of cryo-ET analysis is not convincing. "ANKRD5 depletion may impair buffering effect between adjacent DMTs in the axoneme".

      "In WT sperm, DMTs typically appeared circular, whereas ANKRD5-KO DMTs seemed to be extruded as polygonal. (Fig. S9B,D). ANKRD5-KO DMTs seemed partially open at the junction between the A- and B-tubes (Fig. S9B,D)." In the TEM images of 4E, ANKRD5-KO DMTs look the same as WT. The distortion could result from suboptimal sample preparation, imaging or data processing. Thus, the subsequent analyses and conclusions are not reliable.

      Thank you for your valuable advice. To validate the results of cryo-ET, we carefully analyzed the TEM results (previously we only focused on the global "9+2" structure of the axial filament) and found that deletion of ANKRD5 resulted in both normal and deformed DMT morphologies, which was consistent with the results observed by cryo-ET. At the same time, we have added the corresponding text and picture descriptions in the article:

      The text description we added is: “Upon re-examining the TEM data in light of the Cryo-ET findings, similar abnormalities were observed in the TEM images (Fig.4E, Fig. S10B). Notably, both intact and deformed DMT structures were consistently observed in both TEM and STA analyses, with the deformation of the B-tube being more obvious (Fig.4E, Fig. S10). ”

      This paper still requires significant improvements in writing and language refinement. Here is an example: "While N-DRC is critical for sperm motility, but the existence of additional regulators that coordinate its function remains unclear" - ill-formed sentences.

      We appreciate the reviewer’s valuable comment regarding the clarity of our writing. The sentence cited (“While N-DRC is critical for sperm motility, but the existence of additional regulators that coordinate its function remains unclear”) was indeed ill-formed. We have revised it to improve readability and precision. The corrected version now reads:“Although the N-DRC is critical for sperm motility, whether additional regulatory components coordinate its function remains unclear.” We have carefully re-examined the manuscript and refined the language throughout to ensure clarity and conciseness.

      Reviewer #2 (Public review):

      Summary:

      The manuscript investigates the role of ANKRD5 (ANKEF1) as a component of the N-DRC complex in sperm motility and male fertility. Using Ankrd5 knockout mice, the study demonstrates that ANKRD5 is essential for sperm motility and identifies its interaction with N-DRC components through IP-mass spectrometry and cryo-ET. The results provide insights into ANKRD5's function, highlighting its potential involvement in axoneme stability and sperm energy metabolism.

      Strengths:

      The authors employ a wide range of techniques, including gene knockout models, proteomics, cryo-ET, and immunoprecipitation, to explore ANKRD5's role in sperm biology.

      Weaknesses:

      “Limited Citations in Introduction: Key references on the role of N-DRC components (e.g.,DRC2, DRC4) in male infertility are missing, which weakens the contextual background.”

      We appreciate the reviewer’s valuable suggestion. To address this concern, we have added the following sentence in the Introduction:

      “Recent mammalian knockout studies further confirmed that loss of DRC2 or DRC4 results in severe sperm flagellar assembly defects, multiple morphological abnormalities of the sperm flagella (MMAF), and complete male infertility, highlighting their indispensable roles in spermatogenesis and reproduction [31].”

      This addition introduces up-to-date evidence on DRC2 and DRC4 functions in male infertility and strengthens the contextual background as recommended.

      Reviewer #1 (Recommendations for the authors):

      "Male infertility impacts 8%-12% of the global male population, with sperm motility defects contributing to 40%-50% of these cases [2,3]. " Is reference 3 proper? I don't see "sperm motility defects contributing to 40%-50%" of male infertility.

      Thank you for identifying this issue. You are correct—reference 3 does not support the statement about sperm motility defects comprising 40–50% of male infertility cases; it actually states:

      “Male factor infertility is when an issue with the man’s biology makes him unable to impregnate a woman. It accounts for between 40 to 50 percent of infertility cases and affects around 7 percent of men.”

      This was a misunderstanding on my part, and I apologize for the oversight.

      To correct this, we have replaced the statement with more accurate references:

      PMID: 33968937 confirms:

      “Asthenozoospermia accounts for over 80% of primary male infertility cases.”

      PMID: 33191078 defines asthenozoospermia (AZS) as reduced or absent sperm motility and notes it as a major cause of male infertility.

      We have updated the manuscript accordingly:

      In the Significance Statement: “Male infertility affects approximately 8%-12% of men globally, with defects in sperm motility accounting for over 80% of these cases.”

      In the Introduction: “Male infertility affects approximately 8% to 12% of the global male population, with defects in sperm motility accounting for over 80% of these cases[2,3].”

      Thank you again for your careful review and for giving us the opportunity to improve the accuracy of our manuscript.

      "Rather than bypassing the issue with ICSI, infertility from poor sperm motility could potentially be treated or even cured through stimulation of specific signaling pathways or gene therapy." Need references.

      We appreciate the reviewer’s insightful comment. In response, we have added three supporting references to the relevant sentence.

      The first reference (PMID: 39932044) demonstrates that cBiMPs and the PDE-10A inhibitor TAK-063 significantly and sustainably improve motility in human sperm with low activity, including cryopreserved samples, without inducing premature acrosome reaction or DNA damage. The second reference (PMID: 29581387) shows that activation of the PKA/PI3K/Ca²⁺ signaling pathways can reverse reduced sperm motility. The third reference (PMID: 33533741) reports that CRISPR-Cas9-mediated correction of a point mutation in Tex11<sup>PM/Y</sup> spermatogonial stem cells (SSCs) restores spermatogenesis in mice and results in the production of fertile offspring.

      These references provide mechanistic support and demonstrate the feasibility of treating poor sperm motility through targeted pathway modulation or gene therapy, thus reinforcing the validity of our statement.

      "Our findings indicate that ANKRD5 (Ankyrin repeat domain 5; also known as ANK5 or ANKEF1) interacts with N-DRC structure". The full name should be provided the first time ANKRD5 appears. Is ANKRD5 a component of N-DRC or does it interact with N-DRC?

      We thank the reviewer for the valuable suggestion. In response, we have moved the full name “Ankyrin repeat domain 5; also known as ANK5 or ANKEF1” to the abstract where ANKRD5 first appears, and have removed the redundant mention from the main text.

      Based on our experimental data, we consider ANKRD5 to be a novel component of the N-DRC (nexin-dynein regulatory complex), rather than merely an interacting partner. Therefore, we have revised the sentence in the main text to read:

      “Here, we demonstrate that ANKRD5 is a novel N-DRC component essential for maintaining sperm motility.”

      Fig 5E, numbers of TEM images should be added.

      We thank the reviewer for the suggestion. We would like to clarify that Fig. 5E does not contain TEM images, and it is likely that the reviewer was referring to Fig. 4E instead.

      In Fig. 4E, we conducted three independent experiments. In each experiment, 60 TEM cross-sectional images of sperm tails were analyzed for both Ankrd5 knockout and control mice.

      The findings were consistent across all replicates.

      We have updated the figure legend accordingly, which now reads:

      “Transmission electron microscopy (TEM) of sperm tails from control and Ankrd5 KO mice. Cross-sections of the midpiece, principal piece, and end piece were examined. Red dashed boxes highlight regions of interest, and the magnified views of these boxed areas are shown in the upper right corner of each image. In three independent experiments, 20 sperm cross-sections per mouse were analyzed for each group, with consistent results observed.”

      There are random "222" in the references. Please check and correct.

      I sincerely apologize for the errors caused by the reference management software, which resulted in the insertion of random "222" and similar numbering issues in the reference list. I have carefully reviewed and corrected the following problems:

      References 9, 11, 13, 26, 34, 63, and 64 had the number "222" mistakenly placed before the title; these have now been removed. References 15 and 18 had "111" incorrectly inserted before the title; this has also been corrected. Reference 36 had an erroneous "2" before the title and was found to be a duplicate of Reference 32; these have now been merged into a single citation. Additionally, References 22 and 26 were identified as duplicates of the same article and have been consolidated accordingly. 

      All these issues have been resolved to ensure the reference list is accurate and properly formatted.

      Reviewer #2 (Recommendations for the authors):

      The authors have already addressed most of the issues I am concerned about.

      In addition, we have also corrected some errors in the revised manuscript:

      (1) In Figure 3G, the y-axis label was previously marked as “Sperm count in the oviduct (10⁶)”, which has now been corrected to “Sperm count in the oviduct”.

      (2) All p-values have been reformatted to italic lowercase letters to comply with the journal style guidelines.

      Figure 6 Legend: A typographical error in the figure legend has been corrected. The text previously read “(A) The differentially expressed proteins of Ankrd5<sup>+/–</sup> and Ankrd5<sup>+/-</sup> were identified...”. This has now been amended to “(A) The differentially expressed proteins of Ankrd5<sup>+/–</sup> and Ankrd5<sup>+/–</sup> were identified...” to correctly represent the comparison between heterozygous and homozygous knockout groups.

      In the original Figure 4E, we added a zoom-in panel to the image to show the deformed DMT.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #2 (Public review): 

      Summary: 

      The paper describes the high-resolution structure of KdpFABC, a bacterial pump regulating intracellular potassium concentrations. The pump consists of a subunit with an overall structure similar to that of a canonical potassium channel and a subunit with a structure similar to a canonical ATP-driven ion pump. The ions enter through the channel subunit and then traverse the subunit interface via a long channel that lies parallel to the membrane to enter the pump, followed by their release into the cytoplasm. 

      The work builds on the previous structural and mechanistic studies from the authors' and other labs. While the overall architecture and mechanism have already been established, a detailed understanding was lacking. The study provides a 2.1 Å resolution structure of the E1-P state of the transport cycle, which precedes the transition to the E2 state, assumed to be the ratelimiting step. It clearly shows a single K+ ion in the selectivity filter of the channel and in the canonical ion binding site in the pump, resolving how ions bind to these key regions of the transporter. It also resolves the details of water molecules filling the tunnel that connects the subunits, suggesting that K+ ions move through the tunnel transiently without occupying welldefined binding sites. The authors further propose how the ions are released into the cytoplasm in the E2 state. The authors support the structural findings through mutagenesis and measurements of ATPase activity and ion transport by surface-supported membrane (SSM) electrophysiology. 

      Reviewer #3 (Public review): 

      Summary: 

      By expressing protein in a strain that is unable to phosphorylate KdpFABC, the authors achieve structures of the active wildtype protein, capturing a new intermediate state, in which the terminal phosphoryl group of ATP has been transferred to a nearby Asp, and ADP remains covalently bound. The manuscript examines the coupling of potassium transport and ATP hydrolysis by a comprehensive set of mutants. The most interesting proposal revolves around the proposed binding site for K+ as it exits the channel near T75. Nearby mutations to charged residues cause interesting phenotypes, such as constitutive uncoupled ATPase activity, leading to a model in which lysine residues can occupy/compete with K+ for binding sites along the transport pathway. 

      Strengths: 

      The high resolution (2.1 Å) of the current structure is impressive, and allows many new densities in the potassium transport pathway to be resolved. The authors are judicious about assigning these as potassium ions or water molecules, and explain their structural interpretations clearly. In addition to the nice structural work, the mechanistic work is thorough. A series of thoughtful experiments involving ATP hydrolysis/transport coupling under various pH and potassium concentrations bolsters the structural interpretations and lends convincing support to the mechanistic proposal. The SSME experiments are generally rigorous. 

      Weaknesses: 

      The present SSME experiments do not support quantitative comparisons of different mutants, as in Figures 4D and 5E. Only qualitative inferences can be drawn among different mutant constructs. 

      Thank you to both reviewers for your thorough review of our work. We acknowledge the limitations of SSME experiments in quantitative comparison of mutants and have revised the manuscript to address this point. In addition, we have included new ATPase data from reconstituted vesicles which we believe will help to strengthen our contention that both ATPase and transport are equally affected by Val496 mutations.

      Reviewer #2 (Recommendations for the authors): 

      I have a minor editorial comment: 

      Perhaps I am confused. However, in reference to the text in the Results: "Our WT complex displayed high levels of K+-dependent ATPase activity and generated robust transport currents (Fig. 1 - figure suppl. 1).", I do not see either K+-dependency of ATPase activity nor transport currents in Fig. 1 - figure suppl. 1. Perhaps the text needs to be edited for clarity. 

      Thank you for pointing this out. This confusion was caused by our removal of a panel from the revised manuscript, which depicted K+-dependent transport currents. Although this panel is somewhat redundant, given inclusion of raw SSME traces from all the mutants, it has been replaced as Fig. 1 - figure supplement 1F, thus providing a thorough characterization of the preparation used for cryo-EM analysis and supporting the statement quoted by this reviewer.

      Reviewer #3 (Recommendations for the authors): 

      The authors have provided a detailed description of the SSME data collection, and followed rigorous protocols to ensure that the currents measured on a particular sensor remained stable over time. 

      I still have reservations about the direct comparison of transport in the different mutants. Specifically, on page 6, the authors state that "The longer side chain of V496M reduces transport modestly with no effect on ATPase activity. V496R, which introduces positive charge, completely abolishes activity. V496W and V496H reduce both transport and ATPase activity by about half, perhaps due to steric hindrance for the former and partial protonation for the latter." And in figures 4D and 5B, by plotting all of the peak currents on the same graph, the authors are giving the data a quantitative veneer, when these different experiments really aren't directly comparable, especially in the absence of any controls for reconstitution efficiency. 

      In terms of overall conclusions, for the more drastic mutant phenotypes, I think it is completely reasonable to conclude that transport is not observed. But a 2-fold difference could easily result from differences in reconstitution or sensor preparation. My suggestion would be to show example traces rather than a numeric plot in 4D/5E, to convey the qualitative nature of the mutant-to-mutant comparisons, and to re-write the text to acknowledge the shortcomings of mutant-to-mutant comparisons with SSME, and avoid commenting on the more subtle phenotypes, such as modest decreases and reductions by about half. 

      Figure 4, supplement 1. What is S162D? I don't think it is mentioned in the main text. 

      We agree with the reviewer's point that quantitative comparison of different mutants by SSME is compromised by ambiguity in reconstitution. However, we do not think that display of raw SSME currents is an effective way to communicate qualitative effects to the general reader, given the complexity of these data (e.g., distinction between transient binding current seen in V496R and genuine, steady-state transport current seen in WT). So we have taken a compromise approach. To start, we have removed the transport data from the main figure (Fig. 4). Luckily, we had frozen and saved the batch of reconstituted proteoliposomes from Val496 mutants that had been used for transport assays. We therefore measured ATPase activities from these proteoliposomes - after adding a small amount of detergent to prevent buildup of electrochemical gradients (1 mg/ml decylmaltoside which is only slightly more than the critical micelle concentration of 0.87 mg/ml). Differences in ATPase activity from these proteoliposomes were very similar to those measured prior to reconstitution (i.e., data in Fig. 4d) indicating that reconstitution efficiencies were comparable for the various mutants. Furthermore, differences in SSME currents are very similar to these ATPase activities, suggesting that Val496 mutants did not affect energy coupling. These data are shown in the revised Fig. 4 - figure suppl. 1a, along with the SSME raw data and size-exclusion chromatography elution profiles (Fig. 4 - figure suppl. 1b-g). We also altered the text to point out the concern over comparing transport data from different mutants (see below). We hope that this revised presentation adequately supports the conclusion that Val496 mutations - and especially the V496R substitution - influence the passage of K+ through the tunnel without affecting mechanics of the ATP-dependent pump. 

      The paragraph in question now reads as follows (pg. 6-7, with additional changes to legends to Fig. 4 and Fig. 4 - figure suppl. 1):

      "In order to provide experimental evidence for K+ transport through the tunnel, we made a series of substitutions to Val496 in KdpA. This residue resides near the widest part of the tunnel and is fully exposed to its interior (Fig. 4a). We made substitutions to increase its bulk (V496M and V496W) and to introduce charge (V496E, V496R and V496H). We used the AlphaFold-3 artificial intelligence structure prediction program (Jumper et al., 2021) to generate structures of these mutants and to evaluate their potential impact on tunnel dimensions. This analysis predicts that V496W and V496R reduce the radius to well below the 1.4 Å threshold required for passage of K+ or water (Fig. 4c); V496E and V496M also constrict the tunnel, but to a lesser extent. Measurements of ATPase and transport activity (Fig. 4d) show that negative charge (V496E) has no effect. The or a longer side chain of (V496M) reduces transport modestly with have no apparent effect on ATPase activity. V496R, which introduces positive charge, almost completely abolishes activity. V496W and V496H reduce both transport and ATPase activity by about half, perhaps due to steric hindrance for the former and partial protonation for the latter. Transport activity of these mutants was also measured, but quantitative comparisons are hampered by potential inconsistency in reconstitution of proteoliposomes and in preparation of sensors for SSME. To account for differences in reconstitution, we compared ATPase activity and transport currents taken from the same batch of vesicles (Fig. 4 - figure suppl. 1a).  These data show that differences in ATPase activity of proteoliposomes was consistent with differences measured prior to reconstitution (Fig. 4d). Transport activity, which was derived from multiple sensors, mirrored ATPase activity, indicating that the Val496 mutants did not affect energy coupling, but simply modulated turnover rate of the pump."

      S162D was included as a negative control, together with D307A. However, given the inactive mutants discussed in Fig. 5 (Asp582 and Lys586 substitutions), these seem an unnecessary distraction and have been removed from Fig. 4 - figure suppl. 1.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      The study presents significant findings on the role of mitochondrial depletion in axons and its impact on neuronal proteostasis. It effectively demonstrates how the loss of axonal mitochondria and elevated levels of eIF2β contribute to autophagy collapse and neuronal dysfunction. The use of Drosophila as a model organism and comprehensive proteome analysis adds robustness to the findings.

      In this revision, the authors have responded thoughtfully to previous concerns. In particular, they have addressed the need for a quantitative analysis of age-dependent changes in eIF2β and eIF2α. By adding western blot data from multiple time points (7 to 63 days), they show that eIF2β levels gradually increase until middle age, then decline. In milton knockdown flies, this pattern appears shifted, supporting the idea that mitochondrial defects may accelerate aging-related molecular changes. These additions clarify the temporal dynamics of eIF2β and improve the overall interpretation.

      Other updates include appropriate corrections to figures and quantification methods. The authors have also revised some of their earlier mechanistic claims, presenting a more cautious interpretation of their findings.

      Overall, this work provides new insights into how mitochondrial transport defects may influence aging-related proteostasis through eIF2β. The manuscript is now more convincing, and the revisions address the main points raised earlier. I find the updated version much improved.

      Thank you so much for the review, insightful comments and encouragement. We appreciate it.  

      Reviewer #2 (Public review):

      In the manuscript, the authors aimed to elucidate the molecular mechanism that explains neurodegeneration caused by the depletion of axonal mitochondria. In Drosophila, starting with siRNA depletion of milton and Miro, the authors attempted to demonstrate that the depletion of axonal mitochondria induces the defect in autophagy. From proteome analyses, the authors hypothesized that autophagy is impacted by the abundance of eIF2β and the phosphorylation of eIF2α. The authors followed up the proteome analyses by testing the effects of eIF2β overexpression and depletion on autophagy. With the results from those experiments, the authors proposed a novel role of eIF2β in proteostasis that underlies neurodegeneration derived from the depletion of axonal mitochondria, which they suggest accelerates age-dependent changes rather than increasing their magnitude.

      Strong caution is necessary regarding the interpretation of translational regulation resulting from the milton KD. The effect of milton KD on translation appears subtle, if present at all, in the puromycin incorporation experiments in both the initial and revised versions. Additionally, the polysome profiling data in the revised manuscript lack the clear resolution for ribosomal subunits, monosomes, and polysomes that is typically expected in publications.

      Thank you so much for the review and insightful comments. We appreciate it.  

      Reviewer #2 (Recommendations for the authors):

      The revised manuscript demonstrates many improvements. The authors have provided a more comprehensive data set and a more detailed description of their results. Furthermore, their explanation of the Integrated Stress Response (ISR) has been corrected, and this correction is reflected in the data interpretation.

      As in the public review, I maintained my emphasis on the weakness of the claim on suppressed global translation, since the data are the same in the initial and the revised versions.

      Thank you for your review. We understand that further studies will be needed to elucidate the roles on mitochondrial distribution in global translation profile. We will keep working on it. 

      A few suggestions for minor corrections.

      (1) The order of figures in the revised version is disorganized.

      Thank you for pointing it out. We corrected the order. 

      (2) In Figure 1A, mitochondria is bound by milton, and kinesin is bound by Miro. Their roles should be opposite.

      Thank you for pointing it out, and we are sorry for the oversight. We corrected it.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for the authors): 

      I will address here just some minor changes that would improve understanding, reproducibility, or cohesion with the literature.

      (1) It would be good to mention that the prostatic vesicle of this study is named vesicula granulorum in (Steniböck, 1966) and granule vesicle in (Hooge et al, 2007).

      We have now included this (line 90 of our revised manuscript).  

      (2) A slightly more detailed discussion of the germline genes would be interesting. For example, a potential function of pa1b3-2 and cgnl1-2 based on the similarity to known genes or on the conserved domains.

      Pa1b3-2 appears to encode an acetylhydrolase; cgnl1-2 is likely a cingulin family protein involved in cell junctions. However, given the evolutionary distance between acoels and model organisms in whom these genes have been studied, we believe it is premature to speculate on their function without substantial additional work. We believe this work would be more appropriate in a future publication focused on the molecular genetic underpinnings of Hofstenia’s reproductive systems and their development.  

      (3) It is mentioned that the animals can store sperm while lacking a seminal bursa "given that H. miamia can lay eggs for months after a single mating" (line 635) - this could also be self-fertilization, according to the authors' other findings.

      We agree that it is possible this is self-fertilization, and we believe we have represented this uncertainty accurately in the text. However, we do not think this is likely, because self-fertilization manifests as a single burst of egg laying (Fig. 6D). We discuss this in the Results (line 540). 

      (4) A source should be given for the tree in Figure 7B. 

      We have now included this source (line 736), and we apologize for the oversight.  

      (5) Either in the Methods or in the Results section, it would be good to give more details on why actin and FMRFamide and tropomyosin are chosen for the immunohistochemistry studies.

      We have now included more detail in the Methods (line 823). Briefly, these are previously-validated antibodies that we knew would label relevant morphology.

      (6) In the Methods "a standard protocol hematoxylin eosin" is mentioned. Even if this is a fairly common technique, more details or a reference should be provided.

      We have now included more detail, and a reference (lines 766-774).  

      (7) Given the historical placement of Acoela within Platyhelminthes and the fact that the readers might not be very familiar with this group of animals, two passages can be confusing: line 499 and lines 674-678.

      We have edited these sentences to clarify when we mean platyhelminthes, which addresses this confusion.  

      (8) A small addition to Table S1: Amphiscolops langerhansi also presents asexual reproduction through fission ([1], cited in [2]]).

      Thanks. We have included this in Table S1.

      (a) Hanson, E. D. 1960. 'Asexual Reproduction in Acoelous Turbellaria'. The Yale Journal of Biology and Medicine 33 (2): 107-11.

      (b) Hendelberg, Jan, and Bertil Åkesson. 1991. 'Studies of the Budding Process in Convolutriloba Retrogemma (Acoela, Platyhelminthes)'. In Turbellarian Biology: Proceedings of the Sixth International Symposium on the Biology of the Turbellaria, Held at Hirosaki, Japan, 7-12 August 1990, 11-17. Springer. 

      Reviewer #2 (Recommendations for the authors): 

      I do not have any major comments on the manuscript. By default, I feel descriptive studies are a critical part of the advancement of science, particularly if the data are of great quality - as is the case here. The manuscript addresses various topics and describes these adequately. My minor point would be that in some sections, it feels like one could have gone a bit deeper. I highlighted three examples in the weakness section above (deeper analysis of markers for germline; modes of oogenesis/spermatogenesis; or proposed model for sperm storage). For instance, ultrastructural data might have been informative. But as said, I don't see this as a major problem, more a "would have been nice to see".

      We have responded to these points in detail above.

    1. Author response:

      Reviewer #1 (Public review):

      Major Comments:

      (1) The abstract frames progressive renal dysfunction as a "central, disease-modifying feature" in both Gba1b and Parkin models, with systemic consequences including water retention, ionic hypersensitivity, and worsened neuro phenotypes. While the data demonstrates renal degeneration and associated physiological stress, the causal contribution of renal defects versus broader organismal frailty is not fully disentangled. Please consider adding causal experiments (e.g., temporally restricted renal rescue/knockdown) to directly establish kidney-specific contributions.

      We concur that this would help strengthen our conclusions. However, manipulating Gba1b in a tissue-specific manner remains challenging due to its propensity for secretion via extracellular vesicles (ECVs). Leo Pallanck and Marie Davis have elegantly shown that ectopic Gba1b expression in neurons and muscles (tissues with low predicted endogenous expression) is sufficient to rescue major organismal phenotypes. Consistent with this, we have been unable to generate clear tissue-specific phenotypes using Gba1b RNAi.

      We will pursue more detailed time-course experiments of the progression of renal pathology, (water weight, renal stem cell proliferation, redox defects, etc.) with the goal of identifying earlier-onset phenotypes that potentially drive dysfunction.

      (2) The manuscript shows multiple redox abnormalities in Gba1b mutants (reduced whole fly GSH, paradoxical mitochondrial reduction with cytosolic oxidation, decreased DHE, increased lipid peroxidation, and reduced peroxisome density/Sod1 mislocalization). These findings support a state of redox imbalance, but the driving mechanism remains broad in the current form. It is unclear if the dominant driver is impaired glutathione handling or peroxisomal antioxidant/β-oxidation deficits or lipid peroxidation-driven toxicity, or reduced metabolic flux/ETC activity. I suggest adding targeted readouts to narrow the mechanism.

      We agree that we have not yet established a core driver of redox imbalance. Identifying one is likely to be challenging, especially as our RNA-sequencing data from aged Gba1b<sup>⁻/⁻</sup> fly heads (Atilano et al., 2023) indicate that several glutathione S-transferases (GstD2, GstD5, GstD8, and GstD9) are upregulated. We can attempt overexpression of GSTs, which has been elegantly shown by Leo Pallanck to ameliorate pathology in Pink1/Parkin mutant fly brains. However, mechanisms that specifically suppress lipid peroxidation or its associated toxicity, independently of other forms of redox damage, remain poorly understood in Drosophila. Our position is there probably will not be one dominant driver of redox imbalance. Notably, CytB5 overexpression has been shown to reduce lipid peroxidation (Chen et al., 2017), and GstS1 has been reported to conjugate glutathione to the toxic lipid peroxidation product 4-HNE (Singh et al., 2001). Additionally, work from the Bellen lab demonstrated that overexpression of lipases, bmm or lip4, suppresses lipid peroxidation-mediated neurodegeneration (Liu et al., 2015). We will therefore test the effects of over-expressing CytB5, bmm and lip4 in Gba1b<sup>⁻/⁻</sup> flies to help further define the mechanism.

      (3) The observation that broad antioxidant manipulations (Nrf2 overexpression in tubules, Sod1/Sod2/CatA overexpression, and ascorbic acid supplementation) consistently shorten lifespan or exacerbate phenotypes in Gba1b mutants is striking and supports the idea of redox fragility. However, these interventions are broad. Nrf2 influences proteostasis and metabolism beyond redox regulation, and Sod1/Sod2/CatA may affect multiple cellular compartments. In the absence of dose-response testing or controls for potential off-target effects, the interpretation that these outcomes specifically reflect redox dyshomeostasis feels ahead of the data. I suggest incorporating narrower interpretations (e.g., targeting lipid peroxidation directly) to clarify which redox axis is driving the vulnerability.

      We are in agreement that Drosophila Cnc exhibits functional conservation with both Nrf1 and Nrf2, which have well-established roles in proteostasis and lysosomal biology that may exacerbate pre-existing lysosomal defects in Gba1b mutants. In our manuscript, Nrf2 manipulation forms part of a broader framework of evidence, including dietary antioxidant ascorbic acid and established antioxidant effectors CatA, Sod1, and Sod2. Together, these data indicate that Gba1b mutant flies display a deleterious response to antioxidant treatments or manipulations. To further characterise the redox state, we will quantify lipid peroxidation using Bodipy 581/591 and assess superoxide levels via DHE staining under our redox-altering experimental conditions.

      As noted above, we will attempt to modulate lipid peroxidation directly through CytB5 and GstS1 overexpression, acknowledging the caveat that this approach may not fully dissociate lipid peroxidation from other aspects of redox stress. We have also observed detrimental effects of PGC1α on the lifespan of Gba1b<sup>⁻/⁻</sup> flies and will further investigate its impact on redox status in the renal tubules.

      (4) This manuscript concludes that nephrocyte dysfunction does not exacerbate brain pathology. This inference currently rests on a limited set of readouts: dextran uptake and hemolymph protein as renal markers, lifespan as a systemic measure, and two brain endpoints (LysoTracker staining and FK2 polyubiquitin accumulation). While these data suggest that nephrocyte loss alone does not amplify lysosomal or ubiquitin stress, they may not fully capture neuronal function and vulnerability. To strengthen this conclusion, the authors could consider adding functional or behavioral assays (e.g., locomotor performance)

      We will address this suggestion by performing DAM activity assays and climbing assays in the Klf15; Gba1b<sup>⁻/⁻</sup> double mutants.

      (5) The manuscript does a strong job of contrasting Parkin and Gba1b mutants, showing impaired mitophagy in Malpighian tubules, complete nephrocyte dysfunction by day 28, FRUMS clearance defects, and partial rescue with tubule-specific Parkin re-expression. These findings clearly separate mitochondrial quality control defects from the lysosomal axis of Gba1b. However, the mechanistic contrast remains incomplete. Many of the redox and peroxisomal assays are only presented for Gba1b. Including matched readouts across both models (e.g., lipid peroxidation, peroxisome density/function, Grx1-roGFP2 compartmental redox status) would make the comparison more balanced and strengthen the conclusion that these represent distinct pathogenic routes.

      We agree that park<sup>⁻/⁻</sup> mutants have been characterised in greater detail than park<sup>⁻/⁻</sup>. The primary aim of our study was not to provide an exhaustive characterisation of park¹/¹, but rather to compare key shared and distinct mechanisms underlying renal dysfunction. We have included several relevant readouts for park<sup>⁻/⁻</sup> tubules (e.g., Figure 7D and 8H: mito-Grx1-roGFP2; Figure 8J: lipid peroxidation using BODIPY 581/591). To expand our characterisation of park¹/¹ flies, we will express the cytosolic Grx1 reporter and the peroxisomal marker YFP::Pts.

      (6) Rapamycin treatment is shown to rescue several renal phenotypes in Gba1b mutants (water retention, RSC proliferation, FRUMS clearance, lipid peroxidation) but not in Parkin, and mitophagy is not restored in Gba1b. This provides strong evidence that the two models engage distinct pathogenic pathways. However, the therapeutic interpretation feels somewhat overstated. Human relevance should be framed more cautiously, and the conclusions would be stronger with mechanistic markers of autophagy (e.g., Atg8a, Ref(2)p flux in Malpighian tubules) or with experiments varying dose, timing, and duration (short-course vs chronic rapamycin).

      We will measure Atg8a, polyubiquitin, and Ref(2)P levels in Gba1b<sup>⁻/⁻</sup> and park<sup>¹/¹</sup> tubules following rapamycin treatment. In our previous study focusing on the gut (Atilano et al., 2023), we showed that rapamycin treatment increased lysosomal area, as assessed using LysoTracker<sup>TM</sup>. We will extend this analysis to the renal tubules following rapamycin exposure. Another reviewer requested that we adopt more cautious language regarding the clinical translatability of this work, and we will amend this in Version 2.

      (7) Several systemic readouts used to support renal dysfunction (FRUMS clearance, salt stress survival) could also be influenced by general organismal frailty. To ensure these phenotypes are kidney-intrinsic, it would be helpful to include controls such as tissue-specific genetic rescue in Malpighian tubules or nephrocytes, or timing rescue interventions before overt systemic decline. This would strengthen the causal link between renal impairment and the observed systemic phenotypes.

      As noted in our response to point 1, we currently lack reliable approaches to manipulate Gba1b in a tissue-specific manner. However, we agree that it is important to distinguish kidney-intrinsic dysfunction from generalised organismal frailty. In the park model, we have already performed renal cell-autonomous rescue: re-expression of Park specifically in Malpighian tubule principal cells (C42-Gal4) throughout adulthood partially normalises water retention, whereas brain-restricted Park expression has no effect on renal phenotypes. Because rescuing Park only in the renal tubules is sufficient to correct a systemic fluid-handling phenotype in otherwise mutant animals, these findings indicate that the systemic defects are driven, at least in part, by renal dysfunction rather than nonspecific organismal frailty.

      To strengthen this causal link, we will now extend this same tubule-specific Park rescue (C42-Gal4 and the high-fidelity Malpighian tubule driver CG31272-Gal4) to additional systemic readouts raised by the reviewer. Specifically, we will assay FRUMS clearance and salt stress survival in rescued versus non-rescued park mutants to determine whether renal rescue also mitigates these systemic phenotypes.

      Reviewer #2 (Public review):

      (1) The authors claim that: "renal system dysfunction negatively impacts both organismal and neuronal health in Gba1b-/- flies, including autophagic-lysosomal status in the brain." This statement implies that renal impairments drive neurodegeneration. However, there is no direct evidence provided linking renal defects to neurodegeneration in this model. It is worth noting that Gba1b-/- flies are a model for neuronopathic Gaucher disease (GD): they accumulate lipids in their brains and present with neurodegeneration and decreased survival, as shown by Kinghorn et al. (The Journal of Neuroscience, 2016, 36, 11654-11670) and by others, which the authors failed to mention (Davis et al., PLoS Genet. 2016, 12: e1005944; Cabasso et al., J Clin Med. 2019, 8:1420; Kawasaki et al., Gene, 2017, 614:49-55).

      With the caveats noted in the responses below, we show that driving Nrf2 expression using the renal tubular driver C42 results in decreased survival, more extensive renal defects, and increased brain pathology in Gba1b<sup>⁻/⁻</sup> flies, but not in healthy controls. This suggests that a healthy brain can tolerate renal dysfunction without severe pathological consequences. Our findings therefore indicate that in Gba1b<sup>⁻/⁻</sup> flies, there may be an interaction between renal defects and brain pathology. We do not explicitly claim that renal impairments drive neurodegeneration; rather, we propose that manipulations exacerbating renal dysfunction can have organism-wide effects, ultimately impacting the brain.

      The reviewer is correct that our Gba1b<sup>⁻/⁻</sup> fly model represents a neuronopathic GD model with age-related pathology. Indeed, we reproduce the autophagic-lysosomal defects previously reported (Kinghorn et al., 2016) in Figure 5. We agree that the papers cited by the reviewer merit inclusion, and in Version 2 we will incorporate them into the following pre-existing sentence in the Results:

      “The gut and brain of Gba1b<sup>⁻/⁻</sup> flies, similar to macrophages in GD patients, are characterised by enlarged lysosomes (Kinghorn et al., 2016; Atilano et al., 2023).”

      (2) The authors tested brain pathology in two experiments:

      (a) To determine the consequences of abnormal nephrocyte function on brain health, they measured lysosomal area in the brain of Gba1b-/-, Klf15LOF, or stained for polyubiquitin. Klf15 is expressed in nephrocytes and is required for their differentiation. There was no additive effect on the increased lysosomal volume (Figure 3D) or polyubiquitin accumulation (Figure 3E) seen in Gba1b-/- fly brains, implying that loss of nephrocyte viability itself does not exacerbate brain pathology.

      (b) The authors tested the consequences of overexpression of the antioxidant regulator Nrf2 in principal cells of the kidney on neuronal health in Gba1b-/- flies, using the c42-GAL4 driver. They claim that "This intervention led to a significant increase in lysosomal puncta number, as assessed by LysoTrackerTM staining (Figure 5D), and exacerbated protein dyshomeostasis, as indicated by polyubiquitin accumulation and increased levels of the ubiquitin-autophagosome trafficker Ref(2)p/p62 in Gba1b-/- fly brains (Figure 5E). Interestingly, Nrf2 overexpression had no significant effect on lysosomal area or ubiquitin puncta in control brains, demonstrating that the antioxidant response specifically in Gba1b-/- flies negatively impacts disease states in the brain and renal system."Notably, c42-GAL4 is a leaky driver, expressed in salivary glands, Malpighian tubules, and pericardial cells (Beyenbach et al., Am. J. Cell Physiol. 318: C1107-C1122, 2020). Expression in pericardial cells may affect heart function, which could explain deterioration in brain function.

      Taken together, the contribution of renal dysfunction to brain health remains debatable.

      Based on the above, I believe the title should be changed to: Redox Dyshomeostasis Links Renal and Neuronal Dysfunction in Drosophila Models of Gaucher disease. Such a title will reflect the results presented in the manuscript

      We agree that C42-Gal4 is a leaky driver; unfortunately, this was true for all commonly used Malpighian tubule drivers available when we began the study. A colleague has recommended CG31272-Gal4 from the Perrimon lab’s recent publication (Xu et al., 2024) as a high-fidelity Malpighian tubule driver. If it proves to maintain principal-cell specificity throughout ageing in our hands, we will repeat key experiments using this driver.

      (3) The authors mention that Gba1b is not expressed in the renal system, which means that no renal phenotype can be attributed directly to any known GD pathology. They suggest that systemic factors such as circulating glycosphingolipids or loss of extracellular vesicle-mediated delivery of GCase may mediate renal toxicity. This raises a question about the validity of this model to test pathology in the fly kidney. According to Flybase, there is expression of Gba1b in renal structures of the fly.

      Our evidence suggesting that Gba1b is not substantially expressed in renal tissue is based on use of the Gba1b-CRIMIC-Gal4 line, which fails to drive expression of fluorescently tagged proteins in the Malpighian tubules and we have previously shown there is no expression within the nephrocytes with this driver line (Atilano et al., 2023). This does not exclude the possibility that Gba1b functions within the tubules. Notably, Leo Pallanck has provided compelling evidence that Gba1b is present in extracellular vesicles (ECVs) and given the role of the Malpighian tubules in haemolymph filtration, these cells are likely exposed to circulating ECVs. The lysosomal defects observed in Gba1b<sup>⁻/⁻</sup> tubules therefore suggest a potential role for Gba1b in this tissue.  

      John Vaughan and Thomas Clandinin have developed mCherry- and Lamp1.V5-tagged Gba1b constructs. We intend to express these in tissues shown by the Pallanck lab to release ECVs (e.g., neurons and muscle) and examine whether the protein can be detected in the tubules.

      (4) It is worth mentioning that renal defects are not commonly observed in patients with Gaucher disease. Relevant literature: Becker-Cohen et al., A Comprehensive Assessment of Renal Function in Patients With Gaucher Disease, J. Kidney Diseases, 2005, 46:837-844.

      We have identified five references indicating that renal involvement, while rare, does occur in association with GD. We agree that this is a valid citation and will include it in the revised introductory sentence:

      “However, renal dysfunction remains a rare symptom in GD patients (Smith et al., 1978; Chander et al., 1979; Siegel et al., 1981; Halevi et al., 1993).”

      (5) In the discussion, the authors state: "Together, these findings establish renal degeneration as a driver of systemic decline in Drosophila models of GD and PD..." and go on to discuss a brain-kidney axis in PD. However, since this study investigates a GD model rather than a PD model, I recommend omitting this paragraph, as the connection to PD is speculative and not supported by the presented data.

      Our position is that Gba1b<sup>⁻/⁻</sup> represents a neuronopathic Gaucher disease model with mechanistic relevance to PD. The severity of GBA1 mutations correlates with the extent of GBA1/GCase loss of function and, consequently, with increased PD risk. Likewise, biallelic park<sup>⁻/⁻</sup> mutants cause a severe and heritable form of PD, and the Drosophila park<sup>⁻/⁻</sup> model is a well-established and widely recognised system that has been instrumental in elucidating how Parkin and Pink1 mutations drive PD pathogenesis.

      We therefore see no reason to omit this paragraph. While some aspects are inherently speculative, such discussion is appropriate and valuable when addressing mechanisms underlying a complex and incompletely understood disease, provided interpretations remain measured. At no point do we claim that our work demonstrates a direct brain-renal axis. Rather, our data indicate that renal dysfunction is a disease-modifying feature in these models, aligning with emerging epidemiological evidence linking PD and renal impairment.

      (6) The claim: "If confirmed, our findings could inform new biomarker strategies and therapeutic targets for GBA1 mutation carriers and other at-risk groups. Maintaining renal health may represent a modifiable axis of intervention in neurodegenerative disease," extends beyond the scope of the experimental evidence. The authors should consider tempering this statement or providing supporting data.

      (7) The conclusion, "we uncover a critical and previously overlooked role for the renal system in GD and PD pathogenesis," is too strong given the data presented. As no mechanistic link between renal dysfunction and neurodegeneration has been established, this claim should be moderated.

      We agree that these sections may currently overstate our findings. In Version 2, we will revise them to ensure our claims remain balanced, while retaining the key points that arise from our data and clearly indicating where conclusions require confirmation (“if confirmed”) or additional study (“warrants further investigation”).

      “If confirmed, our findings could inform new biomarker strategies and therapeutic targets for patients with GD and PD. Maintaining renal health may represent a modifiable axis of intervention in these diseases.”

      “We uncover a notable and previously underappreciated role for the renal system in GD and PD, which now warrants further investigation.”

      (8) The relevance of Parkin mutant flies is questionable, and this section could be removed from the manuscript.

      We intend to include the data for the Parkin loss-of-function mutants, as these provide essential support for the PD-related findings discussed in our manuscript. To our knowledge, this represents the first demonstration that Parkin mutants display defects in Malpighian tubule function and water homeostasis. We therefore see no reason to remove these findings. Furthermore, as Reviewer 1 specifically requested additional experiments using the Park fly model, we plan to incorporate these analyses in the revised manuscript.

      Minor comments:

      (1)  Figure 1G: The FRUMS assay is not shown for Gba1b-/- flies.

      The images in Figure 1G illustrate representative stages of dye clearance. We have quantified the clearance time course for both genotypes. During this process, the tubules of Gba1b<sup>⁻/⁻</sup> flies, similar to controls, sequentially resemble each of the three example images. As the Gba1b<sup>⁻/⁻</sup> tubules appear morphologically identical to controls, differing only in population-level clearance dynamics, we do not feel that including additional example images would provide further informative value.

      (2) In panels D and F of Figure 2, survival of control and Gba1b-/- flies in the presence of 4% NaCl is presented. However, longevity is different (up to 10 days in D and ~3 days in F for control). The authors should explain this.

      We agree. In our experience, feeding-based stress survival assays show considerable variability between experiments, and we therefore interpret results only within individual experimental replicates. We have observed similar variability in oxidative stress, starvation, and xenobiotic survival assays, which may reflect batch-specific or environmental effects.

      (3) In Figure 7F, the representative image does not correspond to the quantification; the percentage of endosome-negative nephrocytes seems to be higher for the control than for the park1/1 flies. Please check this.

      The example images are correctly oriented. Typically, an endosome-negative nephrocyte shows no dextran uptake, whereas an endosome-positive nephrocyte displays a ring of puncta around the cell periphery. In park¹/¹ mutants, dysfunctional nephrocytes exhibit diffuse dextran staining throughout the cell, accompanied by diffuse DAPI signal, indicating a complete loss of membrane integrity and likely cell death. We have 63× images from the preparations shown in Figure 7F demonstrating this. In Version 2, we will include apical and medial z-slices of the nephrocytes to illustrate these findings (to be added as supplementary   data).

      (4) In Figure 7H, the significance between control and park1/1 flies in the FRUMS assay is missing.

      We observe significant dye clearance from the haemolymph; however, the difference in complete clearance from the tubules does not reach statistical significance. This may speculatively reflect alterations in specific aspects of tubule function, where absorption and transcellular flux are affected, but subsequent clearance from the tubule lumen remains intact. We do not feel that our current data provide sufficient resolution to draw detailed conclusions about tubule physiology at this level.

      Reviewer #3 (Public review):

      Weaknesses:

      The paper relies mostly on the biallelic Gba1b mutant, which may reflect dysfunction in Gaucher's patients, though this has yet to be fully explored. The claims for the heterozygous allele and a role in Parkinson's is a little more tenuous, making assumptions that heterozygosity is a similar but milder phenotype than the full loss-of-function.

      We agree with the reviewer that studying heterozygotes may provide valuable insight into GBA1-associated PD. We will therefore assess whether subtle renal defects are detectable in Gba1b<sup>⁻/⁻</sup> heterozygotes. We clearly state that GBA1 mutations act as a risk factor for PD rather than a Mendelian inherited cause. Consistent with findings from Gba heterozygous mice, Gba1b<sup>⁻/⁻</sup> flies display minimal phenotypes (Kinghorn et al. 2016), and any observable effects are expected to be very mild and age dependent.

      (1) Figure 1c, the loss of stellate cells. What age are the MTs shown? Is this progressive or developmental?

      These experiments were conducted on flies that were three weeks of age, as were all manipulations unless otherwise stated. We will ensure that this information is clearly indicated in the figure legends in Version 2. We did not observe changes in stellate cell number at three days of age, and this result will be included in the supplementary material in Version 2. Our data therefore suggest that this is a progressive phenotype.

      (2) I might have missed this, but for Figure 3, do the mutant flies start with a similar average weight, or are they bloated?

      We will perform an age-related time course of water weight in response to Reviewer 1’s comments. For all experiments, fly eggs are age-matched and seeded below saturation density to ensure standardised conditions. Gba1b mutant flies do not exhibit any defects in body size or timing of eclosion.

      (3) On 2F, add to the graph that 4% NaCl (or if it is KCL) is present for all conditions, just to make the image self-sufficient to read.

      Many thanks for the suggestion. We agree that this will increase clarity and will make this amendment in Version 2 of the manuscript

      (4) P13 - rephrase, 'target to either the mitochondria or the cytosol' (as it is phrased, it sounds as though you are doing both at the same time).

      We agree and we plan to revise the sentence as follows:

      Original:

      “To further evaluate the glutathione redox potential (E<sub>GSH</sub>) in MTs, we utilised the redox-sensitive green, fluorescent biosensor Grx1-roGFP2, targeted to both the mitochondria and cytosol (Albrecht et al., 2011).”

      Revised:

      “To further evaluate the glutathione redox potential (E<sub>GSH</sub>) in MTs, we utilised the redox-sensitive fluorescent biosensor Grx1-roGFP2, targeted specifically to either the mitochondria or the cytosol using mito- or cyto-tags, respectively (Albrecht et al., 2011).”

      (5) In 6F - the staining appears more intense in the Park mutant - perhaps add asterisks or arrowheads to indicate the nephrocytes so that the reader can compare the correct parts of the image?

      Reviewer 2 reached the same interpretation. Typically, an endosome-negative nephrocyte shows no dextran uptake, whereas an endosome-positive nephrocyte displays a ring of puncta around the cell periphery. In park¹/¹ mutants, dysfunctional nephrocytes exhibit diffuse dextran staining throughout the cell, accompanied by diffuse DAPI signal, indicative of a complete loss of membrane integrity and likely cell death. We have 63× images from the preparations shown in Figure 7F demonstrating this, and in Version 2 we will include apical and medial z-slices of the nephrocytes to illustrate these findings (to be added as supplementary data).

      (6) In the main results text - need some description/explanation of the SOD1 v SOD2 distribution (as it is currently understood) in the cell - SOD2 being predominantly mitochondrial. This helps arguments later on.

      Thank you for this suggestion. We plan to amend the text as follows:

      “Given that Nrf2 overexpression shortens lifespan in Gba1b<sup>⁻/⁻</sup> flies, we investigated the effects of overexpressing its downstream antioxidant targets, Sod1, Sod2, and CatA, both ubiquitously using the tub-Gal4 driver and with c42-Gal4, which expresses in PCs.”

      to:

      “Given that Nrf2 overexpression shortens lifespan in Gba1b<sup>⁻/⁻</sup> flies, we investigated the effects of overexpressing its downstream antioxidant targets, Sod1, Sod2, and CatA, both ubiquitously using the tub-Gal4 driver and with c42-Gal4, which expresses in PCs. Sod1 and CatA function primarily in the cytosol and peroxisomes, whereas Sod2 is localised to the mitochondria. Sod1 and Sod2 catalyse the dismutation of superoxide radicals to hydrogen peroxide, while CatA subsequently degrades hydrogen peroxide to water and oxygen.”

      (7) Figure 1G, what age are the flies? Same for 3D and E, 4C,D,E, 5B - please check the ages of flies for all of the imaging figures; this information appears to have been missed out.

      As stated above, all experiments were conducted on three-week-old flies unless otherwise specified. In Version 2 of the manuscript, we will ensure this information is included consistently in the figure legends to prevent any potential confusion.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Specifically, the authors need to define the DFG conformation using criteria accepted in the field, for example, see https://klifs.net/index.php.

      We thank the reviewer for this suggestion. In the manuscript, we use pseudodihedral and bond angle-based DFG definitions that have been previously established by literature cited in the study (re-iterated below) to unambiguously define the side-chain conformational states of the DFG motif. As we are interested in the specific mechanics of DFG flips under different conditions, we’ve found that the descriptors defined below are sufficient to distinguish between DFG states and allow a more direct comparison with previously-reported results in the literature using different methods.

      We amended the text to be more clear as to those definitions and their choice:

      DFG angle definitions:

      Phe382/Cg, Asp381/OD2, Lys378/O

      Source: Structural Characterization of the Aurora Kinase B "DFG-flip" Using Metadynamics. Lakkaniga NR, Balasubramaniam M, Zhang S, Frett B, Li HY. AAPS J. 2019 Dec 18;22(1):14. doi: 10.1208/s12248-019-0399-6. PMID: 31853739; PMCID: PMC7905835.

      “Finally, we chose the angle formed by Phe382's gamma carbon, Asp381's protonated side chain oxygen (OD2), and Lys378's backbone oxygen as PC3 based on observations from a study that used a similar PC to sample the DFG flip in Aurora Kinase B using metadynamics \cite{Lakkaniga2019}. This angular PC3 should increase or decrease (based on the pathway) during the DFG flip, with peak differences at intermediate DFG configurations, and then revert to its initial state when the flip concludes.”

      DFG pseudodihedral definitions:

      Ala380/Cb, Ala380/Ca, Asp381/Ca, Asp381/Cg

      Ala380/Cb, Ala380/CA, Phe382/CA, Phe382Cg

      Source: Computational Study of the “DFG-Flip” Conformational Transition in c-Abl and c-Src Tyrosine Kinases. Yilin Meng, Yen-lin Lin, and Benoît Roux The Journal of Physical Chemistry B 2015 119 (4), 1443-1456 DOI: 10.1021/jp511792a

      “For downstream analysis, we used two pseudodihedrals previously defined in the existing Abl1 DFG flip simulation literature \cite{Meng2015} to identify and discriminate between DFG states. The first (dihedral 1) tracks the flip state of Asp381, and is formed by the beta carbon of Ala380, the alpha carbon of Ala380, the alpha carbon of Asp381, and the gamma carbon of Asp381. The second (dihedral 2) tracks the flip state of Phe382, and is formed by the beta carbon of Ala380, the alpha carbon of Ala380, the alpha carbon of Phe381, and the gamma carbon of Phe381. These pseudodihedrals, when plotted in relation to each other, clearly distinguish between the initial DFG-in state, the target DFG-out state, and potential intermediate states in which either Asp381 or Phe381 has flipped.”

      Convergence needs to be demonstrated for estimating the population difference between different conformational states.

      We agree that demonstrating convergence is important for accurate estimations of population differences between conformational states. However, as the DFG flip is a complex and concerted conformational change with an energy barrier of 30 kcal/mol [1], and considering the traditional limitations of methods like weighted ensemble molecular dynamics (WEMD), it would take an unrealistic amount of GPU time (months) to observe convergence in our simulations. As discussed in the text (see examples below), we caveat our energy estimations by explicitly mentioning that the state populations we report are not converged and are indicative of a much larger energy barrier in the mutant.

      “These relative probabilities qualitatively agree with the large expected free energy barrier for the DFG-in to DFG-out transition (~32 kcal/mol), and with our observation of a putative metastable DFG-inter state that is missed by NMR experiments due to its low occupancy.”

      “As an important caveat, it is unlikely that the DFG flip free energy barriers of over 70 kcal/mol estimated for the Abl1 drug-resistant variants quantitatively match the expected free energy barrier for their inactivation. Rather, our approximate free energy barriers are a symptom of the markedly increased simulation time required to sample the DFG flip in the variants relative to the wild-type, which is a strong indicator of the drastically reduced propensity of the variants to complete the DFG flip. Although longer WE simulations could allow us to access the timescales necessary for more accurately sampling the free energy barriers associated with the DFG flip in Abl1's drug-resistant compound mutants, the computational expense of running WE for 200 iterations is already large (three weeks with 8 NVIDIA RTX3900 GPUs for one replicate); this poses a logistical barrier to attempting to sample sufficient events to be able to fully characterize how the reaction path and free energy barrier change for the flip associated with the mutations. Regardless, the results of our WE simulations resoundingly show that the Glu255Lys/Val and Thr315Ile compound mutations drastically reduce the probability for DFG flip events in Abl1.”

      (1) Conformational states dynamically populated by a kinase determine its function. Tao Xie et al., Science 370, eabc2754 (2020). DOI:10.1126/science.abc2754

      The DFG flip needs to be sampled several times to establish free energy difference.

      Our simulations have captured thousands of correlated and dozens of uncorrelated DFG flip events. The per-replicate free energy differences are computed based on the correlated transitions. Please consult the WEMD literature (referenced below and in the manuscript, references 34 and 36) for more information on how WEMD allows the sampling of multiple such events and subsequent estimation of probabilities:

      Zuckermann et al (2017) 10.1146/annurev-biophys-070816-033834

      Chong et al (2021) 10.1021/acs.jctc.1c01154

      The free energy plots do not appear to show an intermediate state as claimed.

      Both the free energy plots and the representative/anecdotal trajectories analyzed in the study show a saddle point when Asp381 has flipped but Phe382 has not (which defines the DFG-inter state), we observe a distinct change in probability when going to the pseudodihedral values associated with DFG-inter to DFG-up or DFG-out. We removed references to the putative state S1 as we we agree with the reviewer that its presence is unlikely given the data we show.

      The trajectory length of 7 ns in both Figure 2 and Figure 4 needs to be verified, as it is extremely short for a DFG flip that has a high free energy barrier.

      We appreciate this point. To clarify, the 7 ns segments corresponds to a collated trajectory extracted from the tens of thousands of walkers that compose the WEMD ensemble, and represent just the specific moment at which the dihedral flips occur rather than the entire flip process. On average, our WEMD simulations sample over 3 us of aggregate simulation time before the first DFG flip event is observed, in line with a high energy barrier. This is made clear in the manuscript excerpt below: “Over an aggregate simulation time of over 20 $\mu$s, we have collected dozens of uncorrelated and unbiased inactivation events, starting from the lowest energy conformation of the Abl1 kinase core (PDB 6XR6) \cite{Xie2020}.”

      The free energy scale (100 kT) appears to be one order of magnitude too large.

      As discussed in the text and quoted in response to comment 2, the exponential splitting nature of WEMD simulations (where the probability of individual walkers are split upon crossing each bin threshold) often leads to unrealistically high energy barriers for rare events. This is not unexpected, and as discussed in the text, we consider that value to be a qualitative measurement of the decreased probability of a DFG flip in Abl1 mutants, and not a direct measurement of energy barriers.

      Setting the DFG-Asp to the protonated state is not justified, because in the DFG-in state, the DFG-Asp is clearly deprotonated.

      According to previous publications, DFG-Asp is frequently protonated in the DFG-in state of Abl1 kinase. For instance, as quoted from Hanson, Chodera, et al., Cell Chem Bio (2019), “C onsistent with previous simulations on the DFG-Asp-out/in interconversion of Abl kinase we only observe the DFG flip with protonated Asp747 ( Shan et al., 2009 ). We showed previously that the pKa for the DFG-Asp in Abl is elevated at 6.5.”

      Finally, the authors should discuss their work in the context of the enormous progress made in theoretical studies and mechanistic understanding of the conformational landscape of protein kinases in the last two decades, particularly with regard to the DFG flip. and The study is not very rigorous. The major conclusions do not appear to be supported. The claim that it is the first unbiased simulation to observe DFG flip is not true. For example, Hanson, Chodera et al (Cell Chem Biol 2019), Paul, Roux et al (JCTC 2020), and Tsai, Shen et al (JACS 2019) have also observed the DFG flip.

      We thank the reviewer for pointing out these issues. We have revised the manuscript to better contextualize our claims within the limitations of the method and to acknowledge previous work by Hanson, Chodera et al., Paul, Roux et al., and Tsai, Shen et al.

      The updated excerpt is described below

      “Through our work, we have simulated an ensemble of DFG flip pathways in a wild-type kinase and its variants with atomistic resolution and without the use of biasing forces, also reporting the effects of inhibitor-resistant mutations in the broader context of kinase inactivation likelihood with such level of detail. “

      Reviewer #2:

      I appreciated the discussion of the strengths/weaknesses of weighted ensemble simulations. Am I correct that this method doesn't do anything to explicitly enhance sampling along orthogonal degrees of freedom? Maybe a point worth mentioning if so.

      Yes, this is correct. We added a sentence to WEMD summary section of Results and Discussion discussing it.

      “As a supervised enhanced sampling method, WE employs progress coordinates (PCs) to track the time-dependent evolution of a system from one or more basis states towards a target state. Although weighted ensemble simulations are unbiased in the sense that no biasing forces are added over the course of the simulations, the selection of progress coordinates and the bin definitions can potentially bias the results towards specific pathways \cite{Zuckerman2017}. Additionally, traditional WEMD simulations do not explicitly enhance sampling along orthogonal degrees of freedom (those not captured by the progress coordinates). In practice, this means that insufficient PC definitions can lead to poor sampling.”

      I don't understand Figure 3C. Could the authors instead show structures corresponding to each of the states in 3B, and maybe also a representative structure for pathways 1 and 2?

      We have remade Figure 3. We removed 3B and accompanying discussion as upon review we were not confident on the significance of the LPATH results where it pertains to the probability of intermediate states. We replaced 3B with a summary of the pathways 1 and 2 in regards to the Phe382 flip (which is the most contrasting difference).

      Why introduce S1 and DFG-inter? And why suppose that DFG-inter is what corresponds to the excited state seen by NMR?

      As a consequence of dropping the LPATH analysis, we also removed mentions to S1 as it further analysis made it hard to distinguish from DFG-in, For DFG-inter, we mention that conformation because (a) it is shared by both flipping mechanisms that we have found, and (b) it seems relevant for pharmacology, as it has been observed in other kinases such as Aurora B (PDB 2WTV), as Asp381 flipping before Phe382 creates space in the orthosteric kinase pocket which could be potentially targeted by an inhibitor.

      It would be nice to have error bars on the populations reported in Figure 3.

      Agreed, upon review we decided do drop the populations as we were not confident on the significance of the LPATH results where it pertains to the probability of intermediate states.

      I'm confused by the attempt to relate the relative probabilities of states to the 32 kca/mol barrier previously reported between the states. The barrier height should be related to the probability of a transition. The DFG-out state could be equiprobable with the DFG-in state and still have a 32 kcal/mol barrier separating them.

      Thanks for the correction, we agree with the reviewer and have amended the discussion to reflect this. Since we are starting our simulations in the DFG-in state, the probability of walkers arriving in DFG-out in our steady state WEMD simulations should (assuming proper sampling) represent the probability of the transition. We incorrectly associated the probability of the DFG-out state itself with the probability of the transition.

      How do the relative probabilities of the DFG-in/out states compare to experiments, like NMR?

      Previous NMR work has found the population of apo DFG in (PDB 6XR6) in solution to be around 88% for wild-type ABL1, and 6% for DFG out (PDB 6XR7). The remaining 6% represents post-DFG-out state (PDB 6XRG) where the activation loop has folded in near the hinge, which we did not simulate due to the computational cost associated with it. The same study reports the barrier height from DFG-in to DFG-out to be estimated at around 30 kcal/mol.

      (1) Conformational states dynamically populated by a kinase determine its function. Tao Xie et al., Science 370, eabc2754 (2020). DOI:10.1126/science.abc2754

      (we already have that in the text, just need to quote here)

      “Do the staggered and concerted DFG flip pathways mentioned correspond to pathways 1 and 2 in Figure 3B, or is that a concept from previous literature?”

      Yes, we have amended Figure 3B to be clearer. In previous literature both pathways have been observed [1], although not specifically defined.

      Source: Computational Study of the “DFG-Flip” Conformational Transition in c-Abl and c-Src Tyrosine Kinases. Yilin Meng, Yen-lin Lin, and Benoît Roux The Journal of Physical Chemistry B 2015 119 (4), 1443-1456 DOI: 10.1021/jp511792a

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      Query: In this manuscript, the authors introduce Gcoupler, a Python-based computational pipeline designed to identify endogenous intracellular metabolites that function as allosteric modulators at the G protein-coupled receptor (GPCR) - Gα protein interface. Gcoupler is comprised of four modules:

      I. Synthesizer - identifies protein cavities and generates synthetic ligands using LigBuilder3

      II. Authenticator - classifies ligands into high-affinity binders (HABs) and low-affinity binders (LABs) based on AutoDock Vina binding energies

      III. Generator - trains graph neural network (GNN) models (GCM, GCN, AFP, GAT) to predict binding affinity using synthetic ligands

      IV. BioRanker - prioritizes ligands based on statistical and bioactivity data

      The authors apply Gcoupler to study the Ste2p-Gpa1p interface in yeast, identifying sterols such as zymosterol (ZST) and lanosterol (LST) as modulators of GPCR signaling. Our review will focus on the computational aspects of the work. Overall, we found the Gcoupler approach interesting and potentially valuable, but we have several concerns with the methods and validation that need to be addressed prior to publication/dissemination.

      We express our gratitude to Reviewer #1 for their concise summary and commendation of our work. We sincerely apologize for the lack of sufficient detail in summarizing the underlying methods employed in Gcoupler, as well as its subsequent experimental validations using yeast, human cell lines, and primary rat cardiomyocyte-based assays.

      We wish to state that substantial improvements have been made in the revised manuscript, every section has been elaborated upon to enhance clarity. Please refer to the point-by-point response below and the revised manuscript.

      Query: (1) The exact algorithmic advancement of the Synthesizer beyond being some type of application wrapper around LigBuilder is unclear. Is the grow-link approach mentioned in the methods already a component of LigBuilder, or is it custom? If it is custom, what does it do? Is the API for custom optimization routines new with the Synthesizer, or is this a component of LigBuilder? Is the genetic algorithm novel or already an existing software implementation? Is the cavity detection tool a component of LigBuilder or novel in some way? Is the fragment library utilized in the Synthesizer the default fragment library in LigBuilder, or has it been customized? Are there rules that dictate how molecule growth can occur? The scientific contribution of the Synthesizer is unclear. If there has not been any new methodological development, then it may be more appropriate to just refer to this part of the algorithm as an application layer for LigBuilder.

      We appreciate Reviewer #1's constructive suggestion. We wish to emphasize that

      (1) The LigBuilder software comprises various modules designed for distinct functions. The Synthesizer in Gcoupler strategically utilizes two of these modules: "CAVITY" for binding site detection and "BUILD" for de novo ligand design.

      (2) While both modules are integral to LigBuilder, the Synthesizer plays a crucial role in enabling their targeted, automated, and context-aware application for GPCR drug discovery.

      (3) The CAVITY module is a structure-based protein binding site detection program, which the Synthesizer employs for identifying ligand binding sites on the protein surface.

      (4) The Synthesizer also leverages the BUILD module for constructing molecules tailored to the target protein, implementing a fragment-based design strategy using its integrated fragment library.

      (5) The GROW and LINK methods represent two independent approaches encompassed within the aforementioned BUILD module.

      Author response image 1.

      Schematic representation of the key strategy used in the Synthesizer module of Gcoupler.

      Our manuscript details the "grow-link" hybrid approach, which was implemented using a genetic algorithm through the following stages:

      (1) Initial population generation based on a seed structure via the GROW method.

      (2) Selection of "parent" molecules from the current population for inclusion in the mating pool using the LINK method.

      (3) Transfer of "elite" molecules from the current population to the new population.

      (4) Population expansion through structural manipulations (mutation, deletion, and crossover) applied to molecules within the mating pool.

      Please note, the outcome of this process is not fixed, as it is highly dependent on the target cavity topology and the constraint parameters employed for population evaluation. Synthesizer customizes generational cycles and optimization parameters based on cavity-specific constraints, with the objective of either generating a specified number of compounds or comprehensively exploring chemical diversity against a given cavity topology.

      While these components are integral to LigBuilder, Synthesizer's innovation lies

      (1) in its programmatic integration and dynamic adjustment of these modules.

      (2) Synthesizer distinguishes itself not by reinventing these algorithms, but by their automated coordination, fine-tuning, and integration within a cavity-specific framework.

      (3) It dynamically modifies generation parameters according to cavity topology and druggability constraints, a capability not inherently supported by LigBuilder.

      (4) This renders Synthesizer particularly valuable in practical scenarios where manual optimization is either inefficient or impractical.

      In summary, Synthesizer offers researchers a streamlined interface, abstracting the technical complexities of LigBuilder and thereby enabling more accessible and reproducible ligand generation pipelines, especially for individuals with limited experience in structural or cheminformatics tools.

      Query: (2) The use of AutoDock Vina binding energy scores to classify ligands into HABs and LABs is problematic. AutoDock Vina's energy function is primarily tuned for pose prediction and displays highly system-dependent affinity ranking capabilities. Moreover, the HAB/LAB thresholds of -7 kcal/mol or -8 kcal/mol lack justification. Were these arbitrarily selected cutoffs, or was benchmarking performed to identify appropriate cutoffs? It seems like these thresholds should be determined by calibrating the docking scores with experimental binding data (e.g., known binders with measured affinities) or through re-scoring molecules with a rigorous alchemical free energy approach.

      We again express our gratitude to Reviewer #1 for these inquiries. We sincerely apologize for the lack of sufficient detail in the original version of the manuscript. In the revised manuscript, we have ensured the inclusion of a detailed rationale for every threshold utilized to prioritize high-affinity binders. Please refer to the comprehensive explanation below, as well as the revised manuscript, for further details.

      We would like to clarify that:

      (1) The Authenticator module is not solely reliant on absolute binding energy values for classification. Instead, it calculates binding energies for all generated compounds and applies a statistical decision-making layer to define HAB and LAB classes.

      (2) Rather than using fixed thresholds, the module employs distribution-based methods, such as the Empirical Cumulative Distribution Function (ECDF), to assess the overall energy landscape of the compound set. We then applied multiple statistical tests to evaluate the HAB and LAB distributions and determine an optimal, data-specific cutoff that balances class sizes and minimizes overlap.

      (3) This adaptive approach avoids rigid thresholds and instead ensures context-sensitive classification, with safeguards in place to maintain adequate representation of both classes for downstream model training, and in this way, the framework prioritizes robust statistical reasoning over arbitrary energy cutoffs and aims to reduce the risks associated with direct reliance on Vina scores alone.

      (4) To assess the necessity and effectiveness of the Authenticator module, we conducted a benchmarking analysis where we deliberately omitted the HAB and LAB class labels, treating the compound pool as a heterogeneous, unlabeled dataset. We then performed random train-test splits using the Synthesizer-generated compounds and trained independent models.

      (5) The results from this approach demonstrated notably poorer model performance, indicating that arbitrary or unstructured data partitioning does not effectively capture the underlying affinity patterns. These experiments highlight the importance of using the statistical framework within the Authenticator module to establish meaningful, data-driven thresholds for distinguishing High- and Low-Affinity Binders. The cutoff values are thus not arbitrary but emerge from a systematic benchmarking and validation process tailored to each dataset.

      Please note: While calibrating docking scores with experimental binding affinities or using rigorous methods like alchemical free energy calculations can improve precision, these approaches are often computationally intensive and reliant on the availability of high-quality experimental data, a major limitation in many real-world screening scenarios.

      In summary, the primary goal of Gcoupler is to enable fast, scalable, and broadly accessible screening, particularly for cases where experimental data is sparse or unavailable. Incorporating such resource-heavy methods would not only significantly increase computational overhead but also undermine the framework’s intended usability and efficiency for large-scale applications. Instead, our workflow relies on statistically robust, data-driven classification methods that balance speed, generalizability, and practical feasibility.

      Query: (3) Neither the Results nor Methods sections provide information on how the GNNs were trained in this study. Details such as node features, edge attributes, standardization, pooling, activation functions, layers, dropout, etc., should all be described in detail. The training protocol should also be described, including loss functions, independent monitoring and early stopping criteria, learning rate adjustments, etc.

      We again thank Reviewer #1 for this suggestion. We would like to mention that in the revised manuscript, we have added all the requested details. Please refer to the points below for more information.

      (1) The Generator module of Gcoupler is designed as a flexible and automated framework that leverages multiple Graph Neural Network architectures, including Graph Convolutional Model (GCM), Graph Convolutional Network (GCN), Attentive FP, and Graph Attention Network (GAT), to build classification models based on the synthetic ligand datasets produced earlier in the pipeline.

      (2) By default, Generator tests all four models using standard hyperparameters provided by the DeepChem framework (https://deepchem.io/), offering a baseline performance comparison across architectures. This includes pre-defined choices for node features, edge attributes, message-passing layers, pooling strategies, activation functions, and dropout values, ensuring reproducibility and consistency. All models are trained with binary cross-entropy loss and support default settings for early stopping, learning rate, and batch standardization where applicable.

      (3) In addition, Generator supports model refinement through hyperparameter tuning and k-fold cross-validation (default: 3 folds). Users can either customize the hyperparameter grid or rely on Generator’s recommended parameter ranges to optimize model performance. This allows for robust model selection and stability assessment of tuned parameters.

      (4) Finally, the trained models can be used to predict binding probabilities for user-supplied compounds, making it a comprehensive and user-adaptive tool for ligand screening.

      Based on the reviewer #1 suggestion, we have now added a detailed description about the Generator module of Gcoupler, and also provided relevant citations regarding the DeepChem workflow.

      Query: (4) GNN model training seems to occur on at most 500 molecules per training run? This is unclear from the manuscript. That is a very small number of training samples if true. Please clarify. How was upsampling performed? What were the HAB/LAB class distributions? In addition, it seems as though only synthetically generated molecules are used for training, and the task is to discriminate synthetic molecules based on their docking scores. Synthetic ligands generated by LigBuilder may occupy distinct chemical space, making classification trivial, particularly in the setting of a random split k-folds validation approach. In the absence of a leave-class-out validation, it is unclear if the model learns generalizable features or exploits clear chemical differences. Historically, it was inappropriate to evaluate ligand-based QSAR models on synthetic decoys such as the DUD-E sets - synthetic ligands can be much more easily distinguished by heavily parameterized ligand-based machine learning models than by physically constrained single-point docking score functions.

      We thank reviewer #1 for these detailed technical queries. We would like to clarify that:

      (1) The recommended minimum for the training set is 500 molecules, but users can add as many synthesized compounds as needed to thoroughly explore the chemical space related to the target cavity.

      (2) Our systematic evaluation demonstrated that expanding the training set size consistently enhanced model performance, especially when compared to AutoDock docking scores. This observation underscores the framework's scalability and its ability to improve predictive accuracy with more training compounds.

      (3) The Authenticator module initially categorizes all synthesized molecules into HAB and LAB classes. These labeled molecules are then utilized for training the Generator module. To tackle class imbalance, the class with fewer data points undergoes upsampling. This process aims to achieve an approximate 1:1 ratio between the two classes, thereby ensuring balanced learning during GNN model training.

      (4) The Authenticator module's affinity scores are the primary determinant of the HAB/LAB class distribution, with a higher cutoff for HABs ensuring statistically significant class separation. This distribution is also indirectly shaped by the target cavity's topology and druggability, as the Synthesizer tends to produce more potent candidates for cavities with favorable binding characteristics.

      (5) While it's true that synthetic ligands may occupy distinct chemical space, our benchmarking exploration for different sites on the same receptor still showed inter-cavity specificity along with intra-cavity diversity of the synthesized molecules.

      (6) The utility of random k-fold validation shouldn't be dismissed outright; it provides a reasonable estimate of performance under practical settings where class boundaries are often unknown. Nonetheless, we agree that complementary validation strategies like leave-class-out could further strengthen the robustness assessment.

      (7) We agree that using synthetic decoys like those from the DUD-E dataset can introduce bias in ligand-based QSAR model evaluations if not handled carefully. In our workflow, the inclusion of DUD-E compounds is entirely optional and only considered as a fallback, specifically in scenarios where the number of low-affinity binders (LABs) synthesized by the Synthesizer module is insufficient to proceed with model training.

      (8) The primary approach relies on classifying generated compounds based on their derived affinity scores via the Authenticator module. However, in rare cases where this results in a heavily imbalanced dataset, DUD-E compounds are introduced not as part of the core benchmarking, but solely to maintain minimal class balance for initial model training. Even then, care is taken to interpret results with this limitation in mind. Ultimately, our framework is designed to prioritize data-driven generation of both HABs and LABs, minimizing reliance on synthetic decoys wherever possible.

      Author response image 2.

      Scatter plots depicting the segregation of High/Low-Affinity Metabolites (HAM/LAM) (indicated in green and red) identified using Gcoupler workflow with 100% training data. Notably, models trained on lesser training data size (25%, 50%, and 75% of HAB/LAB) severely failed to segregate HAM and LAM (along Y-axis). X-axis represents the binding affinity calculated using IC4-specific docking using AutoDock.

      Based on the reviewer #1’s suggestion, we have now added all these technical details in the revised version of the manuscript.

      Query: (5) Training QSAR models on docking scores to accelerate virtual screening is not in itself novel (see here for a nice recent example: https://www.nature.com/articles/s43588-025-00777-x), but can be highly useful to focus structure-based analysis on the most promising areas of ligand chemical space; however, we are perplexed by the motivation here. If only a few hundred or a few thousand molecules are being sampled, why not just use AutoDock Vina? The models are trained to try to discriminate molecules by AutoDock Vina score rather than experimental affinity, so it seems like we would ideally just run Vina? Perhaps we are misunderstanding the scale of the screening that was done here. Please clarify the manuscript methods to help justify the approach.

      We acknowledge the effectiveness of training QSAR models on docking scores for prioritizing chemical space, as demonstrated by the referenced study (https://www.nature.com/articles/s43588-025-00777-x) on machine-learning-guided docking screen frameworks.

      We would like to mention that:

      (1) While such protocols often rely on extensive pre-docked datasets across numerous protein targets or utilize a highly skewed input distribution, training on as little as 1-10% of ligand-protein complexes and testing on the remainder in iterative cycles.

      (2) While powerful for ultra-large libraries, this approach can introduce bias towards the limited training set and incur significant overhead in data curation, pre-computation, and infrastructure.

      (3) In contrast, Gcoupler prioritizes flexibility and accessibility, especially when experimental data is scarce and large pre-docked libraries are unavailable. Instead of depending on fixed docking scores from external pipelines, Gcoupler integrates target-specific cavity detection, de novo compound generation, and model training into a self-contained, end-to-end framework. Its QSAR models are trained directly on contextually relevant compounds synthesized for a given binding site, employing a statistical classification strategy that avoids arbitrary thresholds or precomputed biases.

      (4) Furthermore, Gcoupler is open-source, lightweight, and user-friendly, making it easily deployable without the need for extensive infrastructure or prior docking expertise. While not a complete replacement for full-scale docking in all use cases, Gcoupler aims to provide a streamlined and interpretable screening framework that supports both focused chemical design and broader chemical space exploration, without the computational burden associated with deep learning docking workflows.

      (5) Practically, even with computational resources, manually running AutoDock Vina on millions of compounds presents challenges such as format conversion, binding site annotation, grid parameter tuning, and execution logistics, all typically requiring advanced structural bioinformatics expertise.

      (6) Gcoupler's Authenticator module, however, streamlines this process. Users only need to input a list of SMILES and a receptor PDB structure, and the module automatically handles compound preparation, cavity mapping, parameter optimization, and high-throughput scoring. This automation reduces time and effort while democratizing access to structure-based screening workflows for users without specialized expertise.

      Ultimately, Gcoupler's motivation is to make large-scale, structure-informed virtual screening both efficient and accessible. The model serves as a surrogate to filter and prioritize compounds before deeper docking or experimental validation, thereby accelerating targeted drug discovery.

      Query: (6) The brevity of the MD simulations raises some concerns that the results may be over-interpreted. RMSD plots do not reliably compare the affinity behavior in this context because of the short timescales coupled with the dramatic topological differences between the ligands being compared; CoQ6 is long and highly flexible compared to ZST and LST. Convergence metrics, such as block averaging and time-dependent MM/GBSA energies, should be included over much longer timescales. For CoQ6, the authors may need to run multiple simulations of several microseconds, identify the longest-lived metastable states of CoQ6, and perform MM/GBSA energies for each state weighted by each state's probability.

      We appreciate Reviewer #1's suggestion regarding simulation length, as it is indeed crucial for interpreting molecular dynamics (MD) outcomes. We would like to mention that:

      (1) Our simulation strategy varied based on the analysis objective, ranging from short (~5 ns) runs for preliminary or receptor-only evaluations to intermediate (~100 ns) and extended (~550 ns) runs for receptor-ligand complex validation and stability assessment.

      (2) Specifically, we conducted three independent 100 ns MD simulations for each receptor-metabolite complex in distinct cavities of interest. This allowed us to assess the reproducibility and persistence of binding interactions. To further support these observations, a longer 550 ns simulation was performed for the IC4 cavity, which reinforced the 100 ns findings by demonstrating sustained interaction stability over extended timescales.

      (3) While we acknowledge that even longer simulations (e.g., in the microsecond range) could provide deeper insights into metastable state transitions, especially for highly flexible molecules like CoQ6, our current design balances computational feasibility with the goal of screening multiple cavities and ligands.

      (4) In our current workflow, MM/GBSA binding free energies were calculated by extracting 1000 representative snapshots from the final 10 ns of each MD trajectory. These configurations were used to compute time-averaged binding energies, incorporating contributions from van der Waals, electrostatic, polar, and non-polar solvation terms. This approach offers a more reliable estimate of ligand binding affinity compared to single-point molecular docking, as it accounts for conformational flexibility and dynamic interactions within the binding cavity.

      (5) Although we did not explicitly perform state-specific MM/GBSA calculations weighted by metastable state probabilities, our use of ensemble-averaged energy estimates from a thermally equilibrated segment of the trajectory captures many of the same benefits. We acknowledge, however, that a more rigorous decomposition based on metastable state analysis could offer finer resolution of binding behavior, particularly for highly flexible ligands like CoQ6, and we consider this a valuable direction for future refinement of the framework.

      Reviewer #2 (Public review):

      Summary:

      Query: Mohanty et al. present a new deep learning method to identify intracellular allosteric modulators of GPCRs. This is an interesting field for e.g. the design of novel small molecule inhibitors of GPCR signalling. A key limitation, as mentioned by the authors, is the limited availability of data. The method presented, Gcoupler, aims to overcome these limitations, as shown by experimental validation of sterols in the inhibition of Ste2p, which has been shown to be relevant molecules in human and rat cardiac hypertrophy models. They have made their code available for download and installation, which can easily be followed to set up software on a local machine.

      Strengths:

      Clear GitHub repository

      Extensive data on yeast systems

      We sincerely thank Reviewer #2 for their thorough review, summary, and appreciation of our work. We highly value their comments and suggestions.

      Weaknesses:

      Query: No assay to directly determine the affinity of the compounds to the protein of interest.

      We thank Reviewer #2 for raising these insightful questions. During the experimental design phase, we carefully accounted for validating the impact of metabolites in the rescue response by pheromone.

      We would like to mention that we performed an array of methods to validate our hypothesis and observed similar rescue effects. These assays include:

      a. Cell viability assay (FDA/PI Flourometry-based)

      b. Cell growth assay

      c. FUN1<sup>TM</sup>-based microscopy assessment

      d. Shmoo formation assays

      e. Mating assays

      f. Site-directed mutagenesis-based loss of function

      g. ransgenic reporter-based assay

      h. MAPK signaling assessment using Western blot.

      i. And via computational techniques.

      Concerning the in vitro interaction studies of Ste2p and metabolites, we made significant efforts to purify Ste2p by incorporating a His tag at the N-terminal. Despite dedicated attempts over the past year, we were unsuccessful in purifying the protein, primarily due to our limited expertise in protein purification for this specific system. As a result, we opted for genetic-based interventions (e.g., point mutants), which provide a more physiological and comprehensive approach to demonstrating the interaction between Ste2p and the metabolites.

      Author response image 3.

      (a) Affinity purification of Ste2p from Saccharomyces cerevisiae. Western blot analysis using anti-His antibody showing the distribution of Ste2p in various fractions during the affinity purification process. The fractions include pellet, supernatant, wash buffer, and sequential elution fractions (1–4). Wild-type and ste2Δ strains served as positive and negative controls, respectively. (b) Optimization of Ste2p extraction protocol. Ponceau staining (left) and Western blot analysis using anti-His antibody (right) showing Ste2p extraction efficiency. The conditions tested include lysis buffers containing different concentrations of CHAPS detergent (0.5%, 1%) and glycerol (10%, 20%).

      Furthermore, in addition to the clarification above, we have added the following statement in the discussion section to tone down our claims: “A critical limitation of our study is the absence of direct binding assays to validate the interaction between the metabolites and Ste2p. While our results from genetic interventions, molecular dynamics simulations, and docking studies strongly suggest that the metabolites interact with the Ste2p-Gpa1 interface, these findings remain indirect. Direct binding confirmation through techniques such as surface plasmon resonance, isothermal titration calorimetry, or co-crystallization would provide definitive evidence of this interaction. Addressing this limitation in future work would significantly strengthen our conclusions and provide deeper insights into the precise molecular mechanisms underlying the observed phenotypic effects.”

      We request Reviewer #2 to kindly refer to the assays conducted on the point mutants created in this study, as these experiments offer robust evidence supporting our claims.

      Query: In conclusion, the authors present an interesting new method to identify allosteric inhibitors of GPCRs, which can easily be employed by research labs. Whilst their efforts to characterize the compounds in yeast cells, in order to confirm their findings, it would be beneficial if the authors show their compounds are active in a simple binding assay.

      We express our gratitude and sincere appreciation for the time and effort dedicated by Reviewer #2 in reviewing our manuscript. We are confident that our clarifications address the reviewer's concerns.

      Reviewer #3 (Public review):

      Summary:

      Query: In this paper, the authors introduce the Gcoupler software, an open-source deep learning-based platform for structure-guided discovery of ligands targeting GPCR interfaces. Overall, this manuscript represents a field-advancing contribution at the intersection of AI-based ligand discovery and GPCR signaling regulation.

      Strengths:

      The paper presents a comprehensive and well-structured workflow combining cavity identification, de novo ligand generation, statistical validation, and graph neural network-based classification. Notably, the authors use Gcoupler to identify endogenous intracellular sterols as allosteric modulators of the GPCR-Gα interface in yeast, with experimental validations extending to mammalian systems. The ability to systematically explore intracellular metabolite modulation of GPCR signaling represents a novel and impactful contribution. This study significantly advances the field of GPCR biology and computational ligand discovery.

      We thank and appreciate Reviewer #3 for vesting time and efforts in reviewing our manuscript and for appreciating our efforts.

      Recommendations for the authors:

      Reviewing Editor Comments:

      We encourage the authors to address the points raised during revision to elevate the assessment from "incomplete" to "solid" or ideally "convincing." In particular, we ask the authors to improve the justification for their methodological choices and to provide greater detail and clarity regarding each computational layer of the pipeline.

      We are grateful for the editors' suggestions. We have incorporated significant revisions into the manuscript, providing comprehensive technical details to prevent any misunderstandings. Furthermore, we meticulously explained every aspect of the computational workflow.

      Reviewer #2 (Recommendations for the authors):

      Query: Would it be possible to make the package itself pip installable?

      Yes, it already exists under the testpip repository and we have now migrated it to the main pip. Please access the link from here: https://pypi.org/project/gcoupler/

      Query: I am confused by the binding free energies reported in Supplementary Figure 8. Is the total DG reported that of the protein-ligand complex? If that is the case, the affinities of the ligands would be extremely high. They are also very far off from the reported -7 kcal/mol active/inactive cut-off.

      We thank Reviewer #2 for this query. We would like to mention that we have provided a detailed explanation in the point-by-point response to Reviewer #2's original comment. Briefly, to clarify, the -7 kcal/mol active/inactive cutoff mentioned in the manuscript refers specifically to the docking-based binding free energies (ΔG) calculated using AutoDock or AutoDock Vina, which are used for compound classification or validation against the Gcoupler framework.

      In contrast, the binding free energies reported in Supplementary Figure 8 are obtained through the MM-GBSA method, which provides a more detailed and physics-based estimate of binding affinity by incorporating solvation and enthalpic contributions. It is well-documented in the literature that MM-GBSA tends to systematically underestimate absolute binding free energies when compared to experimental values (10.2174/1568026616666161117112604; Table 1).

      Author response image 4.

      Scatter plot comparing the predicted binding affinity calculated by Docking and MM/GBSA methods, against experimental ΔG (10.1007/s10822-023-00499-0)

      Our use of MM-GBSA is not to match experimental ΔG directly, but rather to assess relative binding preferences among ligands. Despite its limitations in predicting absolute affinities, MM-GBSA is known to perform better than docking for ranking compounds by their binding potential. In this context, an MM-GBSA energy value still reliably indicates stronger predicted binding, even if the numerical values appear extremely higher than typical experimental or docking-derived cutoffs.

      Thus, the two energy values, docking-based and MM-GBSA, serve different purposes in our workflow. Docking scores are used for classification and thresholding, while MM-GBSA energies provide post hoc validation and a higher-resolution comparison of binding strength across compounds.

      To corroborate their findings, can the authors include direct binding affinity assays for yeast and human Ste2p? This will help in establishing whether the observed phenotypic effects are indeed driven by binding of the metabolites.

      We thank Reviewer #2 for raising these insightful questions. During the experimental design phase, we carefully accounted for validating the impact of metabolites in the rescue response by pheromone.

      We would like to mention that we performed an array of methods to validate our hypothesis and observed similar rescue effects. These assays include:

      a. Cell viability assay (FDA/PI Flourometry- based)

      b. Cell growth assay

      c. FUN1<sup>TM</sup>-based microscopy assessment

      d. Shmoo formation assays

      e. Mating assays

      f. Site-directed mutagenesis-based loss of function

      g. Transgenic reporter-based assay

      h. MAPK signaling assessment using Western blot.

      i. And via computational techniques.

      Concerning the in vitro interaction studies of Ste2p and metabolites, we made significant efforts to purify Ste2p by incorporating a His tag at the N-terminal. Despite dedicated attempts over the past year, we were unsuccessful in purifying the protein, primarily due to our limited expertise in protein purification for this specific system. As a result, we opted for genetic-based interventions (e.g., point mutants), which provide a more physiological and comprehensive approach to demonstrating the interaction between Ste2p and the metabolites.

      Furthermore, in addition to the clarification above, we have added the following statement in the discussion section to tone down our claims: “A critical limitation of our study is the absence of direct binding assays to validate the interaction between the metabolites and Ste2p. While our results from genetic interventions, molecular dynamics simulations, and docking studies strongly suggest that the metabolites interact with the Ste2p-Gpa1 interface, these findings remain indirect. Direct binding confirmation through techniques such as surface plasmon resonance, isothermal titration calorimetry, or co-crystallization would provide definitive evidence of this interaction. Addressing this limitation in future work would significantly strengthen our conclusions and provide deeper insights into the precise molecular mechanisms underlying the observed phenotypic effects.”

      We request Reviewer #2 to kindly refer to the assays conducted on the point mutants created in this study, as these experiments offer robust evidence supporting our claims.

      Did the authors perform expression assays to make sure the mutant proteins were similarly expressed to wt?

      We thank reviewer #2 for this comment. We would like to mention that:

      (1) In our mutants (S75A, T155D, L289K)-based assays, all mutants were generated using integration at the same chromosomal TRP1 locus under the GAL1 promoter and share the same C-terminal CYC1 terminator sequence used for the reconstituted wild-type (rtWT) construct, thus reducing the likelihood of strain-specific expression differences.

      (2) Furthermore, all strains were grown under identical conditions using the same media, temperature, and shaking parameters. Each construct underwent the same GAL1 induction protocol in YPGR medium for identical durations, ensuring uniform transcriptional activation across all strains and minimizing culture-dependent variability in protein expression.

      (3) Importantly, both the rtWT and two of the mutants (T155D, L289K) retained α-factor-induced cell death (PI and FUN1-based fluorometry and microscopy; Figure 4c-d) and MAPK activation (western blot; Figure 4e), demonstrating that the mutant proteins are expressed at levels sufficient to support signalling.

      Reviewer #3 (Recommendations for the authors):

      My comments that would enhance the impact of this method are:

      (1) While the authors have compared the accuracy and efficiency of Gcoupler to AutoDock Vina, one of the main points of Gcoupler is the neural network module. It would be beneficial to have it evaluated against other available deep learning ligand generative modules, such as the following: 10.1186/s13321-024-00829-w, 10.1039/D1SC04444C.

      Thank you for the observation. To clarify, our benchmarking of Gcoupler’s accuracy and efficiency was performed against AutoDock, not AutoDock Vina. This choice was intentional, as AutoDock is one of the most widely used classical techniques in computer-aided drug design (CADD) for obtaining high-resolution predictions of ligand binding energy, binding poses, and detailed atomic-level interactions with receptor residues. In contrast, AutoDock Vina is primarily optimized for large-scale virtual screening, offering faster results but typically with lower resolution and limited configurational detail.

      Since Gcoupler is designed to balance accuracy with computational efficiency in structure-based screening, AutoDock served as a more appropriate reference point for evaluating its predictions.

      We agree that benchmarking against other deep learning-based ligand generative tools is important for contextualizing Gcoupler’s capabilities. However, it's worth noting that only a few existing methods focus specifically on cavity- or pocket-driven de novo drug design using generative AI, and among them, most are either partially closed-source or limited in functionality.

      While PocketCrafter (10.1186/s13321-024-00829-w) offers a structure-based generative framework, it differs from Gcoupler in several key respects. PocketCrafter requires proprietary preprocessing tools, such as the MOE QuickPrep module, to prepare protein pocket structures, limiting its accessibility and reproducibility. In addition, PocketCrafter’s pipeline stops at the generation of cavity-linked compounds and does not support any further learning from the generated data.

      Similarly, DeepLigBuilder (10.1039/D1SC04444C) provides de novo ligand generation using deep learning, but the source code is not publicly available, preventing direct benchmarking or customization. Like PocketCrafter, it also lacks integrated learning modules, which limits its utility for screening large, user-defined libraries or compounds of interest.

      Additionally, tools like AutoDesigner from Schrödinger, while powerful, are not publicly accessible and hence fall outside the scope of open benchmarking.

      Author response table 1.

      Comparison of de novo drug design tools. SBDD refers to Structure-Based Drug Design, and LBDD refers to Ligand-Based Drug Design.

      In contrast, Gcoupler is a fully open-source, end-to-end platform that integrates both Ligand-Based and Structure-Based Drug Design. It spans from cavity detection and molecule generation to automated model training using GNNs, allowing users to evaluate and prioritize candidate ligands across large chemical spaces without the need for commercial software or advanced coding expertise.

      (2) In Figure 2, the authors mention that IC4 and IC5 potential binding sites are on the direct G protein coupling interface ("This led to the identification of 17 potential surface cavities on Ste2p, with two intracellular regions, IC4 and IC5, accounting for over 95% of the Ste2p-Gpa1p interface (Figure 2a-b, Supplementary Figure 4j-n)..."). Later, however, in Figure 4, when discussing which residues affect the binding of the metabolites the most, the authors didn't perform MD simulations of mutant STE2 and just Gpa1p (without metabolites present). It would be beneficial to compare the binding of G protein with and without metabolites present, as these interface mutations might be affecting the binding of G protein by itself.

      Thank you for this insightful suggestion. While we did not perform in silico MD simulations of the mutant Ste2-Gpa1 complex in the absence of metabolites, we conducted experimental validation to functionally assess the impact of interface mutations. Specifically, we generated site-directed mutants (S75A, L289K, T155D) and expressed them in a ste2Δ background to isolate their effects.

      As shown in the Supplementary Figure, these mutants failed to rescue cells from α-factor-induced programmed cell death (PCD) upon metabolite pre-treatment. This was confirmed through fluorometry-based viability assays, FUN1<sup>TM</sup> staining, and p-Fus3 signaling analysis, which collectively monitor MAPK pathway activation (Figure 4c–e).

      Importantly, the induction of PCD in response to α-factor in these mutants demonstrates that G protein coupling is still functionally intact, indicating that the mutations do not interfere with Gpa1 binding itself. However, the absence of rescue by metabolites strongly suggests that the mutated residues play a direct role in metabolite binding at the Ste2p–Gpa1p interface, thus modulating downstream signaling.

      While further MD simulations could provide structural insight into the isolated mutant receptor–G protein interaction, our experimental data supports the functional relevance of metabolite binding at the identified interface.

      (3) While the experiments, performed by the authors, do support the hypothesis that metabolites regulate GPCR signaling, there are no experiments evaluating direct biophysical measurements (e.g., dissociation constants are measured only in silicon).

      We thank Reviewer #3 for raising these insightful comments. We would like to mention that we performed an array of methods to validate our hypothesis and observed similar rescue effects. These assays include:

      a. Cell viability assay (FDA/PI Flourometry- based)

      b. Cell growth assay

      c. FUN1<sup>TM</sup>-based microscopy assessment

      d. Shmoo formation assays

      e. Mating assays

      f. Site-directed mutagenesis-based loss of function

      g. Transgenic reporter-based assay

      h. MAPK signaling assessment using Western blot.

      i. And via computational techniques.

      Concerning the direct biophysical measurements of Ste2p and metabolites, we made significant efforts to purify Ste2p by incorporating a His tag at the N-terminal, with the goal of performing Microscale Thermophoresis (MST) and Isothermal Titration Calorimetry (ITC) measurements. Despite dedicated attempts over the past year, we were unsuccessful in purifying the protein, primarily due to our limited expertise in protein purification for this specific system. As a result, we opted for genetic-based interventions (e.g., point mutants), which provide a more physiological and comprehensive approach to demonstrating the interaction between Ste2p and the metabolites.

      Furthermore, in addition to the clarification above, we have added the following statement in the discussion section to tone down our claims: “A critical limitation of our study is the absence of direct binding assays to validate the interaction between the metabolites and Ste2p. While our results from genetic interventions, molecular dynamics simulations, and docking studies strongly suggest that the metabolites interact with the Ste2p-Gpa1 interface, these findings remain indirect. Direct binding confirmation through techniques such as surface plasmon resonance, isothermal titration calorimetry, or co-crystallization would provide definitive evidence of this interaction. Addressing this limitation in future work would significantly strengthen our conclusions and provide deeper insights into the precise molecular mechanisms underlying the observed phenotypic effects.”

      (4) The authors do not discuss the effects of the metabolites at their physiological concentrations. Overall, this manuscript represents a field-advancing contribution at the intersection of AI-based ligand discovery and GPCR signaling regulation.

      We thank reviewer #3 for this comment and for recognising the value of our work. Although direct quantification of intracellular free metabolite levels is challenging, several lines of evidence support the physiological relevance of our test concentrations.

      - Genetic validation supports endogenous relevance: Our genetic screen of 53 metabolic knockout mutants showed that deletions in biosynthetic pathways for these metabolites consistently disrupted the α-factor-induced cell death, with the vast majority of strains (94.4%) resisting the α-factor-induced cell death, and notably, a subset even displayed accelerated growth in the presence of α‑factor. This suggests that endogenous levels of these metabolites normally provide some degree of protection, supporting their physiological role in GPCR regulation.

      - Metabolomics confirms in vivo accumulation: Our untargeted metabolomics analysis revealed that α-factor-treated survivors consistently showed enrichment of CoQ6 and zymosterol compared to sensitive cells. This demonstrates that these metabolites naturally accumulate to protective levels during stress responses, validating their biological relevance.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      This study investigates the sex determination mechanism in the clonal ant Ooceraea biroi, focusing on a candidate complementary sex determination (CSD) locus-one of the key mechanisms supporting haplodiploid sex determination in hymenopteran insects. Using whole genome sequencing, the authors analyze diploid females and the rarely occurring diploid males of O. biroi, identifying a 46 kb candidate region that is consistently heterozygous in females and predominantly homozygous in diploid males. This region shows elevated genetic diversity, as expected under balancing selection. The study also reports the presence of an lncRNA near this heterozygous region, which, though only distantly related in sequence, resembles the ANTSR lncRNA involved in female development in the Argentine ant, Linepithema humile (Pan et al. 2024). Together, these findings suggest a potentially conserved sex determination mechanism across ant species. However, while the analyses are well conducted and the paper is clearly written, the insights are largely incremental. The central conclusion - that the sex determination locus is conserved in ants - was already proposed and experimentally supported by Pan et al. (2024), who included O. biroi among the studied species and validated the locus's functional role in the Argentine ant. The present study thus largely reiterates existing findings without providing novel conceptual or experimental advances.

      Although it is true that Pan et al., 2024 demonstrated (in Figure 4 of their paper) that the synteny of the region flanking ANTSR is conserved across aculeate Hymenoptera (including O. biroi), Reviewer 1’s claim that that paper provides experimental support for the hypothesis that the sex determination locus is conserved in ants is inaccurate. Pan et al., 2024 only performed experimental work in a single ant species (Linepithema humile) and merely compared reference genomes of multiple species to show synteny of the region, rather than functionally mapping or characterizing these regions.

      Other comments:

      The mapping is based on a very small sample size: 19 females and 16 diploid males, and these all derive from a single clonal line. This implies a rather high probability for false-positive inference. In combination with the fact that only 11 out of the 16 genotyped males are actually homozygous at the candidate locus, I think a more careful interpretation regarding the role of the mapped region in sex determination would be appropriate. The main argument supporting the role of the candidate region in sex determination is based on the putative homology with the lncRNA involved in sex determination in the Argentine ant, but this argument was made in a previous study (as mentioned above).

      Our main argument supporting the role of the candidate region in sex determination is not based on putative homology with the lncRNA in L. humile. Instead, our main argument comes from our genetic mapping (in Fig. 2), and the elevated nucleotide diversity within the identified region (Fig. 4). Additionally, we highlight that multiple genes within our mapped region are homologous to those in mapped sex determining regions in both L. humile and Vollenhovia emeryi, possibly including the lncRNA.

      In response to the Reviewer’s assertion that the mapping is based on a small sample size from a single clonal line, we want to highlight that we used all diploid males available to us. Although the primary shortcoming of a small sample size is to increase the probability of a false negative, small sample sizes can also produce false positives. We used two approaches to explore the statistical robustness of our conclusions. First, we generated a null distribution by randomly shuffling sex labels within colonies and calculating the probability of observing our CSD index values by chance (shown in Fig. 2). Second, we directly tested the association between homozygosity and sex using Fisher’s Exact Test (shown in Supplementary Fig. S2). In both cases, the association of the candidate locus with sex was statistically significant after multiple-testing correction using the Benjamini-Hochberg False Discovery Rate. These approaches are clearly described in the “CSD Index Mapping” section of the Methods.

      We also note that, because complementary sex determination loci are expected to evolve under balancing selection, our finding that the mapped region exhibits a peak of nucleotide diversity lends orthogonal support to the notion that the mapped locus is indeed a complementary sex determination locus.

      The fourth paragraph of the results and the sixth paragraph of the discussion are devoted to explaining the possible reasons why only 11/16 genotyped males are homozygous in the mapped region. The revised manuscript will include an additional sentence (in what will be lines 384-388) in this paragraph that includes the possible explanation that this locus is, in fact, a false positive, while also emphasizing that we find this possibility to be unlikely given our multiple lines of evidence.

      In response to Reviewer 1’s suggestion that we carefully interpret the role of the mapped region in sex determination, we highlight our careful wording choices, nearly always referring to the mapped locus as a “candidate sex determination locus” in the title and throughout the manuscript. For consistency, the revised manuscript version will change the second results subheading from “The O. biroi CSD locus is homologous to another ant sex determination locus but not to honeybee csd” to “O. biroi’s candidate CSD locus is homologous to another ant sex determination locus but not to honeybee csd,” and will add the word “candidate” in what will be line 320 at the beginning of the Discussion, and will change “putative” to “candidate” in what will be line 426 at the end of the Discussion.

      In the abstract, it is stated that CSD loci have been mapped in honeybees and two ant species, but we know little about their evolutionary history. But CSD candidate loci were also mapped in a wasp with multi-locus CSD (study cited in the introduction). This wasp is also parthenogenetic via central fusion automixis and produces diploid males. This is a very similar situation to the present study and should be referenced and discussed accordingly, particularly since the authors make the interesting suggestion that their ant also has multi-locus CSD and neither the wasp nor the ant has tra homologs in the CSD candidate regions. Also, is there any homology to the CSD candidate regions in the wasp species and the studied ant?

      In response to Reviewer 1’s suggestion that we reference the (Matthey-Doret et al. 2019) study in the context of diploid males being produced via losses of heterozygosity during asexual reproduction, the revised manuscript will include (in what will be lines 123-126) the highlighted portion of the following sentence: “Therefore, if O. biroi uses CSD, diploid males might result from losses of heterozygosity at sex determination loci (Fig. 1C), similar to what is thought to occur in other asexual Hymenoptera that produce diploid males (Rabeling and Kronauer 2012; Matthey-Doret et al. 2019).”

      We note, however, that in their 2019 study, Matthey-Doret et al. did not directly test the hypothesis that diploid males result from losses of heterozygosity at CSD loci during asexual reproduction, because the diploid males they used for their mapping study came from inbred crosses in a sexual population of that species.

      We address this further below, but we want to emphasize that we do not intend to argue that O. biroi has multiple CSD loci. Instead, we suggest that additional, undetected CSD loci is one possible explanation for the absence of diploid males from any clonal line other than clonal line A. In response to Reviewer 1’s suggestion that we reference the (Matthey-Doret et al. 2019) study in the context of multilocus CSD, the revised manuscript version will include the following additional sentence in the fifth paragraph of the discussion (in what will be lines 372-374): “Multi-locus CSD has been suggested to limit the extent of diploid male production in asexual species under some circumstances (Vorburger 2013; Matthey-Doret et al. 2019).”

      Regarding Reviewer 2’s question about homology between the putative CSD loci from the (Matthey-Doret et al. 2019) study and O. biroi, we note that there is no homology. The revised manuscript version will have an additional Supplementary Table (which will be the new Supplementary Table S3) that will report the results of this homology search. The revised manuscript will also include the following additional sentence in the Results, in what will be lines 172-174: “We found no homology between the genes within the O. biroi CSD index peak and any of the genes within the putative L. fabarum CSD loci (Supplementary Table S3).”

      The authors used different clonal lines of O. biroi to investigate whether heterozygosity at the mapped CSD locus is required for female development in all clonal lines of O. biroi (L187-196). However, given the described parthenogenesis mechanism in this species conserves heterozygosity, additional females that are heterozygous are not very informative here. Indeed, one would need diploid males in these other clonal lines as well (but such males have not yet been found) to make any inference regarding this locus in other lines.

      We agree that a full mapping study including diploid males from all clonal lines would be preferable, but as stated earlier in that same paragraph, we have only found diploid males from clonal line A. We stand behind our modest claim that “Females from all six clonal lines were heterozygous at the CSD index peak, consistent with its putative role as a CSD locus in all O. biroi.” In the revised manuscript version, this sentence (in what will be lines 199-201) will be changed slightly in response to a reviewer comment below: “All females from all six clonal lines (including 26 diploid females from clonal line B) were heterozygous at the CSD index peak, consistent with its putative role as a CSD locus in all O. biroi.”

      Reviewer #2 (Public review):

      The manuscript by Lacy et al. is well written, with a clear and compelling introduction that effectively conveys the significance of the study. The methods are appropriate and well-executed, and the results, both in the main text and supplementary materials, are presented in a clear and detailed manner. The authors interpret their findings with appropriate caution.

      This work makes a valuable contribution to our understanding of the evolution of complementary sex determination (CSD) in ants. In particular, it provides important evidence for the ancient origin of a non-coding locus implicated in sex determination, and shows that, remarkably, this sex locus is conserved even in an ant species with a non-canonical reproductive system that typically does not produce males. I found this to be an excellent and well-rounded study, carefully analyzed and well contextualized.

      That said, I do have a few minor comments, primarily concerning the discussion of the potential 'ghost' CSD locus. While the authors acknowledge (line 367) that they currently have no data to distinguish among the alternative hypotheses, I found the evidence for an additional CSD locus presented in the results (lines 261-302) somewhat limited and at times a bit difficult to follow. I wonder whether further clarification or supporting evidence could already be extracted from the existing data. Specifically:

      We agree with Reviewer 2 that the evidence for a second CSD locus is limited. In fact, we do not intend to advocate for there being a second locus, but we suggest that a second CSD locus is one possible explanation for the absence of diploid males outside of clonal line A. In our initial version, we intentionally conveyed this ambiguity by titling this section “O. biroi may have one or multiple sex determination loci.” However, we now see that this leads to undue emphasis on the possibility of a second locus. In the revised manuscript, we will split this into two separate sections: “Diploid male production differs across O. biroi clonal lines” and “O. biroi lacks a tra-containing CSD locus.”

      (1) Line 268: I doubt the relevance of comparing the proportion of diploid males among all males between lines A and B to infer the presence of additional CSD loci. Since the mechanisms producing these two types of males differ, it might be more appropriate to compare the proportion of diploid males among all diploid offspring. This ratio has been used in previous studies on CSD in Hymenoptera to estimate the number of sex loci (see, for example, Cook 1993, de Boer et al. 2008, 2012, Ma et al. 2013, and Chen et al., 2021). The exact method might not be applicable to clonal raider ants, but I think comparing the percentage of diploid males among the total number of (diploid) offspring produced between the two lineages might be a better argument for a difference in CSD loci number.

      We want to re-emphasize here that we do not wish to advocate for there being two CSD loci in O. biroi. Rather, we want to explain that this is one possible explanation for the apparent absence of diploid males outside of clonal line A. We hope that the modifications to the manuscript described in the previous response help to clarify this.

      Reviewer 2 is correct that comparing the number of diploid males to diploid females does not apply to clonal raider ants. This is because males are vanishingly rare among the vast numbers of females produced. We do not count how many females are produced in laboratory stock colonies, and males are sampled opportunistically. Therefore, we cannot report exact numbers. However, we will add the highlighted portion of the following sentence (in what will be lines 268-270) to the revised manuscript: “Despite the fact that we maintain more colonies of clonal line B than of clonal line A in the lab, all the diploid males we detected came from clonal line A.”

      (2) If line B indeed carries an additional CSD locus, one would expect that some females could be homozygous at the ANTSR locus but still viable, being heterozygous only at the other locus. Do the authors detect any females in line B that are homozygous at the ANTSR locus? If so, this would support the existence of an additional, functionally independent CSD locus.

      We thank the reviewer for this suggestion, and again we emphasize that we do not want to argue in favor of multiple CSD loci. We just want to introduce it as one possible explanation for the absence of diploid males outside of clonal line A.

      The 26 sequenced diploid females from clonal line B are all heterozygous at the mapped locus, and the revised manuscript will clarify this in what will be lines 199-201. Previously, only six of those diploid females were included in Supplementary Table S2, and that will be modified accordingly.

      (3) Line 281: The description of the two tra-containing CSD loci as "conserved" between Vollenhovia and the honey bee may be misleading. It suggests shared ancestry, whereas the honey bee csd gene is known to have arisen via a relatively recent gene duplication from fem/tra (10.1038/nature07052). It would be more accurate to refer to this similarity as a case of convergent evolution rather than conservation.

      In the sentence that Reviewer 2 refers to, we are representing the assertion made in the (Miyakawa and Mikheyev 2015) paper in which, regarding their mapping of a candidate CSD locus that contains two linked tra homologs, they write in the abstract: “these data support the prediction that the same CSD mechanism has indeed been conserved for over 100 million years.” In that same paper, Miyakawa and Mikheyev write in the discussion section: “As ants and bees diverged more than 100 million years ago, sex determination in honey bees and V. emeryi is probably homologous and has been conserved for at least this long.”

      As noted by Reviewer 2, this appears to conflict with a previously advanced hypothesis: that because fem and csd were found in Apis mellifera, Apis cerana, and Apis dorsata, but only fem was found in Mellipona compressipes, Bombus terrestris, and Nasonia vitripennis, that the csd gene evolved after the honeybee (Apis) lineage diverged from other bees (Hasselmann et al. 2008). However, it remains possible that the csd gene evolved after ants and bees diverged from N. vitripennis, but before the divergence of ants and bees, and then was subsequently lost in B. terrestris and M. compressipes. This view was previously put forward based on bioinformatic identification of putative orthologs of csd and fem in bumblebees and in ants [(Schmieder et al. 2012), see also (Privman et al. 2013)]. However, subsequent work disagreed and argued that the duplications of tra found in ants and in bumblebees represented convergent evolution rather than homology (Koch et al. 2014). Distinguishing between these possibilities will be aided by additional sex determination locus mapping studies and functional dissection of the underlying molecular mechanisms in diverse Aculeata.

      Distinguishing between these competing hypotheses is beyond the scope of our paper, but the revised manuscript will include additional text to incorporate some of this nuance. We will include these modified lines below (in what will be lines 287-295), with the additions highlighted:

      “A second QTL region identified in V. emeryi (V.emeryiCsdQTL1) contains two closely linked tra homologs, similar to the closely linked honeybee tra homologs, csd and fem (Miyakawa and Mikheyev 2015). This, along with the discovery of duplicated tra homologs that undergo concerted evolution in bumblebees and ants (Schmieder et al. 2012; Privman et al. 2013) has led to the hypothesis that the function of tra homologs as CSD loci is conserved with the csd-containing region of honeybees (Schmieder et al. 2012; Miyakawa and Mikheyev 2015). However, other work has suggested that tra duplications occurred independently in honeybees, bumblebees, and ants (Hasselmann et al. 2008; Koch et al. 2014), and it remains to be demonstrated that either of these tra homologs acts as a primary CSD signal in V. emeryi.”

      (4) Finally, since the authors successfully identified multiple alleles of the first CSD locus using previously sequenced haploid males, I wonder whether they also observed comparable allelic diversity at the candidate second CSD locus. This would provide useful supporting evidence for its functional relevance.

      As is already addressed in the final paragraph of the results and in Supplementary Fig. S4, there is no peak of nucleotide diversity in any of the regions homologous to V.emeryiQTL1, which is the tra-containing candidate sex determination locus (Miyakawa and Mikheyev 2015). In the revised manuscript, the relevant lines will be 307-310. We want to restate that we do not propose that there is a second candidate CSD locus in O. biroi, but we simply raise the possibility that multi-locus CSD *might* explain the absence of diploid males from clonal lines other than clonal line A (as one of several alternative possibilities).

      Overall, these are relatively minor points in the context of a strong manuscript, but I believe addressing them would improve the clarity and robustness of the authors' conclusions.

      Reviewer #3 (Public review):

      Summary:

      The sex determination mechanism governed by the complementary sex determination (CSD) locus is one of the mechanisms that support the haplodiploid sex determination system evolved in hymenopteran insects. While many ant species are believed to possess a CSD locus, it has only been specifically identified in two species. The authors analyzed diploid females and the rarely occurring diploid males of the clonal ant Ooceraea biroi and identified a 46 kb CSD candidate region that is consistently heterozygous in females and predominantly homozygous in males. This region was found to be homologous to the CSD locus reported in distantly related ants. In the Argentine ant, Linepithema humile, the CSD locus overlaps with an lncRNA (ANTSR) that is essential for female development and is associated with the heterozygous region (Pan et al. 2024). Similarly, an lncRNA is encoded near the heterozygous region within the CSD candidate region of O. biroi. Although this lncRNA shares low sequence similarity with ANTSR, its potential functional involvement in sex determination is suggested. Based on these findings, the authors propose that the heterozygous region and the adjacent lncRNA in O. biroi may trigger female development via a mechanism similar to that of L. humile. They further suggest that the molecular mechanisms of sex determination involving the CSD locus in ants have been highly conserved for approximately 112 million years. This study is one of the few to identify a CSD candidate region in ants and is particularly noteworthy as the first to do so in a parthenogenetic species.

      Strengths:

      (1) The CSD candidate region was found to be homologous to the CSD locus reported in distantly related ant species, enhancing the significance of the findings.

      (2) Identifying the CSD candidate region in a parthenogenetic species like O. biroi is a notable achievement and adds novelty to the research.

      Weaknesses

      (1) Functional validation of the lncRNA's role is lacking, and further investigation through knockout or knockdown experiments is necessary to confirm its involvement in sex determination.

      See response below.

      (2) The claim that the lncRNA is essential for female development appears to reiterate findings already proposed by Pan et al. (2024), which may reduce the novelty of the study.

      We do not claim that the lncRNA is essential for female development in O. biroi, but simply mention the possibility that, as in L. humile, it is somehow involved in sex determination. We do not have any functional evidence for this, so this is purely based on its genomic position immediately adjacent to our mapped candidate region. We agree with the reviewer that the study by Pan et al. (2024) decreases the novelty of our findings. Another way of looking at this is that our study supports and bolsters previous findings by partially replicating the results in a different species.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      L307-308 should state homozygous for either allele in THE MAJORITY of diploid males.

      This will be fixed in the revised manuscript, in what will be line 321.

      Reviewer #3 (Recommendations for the authors):

      The association between heterozygosity in the CSD candidate region and female development in O. biroi, along with the high sequence homology of this region to CSD loci identified in two distantly related ant species, is not sufficient to fully address the evolution of the CSD locus and the mechanisms of sex determination.

      Given that functional genetic tools, such as genome editing, have already been established in O. biroi, I strongly recommend that the authors investigate the role of the lncRNA through knockout or knockdown experiments and assess its impact on the sex-specific splicing pattern of the downstream tra gene.

      Although knockout experiments of the lncRNA would be illuminating, the primary signal of complementary sex determination is heterozygosity. As is clearly stated in our manuscript and that of (Pan et al. 2024), it does not appear to be heterozygosity within the lncRNA that induces female development, but rather heterozygosity in non-transcribed regions linked to the lncRNA. Therefore, future mechanistic studies of sex determination in O. biroi, L. humile, and other ants should explore how homozygosity or heterozygosity of this region impacts the sex determination cascade, rather than focusing (exclusively) on the lncRNA.

      With this in mind, we developed three sets of guide RNAs that cut only one allele within the mapped CSD locus, with the goal of producing deletions within the highly variable region within the mapped locus. This would lead to functional hemizygosity or homozygosity within this region, depending on how the cuts were repaired. We also developed several sets of PCR primers to assess the heterozygosity of the resultant animals. After injecting 1,162 eggs over several weeks and genotyping the hundreds of resultant animals with PCR, we confirmed that we could induce hemizygosity or homozygosity within this region, at least in ~1/20 of the injected embryos. Although it is possible to assess the sex-specificity of the splice isoform of tra as a proxy for sex determination phenotypes (as done by (Pan et al. 2024)), the ideal experiment would assess male phenotypic development at the pupal stage. Therefore, over several more weeks, we injected hundreds more eggs with these reagents and reared the injected embryos to the pupal stage. However, substantial mortality was observed, with only 12 injected eggs developing to the pupal stage. All of these were female, and none of them had been successfully mutated.

      In conclusion, we agree with the reviewer that functional experiments would be useful, and we made extensive attempts to conduct such experiments. However, these experiments turned out to be extremely challenging with the currently available protocols. Ultimately, we therefore decided to abandon these attempts.  

      We opted not to include these experiments in the paper itself because we cannot meaningfully interpret their results. However, we are pleased that, in this response letter, we can include a brief description for readers interested in attempting similar experiments.

      Since O. biroi reproduces parthenogenetically and most offspring develop into females, observing a shift from female- to male-specific splicing of tra upon early embryonic knockout of the lncRNA would provide much stronger evidence that this lncRNA is essential for female development. Without such functional validation, the authors' claim (lines 36-38) seems to reiterate findings already proposed by Pan et al. (2024) and, as such, lacks sufficient novelty.

      We have responded to the issue of “lack of novelty” above. But again, the actual CSD locus in both O. biroi and L. humile appears to be distinct from (but genetically linked to) the lncRNA, and we have no experimental evidence that the putative lncRNA in O. biroi is involved in sex determination at all. Because of this, and given the experimental challenges described above, we do not currently intend to pursue functional studies of the lncRNA.

      References

      Hasselmann M, Gempe T, Schiøtt M, Nunes-Silva CG, Otte M, Beye M. 2008. Evidence for the evolutionary nascence of a novel sex determination pathway in honeybees. Nature 454:519–522.

      Koch V, Nissen I, Schmitt BD, Beye M. 2014. Independent Evolutionary Origin of fem Paralogous Genes and Complementary Sex Determination in Hymenopteran Insects. PLOS ONE 9:e91883.

      Matthey-Doret C, van der Kooi CJ, Jeffries DL, Bast J, Dennis AB, Vorburger C, Schwander T. 2019. Mapping of multiple complementary sex determination loci in a parasitoid wasp. Genome Biology and Evolution 11:2954–2962.

      Miyakawa MO, Mikheyev AS. 2015. QTL mapping of sex determination loci supports an ancient pathway in ants and honey bees. PLOS Genetics 11:e1005656.

      Pan Q, Darras H, Keller L. 2024. LncRNA gene ANTSR coordinates complementary sex determination in the Argentine ant. Science Advances 10:eadp1532.

      Privman E, Wurm Y, Keller L. 2013. Duplication and concerted evolution in a master sex determiner under balancing selection. Proceedings of the Royal Society B: Biological Sciences 280:20122968.

      Rabeling C, Kronauer DJC. 2012. Thelytokous parthenogenesis in eusocial Hymenoptera. Annual Review of Entomology 58:273–292.

      Schmieder S, Colinet D, Poirié M. 2012. Tracing back the nascence of a new sex-determination pathway to the ancestor of bees and ants. Nature Communications 3:1–7.

      Vorburger C. 2013. Thelytoky and Sex Determination in the Hymenoptera: Mutual Constraints. Sexual Development 8:50–58.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review): 

      Jiang et al. present a measure of phenological lag by quantifying the effects of abiotic constraints on the differences between observed and expected phenological changes, using a combination of previously published phenology change data for 980 species, and associated climate data for study sites. They found that, across all samples, observed phenological responses to climate warming were smaller than expected responses for both leafing and flowering spring events. They also show that data from experimental studies included in their analysis exhibited increased phenological lag compared to observational studies, possibly as a result of reduced sensitivity to climatic changes. Furthermore, the authors present evidence that spatial trends in phenological responses to warming may differ than what would be expected from phenological sensitivity, due to the seasonal timing of when warming occurs. Thus, climate change may not result in geographic convergences of phenological responses. This study presents an interesting way to separate the individual effects of climate change and other abiotic changes on the phenological responses across sites and species. 

      Strengths: 

      A straightforward mathematical definition of phenological lag allows for this method to potentially be applied in different geographic contexts. Where data exists, other researchers can partition the effects of various abiotic forcings on phenological responses that differ from those expected from warming sensitivity alone. 

      Identifying phenological lag, and associated contributing factors, provides a method by which more nuanced predictions of phenological responses to climate change can be made. Thus, this study could improve ecological forecasting models. 

      Weaknesses: 

      The analysis here could be more robust. A more thorough examination of phenological lag would provide stronger evidence that the framework presented has utility. The differences in phenologica lag by study approach, species origin, region, and growth form are interesting, and could be expanded. For example, the authors have the data to explore the relationships between phenological lag and the quantitative variables included in the final model (altitude, latitude, mean annual temperature) and other spatial or temporal variables. This would also provide stronger evidence for the author's claims about potential mechanisms that contribute to phenological lag. 

      We did examine the relationships of phenological lag with geographic or climatic variables in our analyses. Other than the weak correlations with latitude and altitude cited in the discussion section (lines 292-293), phenological lag was not related to mean annual temperature or long-term precipitation for both leafing and flowering.  

      The authors include very little data visualizations, and instead report results and model statistics in tables. This is difficult to interpret and may obscure underlying patterns in the data. Including visual representations of variable distributions and between-variable relationships, in addition to model statistics, provides stronger evidence than model statistics alone. 

      Table 2 shows the influences of geographic or climatic variables, particularly those related to drivers of spring phenology, i.e., budburst temperature, forcing change, and phenological lag, on phenological changes. As quantitative contributions of these drivers have been extracted, the influences of remaining variables are either minor or insignificant. Thus, examination of variable distributions, which has been done in previous syntheses, is probably not necessary.         

      Some of independent variables were apparently correlated (r <0.6), e.g., MAT with altitude and latitude, budburst temperature with forcing change and spring warming, and forcing change with spring warming.

      Reviewer #3 (Public review): 

      Summary: 

      The authors developed a new phenological lag metric and applied this analytical framework to a global dataset to synthesize shifts in spring phenology and assess how abiotic constraints influence spring phenology. 

      Strengths: 

      The dataset developed in this study is extensive, and the phenological lag metric is valuable. 

      Weaknesses: 

      The stability of the method used in this study needs improvement, particularly in the calculation of forcing requirements. In addition, the visualization of the results (such as Table 1) should be enhanced. 

      Not clear how to improve the calculation of forcing accumulation.    

      Recommendations for the authors: 

      Editor (Recommendations for the authors): 

      To improve the robustness of the metric and the conclusions drawn, we recommend that the authors: 

      Test the sensitivity of their results to different base temperature thresholds and to nonlinear forcing response models, even for a subset of species. The proposed framework relies on an accurate understanding of species-specific thermal responses, which remain poorly resolved.

      Different above-zero base temperatures have been used previously, although justifications are mostly following previous work. As we indicated in our first responses, the use of above-zero base temperatures underestimates forcing from low temperatures, which has more impacts on species with early spring phenology or in areas of cold climate due to greater proportions of forcing accumulations from low temperatures. The use of high base temperatures can lead to an interpretation that early season species require little or no forcing to break buds, which is biologically incorrect. Thus, the use of above-zero base temperatures would be more appropriate for particular locations or species than for meta-analysis across different spring phenology and climatic conditions. 

      The research on multiple warming is limited in terms of levels of warming used (mostly one and occasionally two) for assessing non-linear forcing responses. This can be the focus of future work.  

      Our framework is based on drivers of spring phenology and not dependent on “accurate understanding of species-specific thermal responses”. However, a good understanding of species- and site-specific responses to phenological constraints (e.g., insufficient winter chilling, photoperiod, and environmental stresses) does help determine the nature of phenological lag. All these are explained in our paper.     

      Analyze relationships between phenological lag and additional geographic or climatic gradients already available in the dataset (e.g., latitude, mean annual temperature, interannual variability). 

      We did examine the relationships of phenological lag with geographic or climatic variables in our analyses. Other than the weak correlations with latitude and altitude cited in the discussion section (lines 292-293), phenological lag was not related to mean annual temperature or long-term precipitation for both leafing and flowering.  

      Our objective is to understand changes in spring phenology and differences in plant phenological responses across different functional groups or climatic regions, although our approach can be used to study interannual variability of spring phenology. Our metadata are compiled for comparing warmer vs control treatments (often multiyear averages), not for assessing interannual variability.      

      Improve data visualization to better convey how phenological lag varies across environmental and biological contexts. 

      See responses above.

      Consider incorporating explicit uncertainty estimates around phenological lag calculations.  These steps would improve the interpretability and generalizability of the framework, helping it move from a useful heuristic to a more robust and empirically grounded analytical tool. 

      The calculation of phenological lag is based on drivers of spring phenology with uncertainty depending primarily on uncertainty in phenological observations. Previous uncertainty assessments can be used here (see a few selected studies below).   

      Alles, G.R., Comba, J.L., Vincent, J.M., Nagai, S. and Schnorr, L.M., 2020. Measuring phenology uncertainty with large scale image processing. Ecological Informatics, 59, p.101109.

      Liu G, Chuine I, Denéchère R, Jean F, Dufrêne E, Vincent G, Berveiller D, Delpierre N. Higher sample sizes and observer intercalibration are needed for reliable scoring of leaf phenology in trees. Journal of Ecology. 2021 Jun;109(6):2461-74.

      Tang, J., Körner, C., Muraoka, H., Piao, S., Shen, M., Thackeray, S.J. and Yang, X., 2016.Emerging opportunities and challenges in phenology: a review. Ecosphere, 7(8), p.e01436. 

      Nagai, S., Inoue, T., Ohtsuka, T., Yoshitake, S., Nasahara, K.N. and Saitoh, T.M., 2015. Uncertainties involved in leaf fall phenology detected by digital camera. Ecological Informatics, 30, pp.124-132.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer#1 (Public Review):

      In the current article, Octavia Soegyono and colleagues study "The influence of nucleus accumbens shell D1 and D2 neurons on outcome-specific Pavlovian instrumental transfer", building on extensive findings from the same lab. While there is a consensus about the specific involvement of the Shell part of the Nucleus Accumbens (NAc) in specific stimulus-based actions in choice settings (and not in General Pavlovian instrumental transfer - gPIT, as opposed to the Core part of the NAc), mechanisms at the cellular and circuitry levels remain to be explored. In the present work, using sophisticated methods (rat Cre-transgenic lines from both sexes, optogenetics and the well-established behavioral paradigm outcome-specific PIT - sPIT), Octavia Soegyono and colleagues decipher the diOerential contribution of dopamine receptors D1 and D2 expressing-spiny projection neurons (SPNs).

      After validating the viral strategy and the specificity of the targeting (immunochemistry and electrophysiology), the authors demonstrate that while both NAc Shell D1- and D2SPNs participate in mediating sPIT, NAc Shell D1-SPNs projections to the Ventral Pallidum (VP, previously demonstrated as crucial for sPIT), but not D2-SPNs, mediates sPIT. They also show that these eOects were specific to stimulus-based actions, as valuebased choices were left intact in all manipulations.

      This is a well-designed study and the results are well supported by the experimental evidence. The paper is extremely pleasant to read and add to the current literature.

      We thank the Reviewer for their positive assessment.

      Comments on revisions:  

      We thank the authors for their detailed responses and for addressing our comments and concerns.

      To further improve consistency and transparency, we kindly request that the authors provide, for Supplemental Figures S1-S4, panels E (raw data for lever presses during the PIT test), the individual data points together with the corresponding statistical analyses in the figure legends.

      Panel E of Figures S1-S4 now includes the individual data points. The outcome-specific data have already been analysed, and we report these analyses in the main manuscript. These analyses are more informative than those requested by the Reviewer since they report the net eFects of the stimuli on choice between actions while controlling for potential individual baseline instrumental performance. All data remain fully transparent and are publicly available on an online repository in accordance with eLife policies (see relevant section in Materials and Methods).  

      In addition, regarding Supplemental Figure S3, panel E, we note the absence of a PIT eOect in the eYFP group under the ON condition, which appears to diOer from the net response reported in the main Figure 5, panel B. Could the authors clarify this apparent discrepancy?

      We apologize for the error, which has now been corrected. 

      We also note a discrepancy between the authors' statement in their response ("40 rats excluded based on post-mortem analyses") and the number of excluded animals reported in the Materials and Methods section, which adds up to 47. We kindly ask the authors to clarify this point for consistency.

      We thank the Reviewer for identifying the error reported in our initial response. The total number of animals excluded was 47, as reported in the manuscript. 

      Finally, as a minor point, we suggest indicating the total number of animals used in the study in the Materials and Methods section.

      The total number of animals has been included in the Materials and Methods section.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript by Soegyono et a. describes a series of experiments designed to probe the involvement of dopamine D1 and D2 neurons within the nucleus accumbens shell in outcome-specific Pavlovian-instrumental transfer (osPIT), a well-controlled assay of cueguided action selection based on congruent outcome associations. They used an optogenetic approach to phasically silence NAc shell D1 (D1-Cre mice) or D2 (A2a-Cre mice) neurons during a subset of osPIT trials. Both manipulations disrupted cue-guided action selection but had no eOects on negative control measures/tasks (concomitant approach behavior, separate valued guided choice task), nor were any osPIT impairments found in reporter only control groups. Separate experiments revealed that selective inhibition of NAc shell D1 but not D2 inputs to ventral pallidum were required for osPIT expression, thereby advancing understanding of the basal ganglia circuitry underpinning this important aspect of decision making.

      Strengths:

      The combinatorial viral and optogenetic approaches used here were convincingly validated through anatomical tract-tracing and ex vivo electrophysiology. The behavioral assays are sophisticated and well-controlled to parse cue and value guided action selection. The inclusion of reporter only control groups is rigorous and rules out nonspecific eOects of the light manipulation. The findings are novel and address a critical question in the literature. Prior work using less decisive methods had implicated NAc shell D1 neurons in osPIT but suggested that D2 neurons may not be involved. The optogenetic manipulations used in the current study provides a more direct test of their involvement and convincingly demonstrate that both populations play an important role. Prior work had also implicated NAc shell connections to ventral pallidum in osPIT, but the current study reveals the selective involvement of D1 but not D2 neurons in this circuit. The authors do a good job of discussing their findings, including their nuanced interpretation that NAc shell D2 neurons may contribute to osPIT through their local regulation of NAc shell microcircuitry.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      The current study exclusively used an optogenetic approach to probe the function of D1 and D2 NAc shell neurons. Providing a complementary assessment with chemogenetics or other appropriate methods would strengthen conclusions, particularly the novel demonstration for D2 NAc shell involvement. Likewise, the null result of optically inhibiting D2 inputs to ventral pallidum leaves open the possibility that a more complete or sustained disruption of this pathway may have impaired osPIT.

      We acknowledge the reviewer's valuable suggestion that demonstrating NAc-S D1- and D2-SPNs engagement in outcome-specific PIT through another technique would strengthen our optogenetic findings. Several approaches could provide this validation. Chemogenetic manipulation, as the reviewer suggested, represents one compelling option. Alternatively, immunohistochemical assessment of phosphorylated histone H3 at serine 10 (P-H3) oFers another promising avenue, given its established utility in reporting striatal SPNs plasticity in the dorsal striatum (Matamales et al., 2020). We hope to complete such an assessment in future work since it would address the limitations of previous work that relied solely on ERK1/2 phosphorylation measures in NAc-S SPNs (Laurent et al., 2014). The manuscript was modified to report these future avenues of research (page 12). 

      Regarding the null result from optical silencing of D2 terminals in the ventral pallidum, we agree with the reviewer's assessment. While we acknowledge this limitation in the current manuscript (page 13), we aim to address this gap in future studies to provide a more complete mechanistic understanding of the circuit.

      Conclusions:

      The research described here was successful in providing critical new insights into the contributions of NAc D1 and D2 neurons in cue-guided action selection. The authors' data interpretation and conclusions are well reasoned and appropriate. They also provide a thoughtful discussion of study limitations and implications for future research. This research is therefore likely to have a significant impact on the field.

      We thank the Reviewer for their positive assessment.

      Comments on revisions:

      I have reviewed the rebuttal and revised manuscript and have no remaining concerns.

      We are pleased to have addressed the Reviewer’s query.

      References

      Laurent, V., Bertran-Gonzalez, J., Chieng, B. C., & Balleine, B. W. (2014). δ-Opioid and Dopaminergic Processes in Accumbens Shell Modulate the Cholinergic Control of Predictive Learning and Choice. J Neurosci, 34(4), 1358-1369. https://doi.org/10.1523/JNEUROSCI.4592-13.2014

      Matamales, M., McGovern, A. E., Mi, J. D., Mazzone, S. B., Balleine, B. W., & BertranGonzalez, J. (2020). Local D2- to D1-neuron transmodulation updates goal-directed learning in the striatum. Science, 367(6477), 549-555. https://doi.org/10.1126/science.aaz5751

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      This paper describes a number of patterns of epistasis in a large fitness landscape dataset recently published by Papkou et al. The paper is motivated by an important goal in the field of evolutionary biology to understand the statistical structure of epistasis in protein fitness landscapes, and it capitalizes on the unique opportunities presented by this new dataset to address this problem. 

      The paper reports some interesting previously unobserved patterns that may have implications for our understanding of fitness landscapes and protein evolution. In particular, Figure 5 is very intriguing. However, I have two major concerns detailed below. First, I found the paper rather descriptive (it makes little attempt to gain deeper insights into the origins of the observed patterns) and unfocused (it reports what appears to be a disjointed collection of various statistics without a clear narrative. Second, I have concerns with the statistical rigor of the work. 

      (1) I think Figures 5 and 7 are the main, most interesting, and novel results of the paper. However, I don't think that the statement "Only a small fraction of mutations exhibit global epistasis" accurately describes what we see in Figure 5. To me, the most striking feature of this figure is that the effects of most mutations at all sites appear to be a mixture of three patterns. The most interesting pattern noted by the authors is of course the "strong" global epistasis, i.e., when the effect of a mutation is highly negatively correlated with the fitness of the background genotype. The second pattern is a "weak" global epistasis, where the correlation with background fitness is much weaker or non-existent. The third pattern is the vertically spread-out cluster at low-fitness backgrounds, i.e., a mutation has a wide range of mostly positive effects that are clearly not correlated with fitness. What is very interesting to me is that all background genotypes fall into these three groups with respect to almost every mutation, but the proportions of the three groups are different for different mutations. In contrast to the authors' statement, it seems to me that almost all mutations display strong global epistasis in at least a subset of backgrounds. A clear example is C>A mutation at site 3. 

      (1a) I think the authors ought to try to dissect these patterns and investigate them separately rather than lumping them all together and declaring that global epistasis is rare. For example, I would like to know whether those backgrounds in which mutations exhibit strong global epistasis are the same for all mutations or whether they are mutation- or perhaps positionspecific. Both answers could be potentially very interesting, either pointing to some specific site-site interactions or, alternatively, suggesting that the statistical patterns are conserved despite variation in the underlying interactions. 

      (1b) Another rather remarkable feature of this plot is that the slopes of the strong global epistasis patterns seem to be very similar across mutations. Is this the case? Is there anything special about this slope? For example, does this slope simply reflect the fact that a given mutation becomes essentially lethal (i.e., produces the same minimal fitness) in a certain set of background genotypes? 

      (1c) Finally, how consistent are these patterns with some null expectations? Specifically, would one expect the same distribution of global epistasis slopes on an uncorrelated landscape? Are the pivot points unusually clustered relative to an expectation on an uncorrelated landscape? 

      (1d) The shapes of the DFE shown in Figure 7 are also quite interesting, particularly the bimodal nature of the DFE in high-fitness (HF) backgrounds. I think this bimodality must be a reflection of the clustering of mutation-background combinations mentioned above. I think the authors ought to draw this connection explicitly. Do all HF backgrounds have a bimodal DFE? What mutations occupy the "moving" peak? 

      (1e) In several figures, the authors compare the patterns for HF and low-fitness (LF) genotypes. In some cases, there are some stark differences between these two groups, most notably in the shape of the DFE (Figure 7B, C). But there is no discussion about what could underlie these differences. Why are the statistics of epistasis different for HF and LF genotypes? Can the authors at least speculate about possible reasons? Why do HF and LF genotypes have qualitatively different DFEs? I actually don't quite understand why the transition between bimodal DFE in Figure 7B and unimodal DFE in Figure 7C is so abrupt. Is there something biologically special about the threshold that separates LF and HF genotypes? My understanding was that this was just a statistical cutoff. Perhaps the authors can plot the DFEs for all backgrounds on the same plot and just draw a line that separates HF and LF backgrounds so that the reader can better see whether the DFE shape changes gradually or abruptly.

      (1f) The analysis of the synonymous mutations is also interesting. However I think a few additional analyses are necessary to clarify what is happening here. I would like to know the extent to which synonymous mutations are more often neutral compared to non-synonymous ones. Then, synonymous pairs interact in the same way as non-synonymous pair (i.e., plot Figure 1 for synonymous pairs)? Do synonymous or non-synonymous mutations that are neutral exhibit less epistasis than non-neutral ones? Finally, do non-synonymous mutations alter epistasis among other mutations more often than synonymous mutations do? What about synonymous-neutral versus synonymous-non-neutral. Basically, I'd like to understand the extent to which a mutation that is neutral in a given background is more or less likely to alter epistasis between other mutations than a non-neutral mutation in the same background. 

      (2) I have two related methodological concerns. First, in several analyses, the authors employ thresholds that appear to be arbitrary. And second, I did not see any account of measurement errors. For example, the authors chose the 0.05 threshold to distinguish between epistasis and no epistasis, but why this particular threshold was chosen is not justified. Another example: is whether the product s12 × (s1 + s2) is greater or smaller than zero for any given mutation is uncertain due to measurement errors. Presumably, how to classify each pair of mutations should depend on the precision with which the fitness of mutants is measured. These thresholds could well be different across mutants. We know, for example, that low-fitness mutants typically have noisier fitness estimates than high-fitness mutants. I think the authors should use a statistically rigorous procedure to categorize mutations and their epistatic interactions. I think it is very important to address this issue. I got very concerned about it when I saw on LL 383-388 that synonymous stop codon mutations appear to modulate epistasis among other mutations. This seems very strange to me and makes me quite worried that this is a result of noise in LF genotypes. 

      Thank you for your review of the manuscript. In the revised version, we have addressed both major criticisms, as detailed below.

      When carefully examining the plots in Figure 5 independently, we indeed observe that the fitness effect of a mutation on different genetic backgrounds can be classified into three characteristic patterns. Our reasoning for these patterns is as follows:

      Strong correlation: Typically observed when the mutation is lethal across backgrounds. Linear regression of mutations exhibiting strong global epistasis shows slopes close to −1 and pivot points near −0.7 (Table S4). Since the reported fitness threshold is −0.508, these mutations push otherwise functional backgrounds into the non-functional range, consistent with lethal effects.

      Weak correlation: Observed when a mutation has no significant effect on fitness across backgrounds, consistent with neutrality.

      No correlation: Out of the 261,333 reported variants, 243,303 (93%) lie below the fitness threshold of −0.508, indicating that the low-fitness region is densely populated by nonfunctional variants. The “strong correlation” and “weak correlation” lines intersect in this zone. Most mutations in this region have little effect (neutral), but occasional abrupt fitness increases correspond to “resurrecting” mutations, the converse of lethal changes. For example, mutations such as X→G at locus 4 or X→A at locus 5 restore function, while the reverse changes (e.g. C→A at locus 3) are lethal.

      Thus, the “no-correlation” pattern is largely explained by mutations that reverse the effect of lethal changes, effectively resurrecting non-functional variants. In the revised manuscript, we highlight these nuances within the broader classification of fitness effect versus background fitness (pp. 10–13).

      Additional analyses included in the revision:

      Synonymous vs. non-synonymous pairs: We repeated the Figure 1 analysis for synonymous–synonymous pairs. As expected, synonymous pairs exhibit lower overall frequencies of epistasis, consistent with their greater neutrality. However, the qualitative spectrum remains similar: positive and negative epistasis dominate, while sign epistasis is rare (Supplementary Figs. S6–S7, S9–S10).

      Fitness effect vs. epistasis change: We tested whether the mean fitness effect of a mutation correlates with the percent of cases in which it changes the nature of epistasis. No correlation was found (R² ≈ 0.11), and this analysis is now included in the revised manuscript.

      Epistasis-modulating ability: Non-synonymous mutations more frequently alter the interactions between other mutations than synonymous substitutions. Within synonymous substitutions, the subset with measurable fitness effects disproportionately contributes to epistasis modulation. Thus, the ability of synonymous substitutions to modulate epistasis arises primarily from the non-neutral subset.

      These analyses clarify the role of synonymous mutations in reshaping epistasis on the folA landscape.

      Revision of statistical treatment of epistasis:

      In our original submission, we used an arbitrary threshold of 0.05 to classify the presence or absence of epistasis, following Papkou et al., who based conclusions on a single experimental replicate. However, as the reviewer correctly noted, this does not adequately account for measurement variability across different genotypes.

      In the revised manuscript, we adopt a statistically rigorous framework that incorporates replicate-based error directly. Specifically, we now use the mean fitness across six independent replicates, together with the corresponding standard deviation, to classify fitness peaks and epistasis. This eliminates arbitrary thresholds and ensures that epistatic classifications reflect the precision of measurements for each genotype.

      This revision led to both quantitative and qualitative changes:

      For high-fitness genotypes, the core patterns of higher-order (“fluid”) epistasis remain robust (Figures 2–3).

      For low-fitness genotypes, incorporating replicate-based error removed spurious fluidity effects, yielding a more accurate characterization of epistasis (Figures 2–3; Supplementary Figs. S6–S7, S9–S10).

      We describe these methodological changes in detail in the revised Methods section and provide updated code.

      Together, these revisions directly address the reviewer’s concerns. They improve the statistical rigor of our analysis, strengthen the robustness of our conclusions, and underscore the importance of accounting for measurement error in large-scale fitness landscape studies—a point we now emphasize in the manuscript.

      Reviewer #2 (Public review): 

      Significance: 

      This paper reanalyzes an experimental fitness landscape generated by Papkou et al., who assayed the fitness of all possible combinations of 4 nucleotide states at 9 sites in the E. coli DHFR gene, which confers antibiotic resistance. The 9 nucleotide sites make up 3 amino acid sites in the protein, of which one was shown to be the primary determinant of fitness by Papkou et al. This paper sought to assess whether pairwise epistatic interactions differ among genetic backgrounds at other sites and whether there are major patterns in any such differences. They use a "double mutant cycle" approach to quantify pairwise epistasis, where the epistatic interaction between two mutations is the difference between the measured fitness of the double-mutant and its predicted fitness in the absence of epistasis (which equals the sum of individual effects of each mutation observed in the single mutants relative to the reference genotype). The paper claims that epistasis is "fluid," because pairwise epistatic effects often differs depending on the genetic state at the other site. It also claims that this fluidity is "binary," because pairwise effects depend strongly on the state at nucleotide positions 5 and 6 but weakly on those at other sites. Finally, they compare the distribution of fitness effects (DFE) of single mutations for starting genotypes with similar fitness and find that despite the apparent "fluidity" of interactions this distribution is well-predicted by the fitness of the starting genotype. 

      The paper addresses an important question for genetics and evolution: how complex and unpredictable are the effects and interactions among mutations in a protein? Epistasis can make the phenotype hard to predict from the genotype and also affect the evolutionary navigability of a genotype landscape. Whether pairwise epistatic interactions depend on genetic background - that is, whether there are important high-order interactions -- is important because interactions of order greater than pairwise would make phenotypes especially idiosyncratic and difficult to predict from the genotype (or by extrapolating from experimentally measured phenotypes of genotypes randomly sampled from the huge space of possible genotypes). Another interesting question is the sparsity of such high-order interactions: if they exist but mostly depend on a small number of identifiable sequence sites in the background, then this would drastically reduce the complexity and idiosyncrasy relative to a landscape on which "fluidity" involves interactions among groups of all sites in the protein. A number of papers in the recent literature have addressed the topics of high-order epistasis and sparsity and have come to conflicting conclusions. This paper contributes to that body of literature with a case study of one published experimental dataset of high quality. The findings are therefore potentially significant if convincingly supported. 

      Validity: 

      In my judgment, the major conclusions of this paper are not well supported by the data. There are three major problems with the analysis. 

      (1) Lack of statistical tests. The authors conclude that pairwise interactions differ among backgrounds, but no statistical analysis is provided to establish that the observed differences are statistically significant, rather than being attributable to error and noise in the assay measurements. It has been established previously that the methods the authors use to estimate high-order interactions can result in inflated inferences of epistasis because of the propagation of measurement noise (see PMID 31527666 and 39261454). Error propagation can be extreme because first-order mutation effects are calculated as the difference between the measured phenotype of a single-mutant variant and the reference genotype; pairwise effects are then calculated as the difference between the measured phenotype of a double mutant and the sum of the differences described above for the single mutants. This paper claims fluidity when this latter difference itself differs when assessed in two different backgrounds. At each step of these calculations, measurement noise propagates. Because no statistical analysis is provided to evaluate whether these observed differences are greater than expected because of propagated error, the paper has not convincingly established or quantified "fluidity" in epistatic effects. 

      (2) Arbitrary cutoffs. Many of the analyses involve assigning pairwise interactions into discrete categories, based on the magnitude and direction of the difference between the predicted and observed phenotypes for a pairwise mutant. For example, the authors categorize as a positive pairwise interaction if the apparent deviation of phenotype from prediction is >0.05, negative if the deviation is <-0.05, and no interaction if the deviation is between these cutoffs. Fluidity is diagnosed when the category for a pairwise interaction differs among backgrounds. These cutoffs are essentially arbitrary, and the effects are assigned to categories without assessing statistical significance. For example, an interaction of 0.06 in one background and 0.04 in another would be classified as fluid, but it is very plausible that such a difference would arise due to error alone. The frequency of epistatic interactions in each category as claimed in the paper, as well as the extent of fluidity across backgrounds, could therefore be systematically overestimated or underestimated, affecting the major conclusions of the study. 

      (3) Global nonlinearities. The analyses do not consider the fact that apparent fluidity could be attributable to the fact that fitness measurements are bounded by a minimum (the fitness of cells carrying proteins in which DHFR is essentially nonfunctional) and a maximum (the fitness of cells in which some biological factor other than DHFR function is limiting for fitness). The data are clearly bounded; the original Papkou et al. paper states that 93% of genotypes are at the low-fitness limit at which deleterious effects no longer influence fitness. Because of this bounding, mutations that are strongly deleterious to DHFR function will therefore have an apparently smaller effect when introduced in combination with other deleterious mutations, leading to apparent epistatic interactions; moreover, these apparent interactions will have different magnitudes if they are introduced into backgrounds that themselves differ in DHFR function/fitness, leading to apparent "fluidity" of these interactions. This is a well-established issue in the literature (see PMIDs 30037990, 28100592, 39261454). It is therefore important to adjust for these global nonlinearities before assessing interactions, but the authors have not done this. 

      This global nonlinearity could explain much of the fluidity claimed in this paper. It could explain the observation that epistasis does not seem to depend as much on genetic background for low-fitness backgrounds, and the latter is constant (Figure 2B and 2C): these patterns would arise simply because the effects of deleterious mutations are all epistatically masked in backgrounds that are already near the fitness minimum. It would also explain the observations in Figure 7. For background genotypes with relatively high fitness, there are two distinct peaks of fitness effects, which likely correspond to neutral mutations and deleterious mutations that bring fitness to the lower bound of measurement; as the fitness of the background declines, the deleterious mutations have a smaller effect, so the two peaks draw closer to each other, and in the lowest-fitness backgrounds, they collapse into a single unimodal distribution in which all mutations are approximately neutral (with the distribution reflecting only noise). Global nonlinearity could also explain the apparent "binary" nature of epistasis. Sites 4 and 5 change the second amino acid, and the Papkou paper shows that only 3 amino acid states (C, D, and E) are compatible with function; all others abolish function and yield lower-bound fitness, while mutations at other sites have much weaker effects. The apparent binary nature of epistasis in Figure 5 corresponds to these effects given the nonlinearity of the fitness assay. Most mutations are close to neutral irrespective of the fitness of the background into which they are introduced: these are the "non-epistatic" mutations in the binary scheme. For the mutations at sites 4 and 5 that abolish one of the beneficial mutations, however, these have a strong background-dependence: they are very deleterious when introduced into a high-fitness background but their impact shrinks as they are introduced into backgrounds with progressively lower fitness. The apparent "binary" nature of global epistasis is likely to be a simple artifact of bounding and the bimodal distribution of functional effects: neutral mutations are insensitive to background, while the magnitude of the fitness effect of deleterious mutations declines with background fitness because they are masked by the lower bound. The authors' statement is that "global epistasis often does not hold." This is not established. A more plausible conclusion is that global epistasis imposed by the phenotype limits affects all mutations, but it does so in a nonlinear fashion. 

      In conclusion, most of the major claims in the paper could be artifactual. Much of the claimed pairwise epistasis could be caused by measurement noise, the use of arbitrary cutoffs, and the lack of adjustment for global nonlinearity. Much of the fluidity or higher-order epistasis could be attributable to the same issues. And the apparently binary nature of global epistasis is also the expected result of this nonlinearity. 

      We thank the reviewer for raising this important concern. We fully agree that the use of arbitrary thresholds in the earlier version of the manuscript, together with the lack of an explicit treatment of measurement error, could compromise the rigor of our conclusions. To address this, we have undertaken a thorough re-analysis of the folA landscape.

      (1)  Incorporating measurement error and avoiding noise-driven artifacts

      In the original version, we followed Papkou et al. in using a single experimental replicate and applying fixed thresholds to classify epistasis. As the reviewer correctly notes, this approach allows noise to propagate from single-mutant measurements to double-mutant effects, and ultimately to higher-order epistasis.

      In the revised analysis, we now:

      Use the mean fitness across all six independent replicates for each genotype.

      Incorporate the corresponding standard deviation as a measure of experimental error.

      Classify epistatic interactions only when differences between a genotype and its neighbors exceed combined error margins, rather than using a fixed cutoff.

      This ensures that observed changes in epistasis are statistically distinguishable from noise. Details are provided in the revised Methods section and updated code.

      (2) Replacing arbitrary thresholds with error-based criteria

      Previously, we used an arbitrary ±0.05 cutoff to define the presence/absence of epistasis. As the reviewer notes, this could misclassify interactions (e.g. labeling an effect as “fluid” when the difference lies within error). In the revised framework, these thresholds have been eliminated. Instead, interactions are classified based on whether their distributions overlap within replicate variance.

      This approach scales naturally with measurement precision, which differs between high-fitness and low-fitness genotypes, and removes the need for a universal cutoff.

      (3) Consequences of re-analysis

      Implementing this revised framework produced several important updates:

      High-fitness backgrounds: The qualitative picture of higher-order (“fluid”) epistasis remains robust. The patterns reported originally are preserved.

      Low-fitness backgrounds: Accounting for replicate variance revealed that part of the previously inferred “fluidity” arose from noise. These spurious effects are now removed, giving a more conservative but more accurate view of epistasis in non-functional regions.

      Fitness peaks: Our replicate-aware analysis identifies 127 peaks, compared to 514 in Papkou et al. Importantly, all 127 peaks occur in functional regions of the landscape. This difference highlights the importance of replicate-based error treatment: relying on a single run without demonstrating repeatability can yield artifacts.

      (4) Addressing bounding effects and terminology

      We also agree with the reviewer that bounding effects, arising from the biological limits of fitness, can create apparent nonlinearities in the genotype–phenotype map. To clarify this, we made the following changes:

      Terminology: We now use the term higher-order epistasis instead of fluid epistasis, emphasizing that the observed background-dependence involves more than two mutations and cannot be explained by global nonlinearities alone.

      We also clarify the definitions of sign-epistasis used in this work.

      By replacing arbitrary cutoffs with replicate-based error estimates and by explicitly considering bounding effects, we have substantially increased the rigor of our analysis. While this reanalysis led to both quantitative and qualitative changes in some regions, the central conclusion remains unchanged: higher-order epistasis is pervasive in the folA landscape, especially in functional backgrounds.

      All analysis scripts and codes are provided as Supplementary Material.

      Reviewer #3 (Public review): 

      Summary: 

      The authors have studied a previously published large dataset on the fitness landscape of a 9 base-pair region of the folA gene. The objective of the paper is to understand various aspects of epistasis in this system, which the authors have achieved through detailed and computationally expensive exploration of the landscape. The authors describe epistasis in this system as "fluid", meaning that it depends sensitively on the genetic background, thereby reducing the predictability of evolution at the genetic level. However, the study also finds two robust patterns. The first is the existence of a "pivot point" for a majority of mutations, which is a fixed growth rate at which the effect of mutations switches from beneficial to deleterious (consistent with a previous study on the topic). The second is the observation that the distribution of fitness effects (DFE) of mutations is predicted quite well by the fitness of the genotype, especially for high-fitness genotypes. While the work does not offer a synthesis of the multitude of reported results, the information provided here raises interesting questions for future studies in this field. 

      Strengths: 

      A major strength of the study is its detailed and multifaceted approach, which has helped the authors tease out a number of interesting epistatic properties. The study makes a timely contribution by focusing on topical issues like the prevalence of global epistasis, the existence of pivot points, and the dependence of DFE on the background genotype and its fitness. The methodology is presented in a largely transparent manner, which makes it easy to interpret and evaluate the results. 

      The authors have classified pairwise epistasis into six types and found that the type of epistasis changes depending on background mutations. Switches happen more frequently for mutations at functionally important sites. Interestingly, the authors find that even synonymous mutations in stop codons can alter the epistatic interaction between mutations in other codons. Consistent with these observations of "fluidity", the study reports limited instances of global epistasis (which predicts a simple linear relationship between the size of a mutational effect and the fitness of the genetic background in which it occurs). Overall, the work presents some evidence for the genetic context-dependent nature of epistasis in this system. 

      Weaknesses: 

      Despite the wealth of information provided by the study, there are some shortcomings of the paper which must be mentioned. 

      (1) In the Significance Statement, the authors say that the "fluid" nature of epistasis is a previously unknown property. This is not accurate. What the authors describe as "fluidity" is essentially the prevalence of certain forms of higher-order epistasis (i.e., epistasis beyond pairwise mutational interactions). The existence of higher-order epistasis is a well-known feature of many landscapes. For example, in an early work, (Szendro et. al., J. Stat. Mech., 2013), the presence of a significant degree of higher-order epistasis was reported for a number of empirical fitness landscapes. Likewise, (Weinreich et. al., Curr. Opin. Genet. Dev., 2013) analysed several fitness landscapes and found that higher-order epistatic terms were on average larger than the pairwise term in nearly all cases. They further showed that ignoring higher-order epistasis leads to a significant overestimate of accessible evolutionary paths. The literature on higher-order epistasis has grown substantially since these early works. Any future versions of the present preprint will benefit from a more thorough contextual discussion of the literature on higher-order epistasis.

      (2) In the paper, the term 'sign epistasis' is used in a way that is different from its wellestablished meaning. (Pairwise) sign epistasis, in its standard usage, is said to occur when the effect of a mutation switches from beneficial to deleterious (or vice versa) when a mutation occurs at a different locus. The authors require a stronger condition, namely that the sum of the individual effects of two mutations should have the opposite sign from their joint effect. This is a sufficient condition for sign epistasis, but not a necessary one. The property studied by the authors is important in its own right, but it is not equivalent to sign epistasis. 

      (3) The authors have looked for global epistasis in all 108 (9x12) mutations, out of which only 16 showed a correlation of R^2 > 0.4. 14 out of these 16 mutations were in the functionally important nucleotide positions. Based on this, the authors conclude that global epistasis is rare in this landscape, and further, that mutations in this landscape can be classified into one of two binary states - those that exhibit global epistasis (a small minority) and those that do not (the majority). I suspect, however, that a biologically significant binary classification based on these data may be premature. Unsurprisingly, mutational effects are stronger at the functional sites as seen in Figure 5 and Figure 2, which means that even if global epistasis is present for all mutations, a statistical signal will be more easily detected for the functionally important sites. Indeed, the authors show that the means of DFEs decrease linearly with background fitness, which hints at the possibility that a weak global epistatic effect may be present (though hard to detect) in the individual mutations. Given the high importance of the phenomenon of global epistasis, it pays to be cautious in interpreting these results. 

      (4) The study reports that synonymous mutations frequently change the nature of epistasis between mutations in other codons. However, it is unclear whether this should be surprising, because, as the authors have already noted, synonymous mutations can have an impact on cellular functions. The reader may wonder if the synonymous mutations that cause changes in epistatic interactions in a certain background also tend to be non-neutral in that background. Unfortunately, the fitness effect of synonymous mutations has not been reported in the paper. 

      (5) The authors find that DFEs of high-fitness genotypes tend to depend only on fitness and not on genetic composition. This is an intriguing observation, but unfortunately, the authors do not provide any possible explanation or connect it to theoretical literature. I am reminded of work by (Agarwala and Fisher, Theor. Popul. Biol., 2019) as well as (Reddy and Desai, eLife, 2023) where conditions under which the DFE depends only on the fitness have been derived. Any discussion of possible connections to these works could be a useful addition.  

      We thank the reviewer for the summary of our work and for highlighting both its strengths and areas for improvement. We have carefully considered the points raised and revised the manuscript accordingly. The revised version:

      (1) Clarifies the conceptual framework. We emphasize the distinction between background-dependent, higher-order epistasis and global nonlinearities. To avoid ambiguity, we have replaced the term “fluid” epistasis with higher-order epistasis throughout, in line with prior literature (e.g. Szendro et al., 2013; Weinreich et al., 2013). We now explicitly situate our results in the context of these studies and clarify our definitions of epistasis, correcting the earlier error where “strong sign epistasis” was used in place of “sign epistasis.”

      (2) Improves statistical rigor. We now incorporate replicate variance and statistical error criteria in place of arbitrary thresholds. This ensures that classification of epistasis reflects experimental precision rather than fixed, arbitrary cutoffs.

      (3) Expands treatment of synonymous mutations. We now explicitly analyze synonymous mutations, separating those that are neutral from those that are non-neutral. Our results show that non-neutral synonymous mutations are disproportionately responsible for altering epistatic interactions, while neutral synonymous mutations rarely do so. We also report the fitness effects of synonymous mutations directly and include new analyses showing that there is no correlation between the mean fitness effect of a synonymous mutation and the frequency with which it alters epistasis (Supplementary Fig. S11).

      These revisions strengthen both the rigor and the clarity of the manuscript. We hope they address the reviewer’s concerns and make the significance of our findings, particularly the siteresolved quantification of higher-order epistasis in the folA landscape, including in synonymous mutations, more apparent.

      Reviewing Editor Comments: 

      Key revision suggestions: 

      (1) Please quantify the impact of measurement noise on your conclusions, and perform statistical analysis to determine whether the observed differences of epistasis due to different backgrounds are statistically significant. 

      (2) Please investigate how your conclusions depend on the cutoffs, and consider choosing them based on statistical criteria. 

      (3) Please reconsider the possible role of global epistasis. In particular, the effect of bounds on fitness values. All reviewers are concerned that all claims, including about global epistasis, may be consistent with a simple null model where most low fitness genotypes are non-functional and variation in their fitness is simply driven by measurement noise. Please provide a convincing argument rejecting this model. 

      More generally, we recommend that you consider all suggestions by reviewers, including those about results, but also those about terminology and citing relevant works. 

      Thank you for your guidance. We have substantially revised the manuscript to incorporate the reviewers’ suggestions. In addition to addressing the three central issues raised, we have refined terminology, expanded the discussion of prior work, and clarified the presentation of our main results. We believe these changes significantly strengthen both the rigor and the impact of the study. We are grateful to the Reviewing Editor and reviewers for their constructive feedback.

      In the revised manuscript, we address the three major points as follows:

      (1) Quantifying measurement noise and statistical significance. We now use the average of six independent experimental runs for each genotype, together with the corresponding standard deviations, to explicitly quantify measurement uncertainty. Pairwise and higher-order epistasis are assessed relative to these error estimates, rather than against fixed thresholds. This ensures that differences across genetic backgrounds are statistically distinguishable from noise.

      (2) Replacing arbitrary cutoffs with statistical criteria. We have eliminated the use of arbitrary thresholds. Instead, classification of interactions (positive, negative, or neutral epistasis) is based on whether fitness differences exceed replicate variance. This approach scales naturally with measurement precision. While some results change quantitatively for high-fitness backgrounds and qualitatively for low-fitness backgrounds, our central conclusions remain robust.

      (3) Analysis of synonymous mutations. We now separately analyze synonymous mutations to test their role in altering epistasis. Our results show that there is no correlation between the average fitness effect of a synonymous mutation and the frequency with which it changes epistatic interactions.

      We have revised terminology for clarity (replacing “fluid” with higher-order epistasis) and updated the Discussion to place our work in the broader context of the literature on higher-order epistasis.

      Finally, we have rewritten the entire manuscript to improve clarity, refine the narrative flow, and ensure that the presentation more crisply reflects the subject of the study

      Reviewer #1 (Recommendations for the authors): 

      MINOR COMMENTS 

      (1) Lines 102-107. Papkou's definition of non-functional genotypes makes sense since it is based on the fact that some genotypes are statistically indistinguishable in terms of fitness from mutants with premature stop codons in folA. It doesn't really matter whether to call them low fitness or non-functional, but it would be helpful to explain the basis for this distinction. 

      Thank you for raising this point. To maintain consistency with the original dataset and analysis, we retain Papkou et al.’s nomenclature and refer to these genotypes as “functional” or “non-functional.” 

      (2) Lines 111-112. I think the authors need to briefly explain here how they define the absence of epistasis. They do so in the Methods, but this information is essential and needs to be conveyed to the reader in the Results as well. 

      Thank you for the suggestion. We agree that this definition is essential for readers to follow the Results. In the revised manuscript, we have added a brief explanation at the start of the Results section clarifying how we define the absence of epistasis. Specifically, we now state that two mutations are considered non-epistatic when the observed fitness of the double mutant is statistically indistinguishable (within error of six replicates) from the additive expectation based on the single mutants. This ensures that the Results section is selfcontained, while full details remain in the Methods.

      (3) Lines 142 and elsewhere. The authors introduce the qualifier "fluid" to describe the fact that the value or sign of pairwise epistasis changes across genetic backgrounds. I don't see a need for this new terminology, since it is already captured adequately by the term "higher-order epistasis". The epistasis field is already rife with jargon, and I would prefer if new terms were introduced only when absolutely necessary. 

      Thank you for this helpful suggestion. We agree that introducing new terminology is unnecessary here. In the revised manuscript, we have replaced the term “fluid” epistasis with “higher-order epistasis” throughout, to align with established usage and avoid adding jargon.

      (4) Figure 6. I don't think this is the best way of showing that the pivot points are clustered. A histogram would be more appropriate and would take less space. However it would allow the authors to display a null distribution to demonstrate that this clustering is indeed surprising. 

      (5) Lines 320-321. Mann-Whitney U tests whether one distribution is systematically shifted up or down relative to the other. Please change the language here. It looks like the authors also performed the Kolmogorov-Smirnoff test, which is appropriate, but it doesn't look like the results are reported anywhere. Please report. 

      (6) Lines 330-334. The fact that HF genotypes seem to have more similar DFEs than LF genotypes is somewhat counterintuitive. Could this be an artifact of the fact that any two random HF genotypes are more similar to each other than any two randomly sampled LF genotypes? 

      (7) Lines 427. The sentence "The set of these selected variants are assigned their one hamming distance neighbours to construct a new 𝑛-base sequence space" is confusing. I think it is pretty clear how to construct a n-base sequence space, and this sentence adds more confusion than it removes. 

      Thank you for raising this point. To maintain consistency with the original dataset and analysis, we retain Papkou et al.’s nomenclature and refer to these genotypes as “functional” or “non-functional.” 

      We now start the results section of the manuscript with a brief description of how each type of epistasis is defined. Specifically, we now state that two mutations are considered non-epistatic when the observed fitness of the double mutant is statistically indistinguishable (within the error of six replicates) from the additive expectation based on the single mutants. This ensures that the Results section is self-contained, while full details remain in the Methods.

      We also agree that introducing new terminology is unnecessary. In the revised manuscript, we have replaced the term “fluid” epistasis with “higher-order epistasis” throughout, to align with established usage and avoid adding jargon. Finally, we concur that the identified sentence was unnecessary and potentially confusing; it has been removed from the revised manuscript to improve clarity. In fact, we have rewritten the entire manuscript for better flow and readability. 

      Reviewer #2 (Recommendations for the authors): 

      (1) Supplementary Figure S2A and S3 seem to be the same. 

      (3) The classification scheme for reciprocal sign/single sign/other sign epistasis differs from convention and should be made more explicit or renamed. 

      (4) Re the claim that high and low fitness backgrounds have different frequencies of the various types of epistasis: 

      Are the frequency distributions of the different types of epistasis statistically different between high and low fitness backgrounds statistically significant? It seems that they follow similar general patterns, and the sample size is much smaller for high fitness backgrounds so more variance in their distributions is expected. 

      Do bounding of fitness measurements play a role in generating the differences in types of epistasis seen in high vs. low-fitness backgrounds? If many variants are at the lower bound of the fitness assay, then positive epistasis might simply be less detectable for these backgrounds (which seems to be the biggest difference between high/low fitness backgrounds). 

      (5) In Figure 4B, points are not independent, because the mutation effects are calculated for all mutations in all backgrounds, rather than with reference to a single background or fluorescence value. The same mutations are therefore counted many times. 

      (6) It is not clear how the "pivot growth rate" was calculated or what the importance of this metric is. 

      (7) In the introduction, the justification for reanalyzing the Papkou et al dataset in particular is not clear. 

      (8) Epistasis at the nucleotide level is expected because of the genetic code: fitness and function are primarily affected by amino acid changes, and nucleotide mutations will affect amino acids depending on the state at other nucleotide sites in the same codon. For the most part, this is not explicitly taken account of in the paper. I recommend separating apparent epistasis due to the genetic code from that attributable to dependence among codons. 

      Thank you for noting this. Figure S2A shows results for high-fitness peaks only, whereas Figure S3 shows results for all peaks across the landscape. We have now made this distinction explicit in the figure legends and main text of the revised manuscript. 

      In the revised analysis, peaks are defined using the average fitness across six experimental replicates along with the corresponding standard deviation. Each genotype is compared with all single-step neighbors, and it is classified as a peak only if its mean fitness is significantly higher than all neighbors (p < 0.05). This procedure explicitly accounts for measurement error and replaces the arbitrary thresholding used previously. Full details are now described in the Methods.

      To avoid confusion, we now state our definitions explicitly at the start of the analysis. We have now corrected our definition in the text. We define sign epistasis as a one where at least one mutation switches from being beneficial to deleterious. 

      We have clarified our motivation in the Introduction. The Papkou et al. dataset is the most comprehensive experimental map of a complete 9-bp region of folA and provides six independent replicates, making it uniquely suited for testing hypotheses about backgrounddependent epistasis. Importantly, Papkou et al. based their conclusions on a single run, whereas our reanalysis incorporates replicate means and variances, leading to substantive differences—for example, a reduction in reported peaks from 514 to 127. By recalibrating the analysis, we provide a more rigorous account of this landscape and highlight how methodological choices affect conclusions.

      We also agree that some nucleotide-level epistasis reflects the structure of the genetic code (i.e., codon degeneracy and context-dependence of amino acid substitutions). In the revised manuscript, we explicitly separate epistasis attributable to codon structure from epistasis arising among codons. For example, synonymous mutations that alter epistasis within codons are treated separately from those affecting interactions across codons, and this distinction is now clearly indicated in the Results.

      Reviewer #3 (Recommendations for the authors): 

      (1) The analysis of peak density and accessibility in the paragraph starting on line 96 seems a bit out of context. Its connection with the various forms of epistasis treated in the rest of the paper is unclear. 

      (2) As mentioned in the Public Review, the term 'sign epistasis' has been used in a non-standard way. My suggestion would be to use a different term. Even a slightly modified term, such as "strong sign epistasis", should help to avoid any confusion. 

      (3)  mentioned in the public review that it is not clear whether the synonymous mutations that change the type of epistasis also tend to be non-neutral. This issue could be addressed by computing, for example, the fitness effects of all synonymous mutations for backgrounds and mutation pairs where a switch in epistasis occurs, and comparing it with fitness effects where no such switch occurs. 

      (4) Do the authors have any proposal for why synonymous mutations seem to cause more frequent changes in epistasis in low-fitness backgrounds? Related to this, is there any systematic difference between the types of switch caused by synonymous mutations in the low- versus high-fitness backgrounds? 

      (5) It is unclear exactly how the pivot points were determined, especially since the data for many mutations is noisy. The protocol should be provided in the Methods section. 

      (6) Line 303: possible typo, "accurate" --> "inaccurate". 

      (7) The value of Delta used for the "phenotypic DFE" has not been mentioned in the main text (including Methods).

      We agree that the connection needed to be clearer. In the revised manuscript, we (i) relocate and retitle this material as a brief “Landscape overview” preceding the epistasis analyses, (ii) explicitly link multi-peakedness and path accessibility to epistasis (e.g., multi-peak structure implies the presence of sign/reciprocal-sign epistasis; accessibility is shaped by background-dependent effects), and (iii) move derivations to the Supplement. We also recomputed peak density and accessibility using replicate-averaged fitness with replicate SDs, so the overview and downstream epistasis sections now use a single, error-aware landscape (updated in Figs. 1–3, with cross-references in the text).

      We have aligned our terminology and now state definitions upfront. 

      After replacing fixed cutoffs with replicate-based error criteria, switches are more frequent in high-fitness backgrounds (Fig. 3). Mechanistically, near the lower fitness bound, deleterious effects are masked (global nonlinearity), reducing apparent switching. Functional/high-fitness backgrounds allow both beneficial and deleterious outcomes, so background-dependent (higher-order) interactions manifest more readily. Switch types also vary by background fitness: high-fitness backgrounds show more sign/strong-sign switches, whereas low-fitness backgrounds show mostly magnitude reclassifications (Fig. 3C; Supplement Fig. Sx).

      Finally, we corrected a typo by replacing “accurate” with “inaccurate” and now define Δ (equal to 0.05) in the main text (in Results and Figure 8 caption).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Dendrotweaks provides its users with a solid tool to implement, visualize, tune, validate, understand, and reduce single-neuron models that incorporate complex dendritic arbors with differential distribution of biophysical mechanisms. The visualization of dendritic segments and biophysical mechanisms therein provide users with an intuitive way to understand and appreciate dendritic physiology.

      Strengths:

      (1) The visualization tools are simplified, elegant, and intuitive.

      (2) The ability to build single-neuron models using simple and intuitive interfaces.

      (3) The ability to validate models with different measurements.

      (4) The ability to systematically and progressively reduce morphologically-realistic neuronal models.

      Weaknesses:

      (1) Inability to account for neuron-to-neuron variability in structural, biophysical, and physiological properties in the model-building and validation processes.

      We agree with the reviewer that it is important to account for neuron-to-neuron variability. The core approach of DendroTweaks, and its strongest aspect, is the interactive exploration of how morpho-electric parameters affect neuronal activity. In light of this, variability can be achieved through the interactive updating of the model parameters with widgets. In a sense, by adjusting a widget (e.g., channel distribution or kinetics), a user ends up with a new instance of a cell in the parameter space and receives almost real-time feedback on how this change affected neuronal activity. This approach is much simpler than implementing complex optimization protocols for different parameter sets, which would detract from the interactivity aspect of the GUI. In its revised version, DendroTweaks also accounts for neuron-to-neuron morphological variability, as channel distributions are now based on morphological domains (rather than the previous segment-specific approach). This makes it possible to apply the same biophysical configuration across various morphologies. Overall, both biophysical and morphological variability can be explored within DendroTweaks. 

      (2) Inability to account for the many-to-many mapping between ion channels and physiological outcomes. Reliance on hand-tuning provides a single biased model that does not respect pronounced neuron-to-neuron variability observed in electrophysiological measurements.

      We acknowledge the challenge of accounting for degeneracy in the relation between ion channels and physiological outcomes and the importance of capturing neuron-to-neuron variability. One possible way to address this, as we mention in the Discussion, is to integrate automated parameter optimization algorithms alongside the existing interactive hand-tuning with widgets. In its revised version, DendroTweaks can integrate with Jaxley (Deistler et al., 2024) in addition to NEURON. The models created in DendroTweaks can now be run with Jaxley (although not all types of models, see the limitations in the Discussion), and their parameters can be optimized via automated and fast gradient-based parameter optimization, including optimization of heterogeneous channel distributions. In particular, a key advantage of integrating Jaxley with DendroTweaks was its NMODL-to-Python converter, which significantly reduced the need to manually re-implement existing ion channel models for Jaxley (see here: https://dendrotweaks.readthedocs.io/en/latest/tutorials/convert_to_jaxley.html).

      (1) Michael Deistler, Kyra L. Kadhim, Matthijs Pals, Jonas Beck, Ziwei Huang, Manuel Gloeckler, Janne K. Lappalainen, Cornelius Schröder, Philipp Berens, Pedro J. Gonçalves, Jakob H. Macke Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics bioRxiv 2024.08.21.608979; doi:https://doi.org/10.1101/2024.08.21.608979

      Lack of a demonstration on how to connect reduced models into a network within the toolbox.

      Building a network of reduced models is an exciting direction, yet beyond the scope of this manuscript, whose primary goal is to introduce DendroTweaks and highlight its capabilities. DendroTweaks is designed for single-cell modeling, aiming to cover its various aspects in great detail. Of course, we expect refined single-cell models, both detailed and simplified, to be further integrated into networks. But this does not need to occur within DendroTweaks. We believe this network-building step is best handled by dedicated network simulation platforms. To facilitate the network-building process, we extended the exporting capabilities of DendroTweaks. To enable the export of reduced models in DendroTweaks’s modular format, as well as in plain simulator code, we implemented a method to fit the resulting parameter distributions to analytical functions (e.g., polynomials). This approach provided a compact representation, requiring a few coefficients to be stored in order to reproduce a distribution, independently of the original segmentation. The reduced morphologies can be exported as SWC files, standardized ion channel models as MOD files, and channel distributions as JSON files. Moreover, plain NEURON code (Python) to instantiate a cell class can be automatically generated for any model, including the reduced ones. Finally, to demonstrate how these exported models can be integrated into larger simulations, we implemented a "toy" network model in a Jupyter notebook included as an example in the GitHub repository. We believe that these changes greatly facilitate the integration of DendroTweaks-produced models into networks while also allowing users to run these networks on their favorite platforms.

      (4) Lack of a set of tutorials, which is common across many "Tools and Resources" papers, that would be helpful in users getting acquainted with the toolbox.

      This is an important point that we believe has been addressed fully in the revised version of the tool and manuscript. As previously mentioned, the lack of documentation was due to the software's early stage. We have now added comprehensive documentation, which is available at https://dendrotweaks.readthedocs.io. This extensive material includes API references, 12 tutorials, 4 interactive Jupyter notebooks, and a series of video tutorials, and it is regularly updated with new content. Moreover, the toolbox's GUI with example models is available through our online platform at https://dendrotweaks.dendrites.gr.  

      Reviewer #2 (Public review):

      The paper by Makarov et al. describes the software tool called DendroTweaks, intended for the examination of multi-compartmental biophysically detailed neuron models. It offers extensive capabilities for working with very complex distributed biophysical neuronal models and should be a useful addition to the growing ecosystem of tools for neuronal modeling.

      Strengths

      (1) This Python-based tool allows for visualization of a neuronal model's compartments.

      (2) The tool works with morphology reconstructions in the widely used .swc and .asc formats.

      (3) It can support many neuronal models using the NMODL language, which is widely used for neuronal modeling.

      (4) It permits one to plot the properties of linear and non-linear conductances in every compartment of a neuronal model, facilitating examination of the model's details.

      (5) DendroTweaks supports manipulation of the model parameters and morphological details, which is important for the exploration of the relations of the model composition and parameters with its electrophysiological activity.

      (6) The paper is very well written - everything is clear, and the capabilities of the tool are described and illustrated with great attention to detail.

      Weaknesses

      (1) Not a really big weakness, but it would be really helpful if the authors showed how the performance of their tool scales. This can be done for an increasing number of compartments - how long does it take to carry out typical procedures in DendroTweaks, on a given hardware, for a cell model with 100 compartments, 200, 300, and so on? This information will be quite useful to understand the applicability of the software.

      DendroTweaks functions as a layer on top of a simulator. As a result, its performance scales in the same way as for a given simulator. The GUI currently displays the time taken to run a simulation (e.g., in NEURON) at the bottom of the Simulation tab in the left menu. While Bokeh-related processing and rendering also consume time, this is not as straightforward to measure. It is worth noting, however, that this time is short and approximately equivalent to rendering the corresponding plots elsewhere (e.g., in a Jupyter notebook), and thus adds negligible overhead to the total simulation time. 

      (2) Let me also add here a few suggestions (not weaknesses, but something that can be useful, and if the authors can easily add some of these for publication, that would strongly increase the value of the paper).

      (3) It would be very helpful to add functionality to read major formats in the field, such as NeuroML and SONATA.

      We agree with the reviewer that support for major formats will substantially improve the toolbox, ensuring the reproducibility and reusability of the models. While integration with these formats has not been fully implemented, we have taken several steps to ensure elegant and reproducible model representation. Specifically, we have increased the modularity of model components and developed a custom compact data format tailored to single-cell modeling needs. We used a JSON representation inspired by the Allen Cell Types Database schema, modified to account for non-constant distributions of the model parameters. We have transitioned from a representation of parameter distributions dependent on specific segmentation graphs and sections to a more generalized domain-based distribution approach. In this revised methodology, segment groups are no longer explicitly defined by segment identifiers, but rather by specification of anatomical domains and conditional expressions (e.g., “select all segments in the apical domain with the maximum diameter < 0.8 µm”). Additionally, we have implemented the export of experimental protocols into CSV and JSON files, where the JSON files contain information about the stimuli (e.g., synaptic conductance, time constants), and the CSV files store locations of recording sites and stimuli. These features contribute toward a higher-level, structured representation of models, which we view as an important step toward eventual compatibility with standard formats such as NeuroML and SONATA. We have also initiated a two-way integration between DendroTweaks and SONATA. We developed a converter from DendroTweaks to SONATA that automatically generates SONATA files to reproduce models created in DendroTweaks. Additionally, support for the DendroTweaks JSON representation of biophysical properties will be added to the SONATA data format ecosystem, enabling models with complex dendritic distributions of channels. This integration is still in progress and will be included in the next version of DendroTweaks. While full integration with these formats is a goal for future releases, we believe the current enhancements to modularity and exportability represent a significant step forward, providing immediate value to the community.

      (4) Visualization is available as a static 2D projection of the cell's morphology. It would be nice to implement 3D interactive visualization.

      We offer an option to rotate a cell around the Y axis using a slider under the plot. This is a workaround, as implementing a true 3D visualization in Bokeh would require custom Bokeh elements, along with external JavaScript libraries. It's worth noting that there are already specialized tools available for 3D morphology visualization. In light of this, while a 3D approach is technically feasible, we advocate for a different method. The core idea of DendroTweaks’ morphology exploration is that each section is “clickable”, allowing its geometric properties to be examined in a 2D "Section" view. Furthermore, we believe the "Graph" view presents the overall cell topology and distribution of channels and synapses more clearly.

      (5) It is nice that DendroTweaks can modify the models, such as revising the radii of the morphological segments or ionic conductances. It would be really useful then to have the functionality for writing the resulting models into files for subsequent reuse.

      This functionality is fully available in local installations. Users can export JSON files with channel distributions and SWC files after morphology reduction through the GUI. Please note that for resource management purposes, file import/export is disabled on the public online demo. However, it can be enabled upon local installation by modifying the configuration file (app/default_config.json). In addition, it is now possible to generate plain NEURON (Python) code to reproduce a given model outside the toolbox (e.g., for network simulations). Moreover, it is now possible to export the simulation protocols as CSV files for locations of stimuli and recordings and JSON files for stimuli parameters.

      (6) If I didn't miss something, it seems that DendroTweaks supports the allocation of groups of synapses, where all synapses in a group receive the same type of Poisson spike train. It would be very useful to provide more flexibility. One option is to leverage the SONATA format, which has ample functionality for specifying such diverse inputs.

      Currently, each population of “virtual” neurons that form synapses on the detailed cell shares the same set of parameters for both biophysical properties of synapses (e.g., reversal potential, time constants) and presynaptic "population" activity (e.g., rate, onset). The parameter that controls an incoming Poisson spike train is the rate, which is indeed shared across all synapses in a population. Unfortunately, the current implementation lacks the capability to simulate complex synaptic inputs with heterogeneous parameters across individual synapses or those following non-uniform statistical distributions (the present implementation is limited to random uniform distributions). We have added this information in the Discussion (3. Discussion - 3.2 Limitations and future directions - ¶.5) to make users aware of the limitations. As it requires a substantial amount of additional work, we plan to address such limitations in future versions of the toolbox.

      (7) "Each session can be saved as a .json file and reuploaded when needed" - do these files contain the whole history of the session or the exact snapshot of what is visualized when the file is saved? If the latter, which variables are saved, and which are not? Please clarify.

      In the previous implementation, these files captured the exact snapshot of the model's latest state. In the new version, we adopted a modular approach where the biophysical configuration (e.g., channel distributions) and stimulation protocols are exported to separate files. This allows the user to easily load and switch the stimulation protocols for a given model. In addition, the distribution of parameters (e.g., channel conductances) is now based on the morphological domains and is agnostic of the exact morphology (i.e., sections and segments), which allows the same JSON files with biophysical configurations to be reused across multiple similar morphologies. This also allows for easy file exchange between the GUI and the standalone version.

      Joint recommendations to Authors:

      The reviewers agreed that the paper is well written and that DendroTweaks offers a useful collection of tools to explore models of single-cell biophysics. However, the tooling as provided with this submission has critical limitations in the capabilities, accessibility, and documentation that significantly limit the utility of DendroTweaks. While we recognize that it is under active development and features may have changed already, we can only evaluate the code and documentation available to us here.

      We thank the reviewers for their positive evaluation of the manuscript and express our sincere appreciation for their feedback. We acknowledge the limitations they have pointed out and have addressed most of these concerns in our revised version.

      In particular, we would emphasize:

      (1) While the features may be rich, the documentation for either a user of the graphical interface or the library is extremely sparse. A collection of specific tutorials walking a GUI user through simple and complex model examples would be vital for genuine uptake. As one category of the intended user is likely to be new to computational modeling, it would be particularly good if this documentation could also highlight known issues that can arise from the naive use of computational techniques. Similarly, the library aspect needs to be documented in a more standard manner, with docstrings, an API function list, and more didactic tutorials for standard use cases.

      DendroTweaks now features comprehensive documentation. The standalone Python library code is well-documented with thorough docstrings. The overall code modularity and readability have improved. The documentation is created using the widely adopted Sphinx generator, making it accessible for external contributors, and it is available via ReadTheDocs https://dendrotweaks.readthedocs.io/en/latest/index.html. The documentation provides a comprehensive set of tutorials (6 basic, 6 advanced) covering all key concepts and workflows offered by the toolbox. Interactive Jupyter notebooks are included in the documentation, along with the quick start guide. All example models also have corresponding notebooks that allow users to build the model from scratch.

      The toolbox has its own online platform, where a quick-start guide for the GUI is available https://dendrotweaks.dendrites.gr/guide.html. We have created video tutorials for the GUI covering the basic use cases. Additionally, we have added tips and instructions alongside widgets in the GUI, as well as a status panel that displays application status, warnings, and other information. Finally, we plan to familiarize the community with the toolbox by organizing online and in-person tutorials, as the one recently held at the CNS*2025 conference (https://cns2025florence.sched.com/event/25kVa/building-intuitive-and-efficient-biophysicalmodels-with-jaxley-and-dendrotweaks). Moreover, the toolbox was already successfully used for training young researchers during the Taiwan NeuroAI 2025 Summer School, founded by Ching-Lung Hsu. The feedback was very positive.

      (2) The paper describes both a GUI web app and a Python library. However, the code currently mixes these two in a way that largely makes sense for the web app but makes it very difficult to use the library aspect. Refactoring the code to separate apps and libraries would be important for anyone to use the library as well as allowing others to host their own DendroTweak servers. Please see the notes from the reviewing editor below for more details.

      The code in the previous `app/model` folder, responsible for the core functionality of the toolbox, has been extensively refactored and extended, and separated into a standalone library. The library is included in the Python package index (PyPI, https://pypi.org/project/dendrotweaks).

      Notes from the Reviewing Editor Comments (Recommendations for the authors):

      (1) While one could import morphologies and use a collection of ion channel models, details of synapse groups and stimulation approaches appeared to be only configurable manually in the GUI. The ability to save and load full neuron and simulation states would be extremely useful for reproducibility and sharing data with collaborators or as an interactive data product with a publication. There is a line in the text about saving states as json files (also mentioned by Reviewer #2), but I could see no such feature in the version currently online.

      We decided to reserve the online version for demonstration and educational purposes, with more example models being added over time. However, this functionality is available upon local installation of the app (and after specifying it in the ‘default_config.json’ in the root directory of the app). We’ve adopted a modular model representation to store separately morphology, channel models, biophysical parameters, and stimulation protocols.

      (2) Relatedly, GUI exploration of complex data is often a precursor to a more automated simulation run. An easy mechanism to go from a user configuration to scripting would be useful to allow the early strength of GUIs to feed into the power of large-scale scripting.

      Any model could be easily exported to a modular DendroTweaks representation and later imported either in the GUI or in the standalone version programmatically. This ensures a seamless transition between the two use cases.

      (3) While the paper discusses DendroTweaks as both a GUI and a python library, the zip file of code in the submission is not in good form as a library. Back-end library code is intermingled with front-end web app code, which limits the ability to install the library from a standard python interface like PyPI. API documentation is also lacking. Functions tend to not have docstrings, and the few that do, do not follow typical patterns describing parameters and types.

      As stated above, all these issues have been resolved in the new version of the toolbox. The library code is now housed in a separate repository https://github.com/Poirazi-Lab/DendroTweaks and included in PyPI https://pypi.org/project/dendrotweaks. The classes and public methods follow Numpy-style docstrings, and the API reference is available in the documentation: https://dendrotweaks.readthedocs.io/en/latest/genindex.html.

      (4) Library installation is very difficult. The requirements are currently a lockfile, fully specifying exact versions of all dependencies. This is exactly correct for web app deployment to maintain consistency, but is not feasible in the context of libraries where you want to have minimal impact on a user's environment. Refactoring the library from the web app is critical for making DendroTweaks usable in both forms described in the paper.

      The lockfile makes installation more or less impossible on computer setups other than that of the author. Needless to say, this is not acceptable for a tool, and I would encourage the authors to ask other people to attempt to install their code as they describe in the text. For example, attempting to create a conda environment from the environment.yml file on an M1 MacBook Pro failed because it could not find several requirements. I was able to get it to install within a Linux docker image with the x86 platform specified, but this is not generally viable. To make this be the tool it is described as in text, this must be resolved. A common pattern that would work well here is to have a requirements lockfile and Docker image for the web app that imports a separate, more minimally restrictive library package with that could be hosted on PyPI or, less conveniently, through conda-forge.

      The installation of the standalone library is now straightforward via pip install dendrotweaks.On the Windows platform, however, manual installation of NEURON is required as described          in the official NEURON documentation https://nrn.readthedocs.io/en/8.2.6/install/install_instructions.html#windows.

      (5) As an aside, to improve potential uptake, the authors might consider an MIT-style license rather than the GNU Public License unless they feel strongly about the GPL. Many organizations are hesitant to build on GPL software because of the wide-ranging demands it places on software derived from or using GPL code.

      We thank the editor for this suggestion. We are considering changing the licence to MPL 2.0. It will maintain copyleft restrictions only on the package files while allowing end-users to freely choose their own license for any derived work, including the models, generated data files, and code that simply imports and uses our package.

      Reviewer #1 (Recommendations for the authors):

      (1) Abstract: Neurons rely on the interplay between dendritic morphology and ion channels to transform synaptic inputs into a sequence of somatic spikes. Technically, this would have to be morphology, ion channels, pumps, transporters, exchangers, buffers, calcium stores, and other molecules. For instance, if the calcium buffer concentration is large, then there would be less free calcium for activating the calcium-activated potassium channels. If there are different chloride co-transporters - NKCC vs. KCC - expressed in the neuron or different parts of the neuron, that would alter the chloride reversal for all the voltage- or ligand-gated chloride channels in the neuron. So, while morphology and ion channels are two important parts of the transformation, it would be incorrect to ignore the other components that contribute to the transformation. The statement might be revised to make these two components as two critical components.

      The phrase “Two critical components” was added as it was suggested by the reviewer.

      (2) Section 2.1 - The overall GUI looks intuitive and simple.

      (3) Section 2.2

      (a) The Graph view of morphology, especially accounting for the specific d_lambda is useful.

      (b) "Note that while microgeometry might not significantly affect the simulation at a low spatial resolution (small number of segments) due to averaging, it can introduce unexpected cell behavior at a higher level of spatial discretization."

      It might be good to warn the users that the compartmentalization and error analyses are with reference to the electrical lambda. If users have to account for calcium microdomains, these analyses wouldn't hold given the 2 orders of magnitude differences between the electrical and the calcium lambdas (e.g., Zador and Koch, J Neuroscience, 1994). Please sensitize users that the impact of active dendrites in regulating calcium microdomains and signaling is critical when it comes to plasticity models in morphologically realistic structures.

      We thank the reviewer for this important point. We have clarified in the text that our spatial discretization specifically refers to the electrical length constant. We acknowledge that electrical and chemical processes operate on fundamentally different spatial and temporal scales, which requires special consideration when modeling phenomena like synaptic plasticity. We have sensitized users about this distinction. However, we do not address such examples in the manuscript, thus leaving the detailed discussion of non-electrical compartmentalization beyond the scope of this work.

      (c) I am not very sure if the "smooth" tool for diameters that is illustrated is useful. Users shouldn't consider real variability in morphology as artifacts of reconstruction. As mentioned above, while this might not be an issue with electrical compartmentalization, calcium compartmentalization will severely be affected by small changes in morphology. Any model that incorporates calcium-gated channels should appropriately compartmentalize calcium. Without this, the spread of activation of calcium-dependent conductances would be an overestimate. Even small changes in cellular shape and curvature can have large impacts when it comes to signaling in terms of protein aggregation and clustering.

      Although this functionality is still available in the toolbox, we have removed the emphasis from it in the manuscript. Nevertheless, for the purpose of addressing the reviewer’s comment, we provide an example when this “smoothening” might be needed:please see Figure S1 from Tasciotti et al. 2025.

      (2) Simone Tasciotti, Daniel Maxim Iascone, Spyridon Chavlis, Luke Hammond, Yardena Katz, Attila Losonczy, Franck Polleux, Panayiota Poirazi. From Morphology to Computation: How Synaptic Organization Shapes Place Fields in CA1 Pyramidal Neurons bioRxiv 2025.05.30.657022; doi: https://doi.org/10.1101/2025.05.30.657022

      (4) Section 2.3

      (a) The graphical representation of channel gating kinetics is very useful.

      (b) Please warn the users that experimental measurements of channel gating kinetics are extremely variable. Taking the average of the sigmoids or the activation/deactivation/inactivation kinetics provides an illusion that each channel subtype in a given cell type has fixed values of V_1/2, k, delta, and tau, but it is really a range obtained from several experiments. The heterogeneity is real and reflects cell-to-cell variability in channel gating kinetics, not experimental artifacts. Please sensitize the readers that there is not a single value for these channel parameters.

      This is a fair comment, and it refers to a general problem in neuronal modeling. In DendroTweaks, we follow the approach widely used in the community that indeed doesn't account for heterogeneity. We added a paragraph in the revised manuscript's Discussion (3. Discussion - 3.3 Limitations and future directions - ¶.3) to address this issue.

      (5) Section 2.4

      (a) Same as above: Please sensitize users that the gradients in channel conductances are measured as an average of measurements from several different cells. This gradient need not be present in each neuron, as there could be variability in location-dependent measurements across cells. The average following a sigmoid doesn't necessarily mean that each neuron will have the channel distributed with that specific sigmoid (or even a sigmoid!) with the specific parametric values that the average reported. This is extremely important because there is an illusion that the gradient is fixed across cells and follows a fixed functional form.

      We added this information to our Discussion in the same paragraph mentioned above.

      (b) Please provide an example where the half-maximal voltage of a channel varies as a function of distance (such as Poolos et al., Nature Neuroscience, 2002 or Migliore et al., 1999; Colbert and Johnston, 1997). This might require a step-like function in some scenarios. An illustration would be appropriate because people tend to assume that channel gating kinetics are similar throughout the dendrite. Again, please mention that these shifts are gleaned from the average and don't really imply that each neuron must have that specific gradient, given neuron-to-neuron variability in these measurements.

      We thank the reviewer for the provided literature, which we now cite when describing parameter distributions (2. Results - 2.4 Distributing ion channels - ¶.1). Please note that DendroTweaks' programming interface and data format natively support non-linear distribution of kinetic parameters alongside the channel conductances. As for the step-like function, users can either directly apply the built-in step-like distribution function or create it by combining two constant distributions.

      (6) Section 2.5

      (a) It might be useful to provide a mechanism for implementing the normalization of unitary conductances at the cell body, (as in Magee and Cook, 2000; Andrasfalvy et al., J Neuroscience, 2001). Specifically, users should be able to compute AMPAR conductance values at each segment which would provide a somatic EPSP value of 0.2 mV.

      This functionality is indeed useful and will be added in future releases. Currently, it has been mentioned in the list of known limitations when working with synaptic inputs (3. Discussion - 3.3 Limitations and future directions - ¶.5).

      (b) Users could be sensitized about differences in decay time constants of GABA_A receptors that are associated with parvalbamin vs. somatostatin neurons. As these have been linked to slow and fast gamma oscillations and different somatodendritic locations along different cell types, this might be useful (e.g., 10.1016/j.neuron.2017.11.033;10.1523/jneurosci.0261-20.2020; 10.7554/eLife.95562.1; 10.3389/fncel.2023.1146278).

      We thank the reviewer for highlighting this important biological detail. DendroTweaks enables users to define model parameters specific to their cell type of interest. For practical reasons, we leave the selection of biologically relevant parameters to the users. However, we will consider adding an explicit example in our tutorials to showcase the toolbox's flexibility in this regard.

      (7) Section 2.6

      While reducing the morphological complexity has its advantages, users of this tool should be sensitized in this section about how the reduction does not capture all the complexity of the dendritic computation. For instance, the segregation/amplification properties of Polsky et al., 2004, Larkum et al., 2009 would not be captured by a fully reduced model. An example across different levels of reductions, implementing simulations in Figure 7F (but for synapses on the same vs. different branches), would be ideal. Demonstrate segregation/amplification in the full model for the same set of synapses - coming on the same branch/different branch (linear integration of synapses on different branches and nonlinear integration of synapses on the same branch). Then, show that with different levels of reduction, this segregation/amplification vanishes in the reduced model. In addition, while impedance-based approaches account for account for electrical computation, calcium-based computation is not something that is accountable with reduced models, given the small lambda_calcium values. Given the importance of calcium-activated conductances in electrical behaviour, this becomes extremely important to account for and sensitize users to. The lack of such sensitization results in presumptuous reductions that assume that all dendritic computation is accounted for by reduced models!

      We agree with the reviewer that reduction leads to a loss in the complexity of dendritic computation. This has been stated in both the original algorithm paper (Amsalem et al., 2020) and in our manuscript (e.g., 3. Discussion - 3.2 Comparison to existing modeling software - ¶.6). In fact, to address this problem, we extended the functionality of neuron_reduce to allow for multiple levels of morphology reduction. Our motivation for integrating morphology reduction in the toolbox was to leverage the exploratory power of DendroTweaks to assess how different degrees of reduction alter cell integrative properties, determining which computations are preserved, which are lost, and at what specific reduction level these changes occur. Nevertheless, to address this comment, we've made it more explicit in the Discussion that reduction inevitably alters integrative properties and, at a certain level, leads to loss of dendritic computations.

      (8) Section 2.7

      (a) The validation process has two implicit assumptions:

      (i) There is only one value of physiological measurements that neurons and dendrites are endowed with. The heterogeneity in these measurements even within the same cell type is ignored. The users should be allowed to validate each measurement over a range rather than a single value. Users should be sensitized about the heterogeneity of physiological measurements.

      (ii) The validation process is largely akin to hand-tuning models where a one-to-one mapping of channels to measurements is assumed. For instance, input resistance can be altered by passive properties, by Ih, and by any channel that is active under resting conditions. Firing rate and patterns can be changed by pretty much every single ion channel that expresses along the somatodendritic axis.

      An updated validation process that respects physiological heterogeneities in measurements and accounts for global dependencies would be more appropriate. Please update these to account for heterogeneities and many-to-many mappings between channels and measurements. An ideal implementation would be to incorporate randomized search procedures (across channel parameters spanning neuron-to-neuron variability in channel conductances/gating properties) to find a population of models that satisfy all physiological constraints (including neuron-to-neuron variability in each physiological measurement), rather than reliance on procedures that are akin to hand-tuning models. Such population-based approaches are now common across morphologically-realistic models for different cell types (e.g., Rathour and Narayanan, PNAS, 2014; Basak and Narayanan, J Physiology, 2018; Migliore et al., PLoS Computational Biology, 2018; Basak and Narayanan, Brain Structure and Function, 2020; Roy and Narayanan, Neural Networks, 2021; Roy and Narayanan, J Physiology, 2023; Arnaudon et al., iScience, 2023; Reva et al., Patterns, 2023; Kumari and Narayanan, J Neurophysiology, 2024) and do away with the biases introduced by hand-tuning as well as the assumption of one-to-one mapping between channels and measurements.

      We appreciate the reviewer’s comment and the suggested alternatives to our validation process. We have extended the discussion on these alternative approaches (3. Discussion - 2. Comparison to existing modeling software - ¶.5). However, it is important to note that neither one-value nor one-to-one mapping assumption is imposed in our approach. It is true that validation is performed on a given model instance with fixed single-value parameters. However, users can discover heterogeneity and degeneracy in their models via interactive exploration. In the GUI, a given parameter can be changed, and the influence of this change on model output can be observed in real time. Validation can be run after each change to see whether the model output still falls within a biologically plausible regime or not. This is, of course, time-consuming and less efficient than any automated parameter optimization.

      However, and importantly, this is the niche of DendroTweaks. The approach we provide here can indeed be referred to as model hand-tuning. This is intentional: we aim to complement black-box optimization by exposing the relationship between parameters and model outputs. DendroTweaks is not aimed at automated parameter optimization and is not meant to provide the user with parameter ranges automatically. The built-in validation in DendroTweaks is intended as a lightweight, fast feedback tool to guide manual tuning of dendritic model parameters so as to enhance intuitive understanding and assess the plausibility of outputs, not as a substitute for comprehensive model validation or optimization. The latter can be done using existing frameworks, designed for this purpose, as mentioned by the reviewer. 

      (b) Users could be asked to wait for RMP to reach steady state. For instance, in some of the traces in Figure 7, the current injection is provided before RMP reaches steady-state. In the presence of slow channels (HCN or calcium-activated channels), the RMP can take a while to settle down. Users might be sensitized about this. This would also bring to attention the ability of several resting channels in modulating RMP, and the need to wait for steady-state before measurements are made.

      We agree with the observation and updated the validation process accordingly. We have added functionality for simulation stabilization, allowing users to pre-run a simulation before the main simulation time. For example, model.run(duration=1000, prerun_time=300) could be used to stabilize the model for a period of 300 ms before running the main simulation for 1 s.

      (c) Strictly speaking, it is incorrect to obtain membrane time constant by fitting a single exponential to the initial part of the sag response (Figure 7A). This may be confirmed in the model by setting HCN to zero (strictly all active channel conductances to zero), obtaining the voltage-response to a pulse current, fitting a double exponential (as Rall showed, for a finite cable or for a real neuron, a single exponential would yield incorrect values for the tau) to the voltage response, and mapping membrane time constant to the slower of the two time-constants (in the double exponential fit). This value will be very different from what is obtained in Figure 7A. Please correct this, with references to Rall's original papers and to electrophysiological papers that use this process to assess membrane properties of neurons and their dendrites (e.g., Stuart and Spruston, J Neurosci, 1998; Golding and Spruston, J Physiology, 2005).

      We updated the algorithm for calculating the membrane time constant based on the reviewer's suggestions and added the suggested references. The time constant is now obtained in a model with blocked HCN channels (setting maximal conductance to 0) via a double exponential fit, taking the slowest component.

      (9) Section 3

      (a) May be good to emphasize the many-to-many mapping between ion channels and neuronal functions here in detail, and on how to explore this within the Dendrotweaks framework.

      We have added a paragraph in the Discussion that addresses both the problems of heterogeneity and degeneracy in biological neurons and neuronal models (3. Discussion - 3.3 Limitations and future directions - ¶.3)

      (b) May be good to have a specific section either here or in results about how the different reduced models can actually be incorporated towards building a network.

      As mentioned earlier, building a network of reduced models is a promising new direction. However, it is beyond the scope of this manuscript, whose primary goal is to introduce DendroTweaks and highlight its capabilities. DendroTweaks is designed for single-cell modeling and provides export capabilities that allow integrating it into broader workflows, including network modeling. We have added a paragraph in the manuscript (3. Discussion - 3.1 Conceptual and implementational accessibility - ¶.2) that addresses how DendroTweaks could be used alongside other software, in particular for scaling up single-cell models to the network level.

      (10) Section 4

      (a) Section 4.3: In the second sentence (line 568), the "first Kirchhoff's law" within parentheses immediately after Q=CV gives an illusion that Q=CV is the first Kirchhoff's law! Please state that this is with reference to the algebraic sum of currents at a node.

      We have corrected the equations and apologize for this oversight. 

      (b) Table 1: In the presence of active ion channels, input resistance, membrane time constant, and voltage attenuation are not passive properties. Input resistance is affected by any active channel that is active at rest (HCN, Kir, A-type K+ through the window current, etc). The same holds for membrane time constant and voltage attenuation as well. This could be made clear by stating if these measurements are obtained in the presence or absence of active ion channels. In real neurons, all these measurements are affected by active ion channels; so, ideally, these are also active properties, not passive! Also, please mention that in the presence of resonating channels (e.g., HCN, M-type K+), a single exponential fit won't be appropriate to obtain tau, given the presence of sag.

      We thank the reviewer for pointing out this ambiguity. What the term “Passive” means in Table 1 (e.g., for the input resistance, R_in) is that the minimal set of parameters needed to validate R_in are the passive ones (i.e., Cm, Ra, and Leak). We have changed the table listing to reflect this.

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 2B and the caption to Figure 2F show and describe the diameter of the sections, whereas the image in Figure 2F shows the radius. Which is the correct one?

      The reason for this is that Figure 2B shows the sections' geometry as it is represented in NEURON, i.e., with diameters, while Figure 2F shows the geometry as it is represented in an SWC file (as these changes are made based on the SWC file). Nevertheless, as mentioned earlier, we decided to remove panel F from the figure in the new version, to present a more important panel on tree graph representations.

      (2) "Each segment can be viewed as an equivalent RC circuit representing a part of the membrane". The example in Figure 2B is perhaps a relatively simple case. For more complex cases where multiple nonlinear conductances are present in each section, would it be possible to show each of these conductances explicitly? If yes, it would be nice to illustrate that.

      We would like to clarify that "can be viewed" here was intended to mean "can be considered," and we have updated the text accordingly. The schematic RC circuits were added to the corresponding figure for illustration purposes only and are not present in the GUI, as this would indeed be impractical for multiple conductances.

      (3) Some extra citations could be added. For example, it is a little strange that BRIAN2 is mentioned, but NEST is not. It might be worth mentioning and citing it. Also, the Allen Cell Types Database is mentioned, but no citation for it is given. It could be useful to add such citations (https://doi.org/10.1038/s41593-019-0417-0, https://doi.org/10.1038/s41467-017-02718-3).

      Brian 2 is extensively used in our lab on its own and as a foundation of the Dendrify library (Pagkalos et al., 2023). As stated in the discussion, we are considering bridging reduced Hodgkin-Huxley-type models to Dendrify leaky integrate-and-fire type models. For these reasons, Brian 2 is mentioned in the discussion. However, we acknowledge that our previous overview omitted references to some key software, which have now been added to the updated manuscript. We appreciate the reviewer providing references that we had overlooked.

      (3) Pagkalos, M., Chavlis, S. & Poirazi, P. Introducing the Dendrify framework for incorporating dendrites to spiking neural networks. Nat Commun 14, 131 (2023). https://doi.org/10.1038/s41467-022-35747-8

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewing Editor Comments:

      The study design used reversal learning (i.e. the CS+ becomes the CS- and vice versa), while the title mentions 'fear learning and extinction'. In my opinion, the paper does not provide insight into extinction and the title should be changed.

      Thank you for this important point. We agree that our paradigm focuses more directly on reversal learning than on standard extinction, as the test phases represent extinction in the absence of a US but follow a reversal phase. To better reflect the core of our investigation, we have changed the title.

      Proposed change in manuscript (Title): Original Title: Distinct representational properties of cues and contexts shape fear learning and extinction 

      New Title: Distinct representational properties of cues and contexts shape fear and reversal learning

      Secondly, the design uses 'trace conditioning', whereas the neuroscientific research and synaptic/memory models are rather based on 'delay conditioning'. However, given the limitations of this design, it would still be possible to make the implications of this paper relevant to other areas, such as declarative memory research.

      This is an excellent point, and we thank you for highlighting it. Our design, where a temporal gap exists between the CS offset and US onset, is indeed a form of trace conditioning. We also agree that this feature, particularly given the known role of the hippocampus in trace conditioning, strengthens the link between our findings and the broader field of episodic memory.

      Proposed change in manuscript (Methods, Section "General procedure and stimuli"): We inserted the following text (lines 218-220): "It is important to note that the temporal gap between the CS offset and potential US delivery (see Figure 1A) indicates that our paradigm employs a trace conditioning design. This form of learning is known to be hippocampus-dependent and has been distinguished from delay conditioning.

      Proposed change in manuscript (Discussion): We added the following to the discussion (lines 774-779): "Furthermore, our use of a trace conditioning paradigm, which is known to engage the hippocampus more than delay conditioning does, may have facilitated the detection of item-specific, episodiclike memory traces and their interaction with context. This strengthens the relevance of our findings for understanding the interplay between aversive learning and mechanisms of episodic memory."

      The strength of the evidence at this point would be described as 'solid'. In order to increase the strength (to convincing), analyses including FWE correction would be necessary. I think exploratory (and perhaps some FDR-based) analyses have their valued place in papers, but I agree that these should be reported as such. The issue of testing multiple independent hypotheses also needs to be addressed to increase the strength of evidence (to convincing). Evaluating the design with 4 cues could lead to false positives if, for example, current valence, i.e. (CS++ and CS-+) > (CS+- and CS--), and past valence (CS++ > CS+-) > (CS-+ > CS--) are tested as independent tests within the same data set. Authors need to adjust their alpha threshold.

      We fully agree. As summarized in our general response, we have implemented two major changes to our statistical approach to address these concerns comprehensively. These, are stated above, are the following:

      (1) Correction for Multiple Hypotheses: We previously used FWER-corrected p-values that were obtained through permutation testing. We have now applied a Bonferroni adjustment to the FWER-corrected threshold (previously 0.05) used in our searchlight analyses. For instance, in the acquisition phase, since 2 independent tests (contrasts) were conducted, the significance threshold of each of these searchlight maps was set to p <0.025 (after FWE-correction estimated through non-parametric permutation testing); in reversal, 4 tests were conducted, hence the significance threshold was set to p<0.0125. This change is now clearly described in the Methods section (section “Searchlight approach” (lines 477484). This change had no impact on our searchlight results, given that all clusters that were previously as significant with the previous FWER alpha of 0.05 were also significant at the new, Bonferroni-adjusted thresholds; we also now report the cluster-specific corrected p-values in the cluster tables in Supplementary Material.

      (2) ROI Analyses: Our ROI-based analyses used FDR-based correction for within each item reinstatement/generalized reinstatement pair of each ROI. We now explicitly state in the abstract, methods and results sections that these ROI-based analyses are exploratory and secondary to the primary whole-brain results, given that the correction method used is more liberal, in accordance with the exploratory character of these analyses.

      We are confident that these changes ensure both the robustness and transparency of our reported findings.

      Reviewer #1 (Public Review):

      (1) I had a difficult time unpacking lines 419-420: "item stability represents the similarity of the neural representation of an item to other representations of this same item."

      We thank the reviewer for pointing out this lack of clarity. We have revised the definition to be more intuitive and have ensured it is introduced earlier in the manuscript.

      Proposed change in manuscript (Introduction, lines 144-150): We introduced the concept earlier and more clearly: "Furthermore, we can measure the consistency of a neural pattern for a given item across multiple presentations. This metric, which we refer to as “item stability”, quantifies how consistently a specific stimulus (e.g., the image of a kettle) is represented in the brain across multiple repetitions of the same item. Higher item stability has been linked to successful episodic memory encoding (Xue et al., 2010)."

      Proposed change in manuscript (Methods, Section "Item stability and generalization of cues"): Original text: "Thus, item stability represents the similarity of the neural representation of an item to other representations of this same item (Xue, 2018), or the consistency of neural activity across repetitions (Sommer et al., 2022)."

      Revised text (lines 434-436): "Item stability is defined as the average similarity of neural patterns elicited by multiple presentations of the same item (e.g., the kettle). It therefore measures the consistency of an item's neural representation across repeated encounters."

      (2) The authors use the phrase "representational geometry" several times in the paper without clearly defining what they mean by this.

      We apologize for this omission. We have now added a clear and concise definition of "representational geometry" in the Introduction, citing the foundational work by Kriegeskorte et al. (2008).

      Proposed change in manuscript (Introduction): We inserted the following text (lines 117-125): " By contrast, multivariate pattern analyses (MVPA), such as representational similarity analysis (RSA; Kriegeskorte et al., 2008) has emerged as a powerful tool to investigate the content and structure of these representations (e.g., Hennings et al., 2022). This approach allows us to characterize the “representational geometry” of a set of items – that is, the structure of similarities and dissimilarities between their associated neural activity patterns. This geometry reveals how the brain organizes information, for instance, by clustering items that are conceptually similar while separating those that are distinct."

      (3) The abstract is quite dense and will likely be challenging to decipher for those without a specialized knowledge of both the topic (fear conditioning) and the analytical approach. For instance, the goal of the study is clearly articulated in the first few sentences, but then suddenly jumps to a sentence stating "our data show that contingency changes during reversal induce memory traces with distinct representational geometries characterized by stable activity patterns across repetitions..." this would be challenging for a reader to grok without having a clear understanding of the complex analytical approach used in the paper.

      We agree with your assessment. We have rewritten it to be more accessible to a general scientific audience, by focusing on the conceptual findings rather than methodological jargon.

      Proposed change in manuscript (Abstract): We revised the abstract to be clearer. It now reads: " When we learn that something is dangerous, a fear memory is formed. However, this memory is not fixed and can be updated through new experiences, such as learning that the threat is no longer present. This process of updating, known as extinction or reversal learning, is highly dependent on the context in which it occurs. How the brain represents cues, contexts, and their changing threat value remains a major question. Here, we used functional magnetic resonance imaging and a novel fear learning paradigm to track the neural representations of stimuli across fear acquisition, reversal, and test phases. We found that initial fear learning creates generalized neural representations for all threatening cues in the brain’s fear network. During reversal learning, when threat contingencies switched for some of the cues, two distinct representational strategies were observed. On the one hand, we still identified generalized patterns for currently threatening cues, whereas on the other hand, we observed highly stable representations of individual cues (i.e., item-specific) that changed their valence, particularly in the precuneus and prefrontal cortex. Furthermore, we observed that the brain represents contexts more distinctly during reversal learning. Furthermore, additional exploratory analyses showed that the degree of this context specificity in the prefrontal cortex predicted the subsequent return of fear, providing a potential neural mechanism for fear renewal. Our findings reveal that the brain uses a flexible combination of generalized and specific representations to adapt to a changing world, shedding new light on the mechanisms that support cognitive flexibility and the treatment of anxiety disorders via exposure therapy."

      (4) Minor: I believe it is STM200 not the STM2000.

      Thank you for pointing this out. We have corrected it in the Methods section.

      Proposed change in manuscript (Methods, Page 5, Line 211): Original: STM2000 -> Corrected: STM200

      (5) Line 146: "...could be particularly fruitful as a means to study the influence of fear reversal or extinction on context representations, which have never been analyzed in previous fear and extinction learning studies." I direct the authors to Hennings et al., 2020, Contextual reinstatement promotes extinction generalization in healthy adults but not PTSD, as an example of using MVPA to decipher reinstatement of the extinction context during test.

      Thank for pointing us towards this relevant work. We have revised the sentence to reflect the state of the literature more accurately.

      Proposed change in manuscript (Introduction, Page 3): Original text: "...which have never been analyzed in previous fear and extinction learning studies." 

      Revised text (lines 154-157): "...which, despite some notable exceptions (e.g., Hennings et al., 2020), have been less systematically investigated than cue representations across different learning stages."

      (6) This is a methodological/conceptual point, but it appears from Figure 1 that the shock occurs 2.5 seconds after the CS (and context) goes off the screen. This would seem to be more like a trace conditioning procedure than a standard delay fear conditioning procedure. This could be a trivial point, but there have been numerous studies over the last several decades comparing differences between these two forms of fear acquisition, both behaviorally and neurally, including differences in how trace vs delay conditioning is extinguished.

      Thank you for this pertinent observation; this was also pointed out by the editor. As detailed in our response to the editor, we now explicitly acknowledge that our paradigm uses a trace conditioning design, and have added statements to this effect in the Methods and Discussion sections (lines 218-220, and 774-779).

      (7) In Figure 4, it would help to see the individual data points derived from the model used to test significance between the different conditions (reinstatement between Acq, reversal, and test-new).

      We agree that this would improve the transparency of our results. We have revised Figure 4 to include individual data points, which are now plotted over the bar graphs. 

      Reviewer #2 (Public Review & Recommendations)

      Use a more stringent method of multiple comparison correction: voxel-wise FWE instead of FDR; Holm-Bonferroni across multiple hypothesis tests. If FDR is chosen then the exploratory character of the results should be transparently reported in the abstract.

      Thank you for these critical comments regarding our statistical methods. As detailed in the general response and response to the editor (Comment 3), we have thoroughly revised our approach to ensure its rigor. We now clarify that our whole-brain analyses consistently use FWER-corrected pvalues. Additionally, the significance of these FWER-corrected p-values (obtained through permutation testing), which were previously considered significant against a default threshold of 0.05, are now compared with a Bonferroni-adjusted threshold equal to the number of tested contrasts in each experimental phase. We have modified the revised manuscript accordingly, in the methods section (lines 473-484) and in the supplementary material, where we added the p-values (FWER-corrected) of each cluster, evaluated against the new Bonferroni-adjusted thresholds. It is to be of note that this had no impact on our searchlight results, given that all clusters that were previously reported as significant with the alpha threshold of 0.05 were also significant at the new, corrected thresholds.

      Proposed change in manuscript (Methods): We revised the relevant paragraphs (lines 473-484): "Significance corresponding to the contrast between conditions of the maps of interest was FWER-corrected using nonparametric permutation testing at the cluster level (10,000 permutations) to estimate significant cluster size. Additionally, we adjusted the alpha threshold against which we assessed the significance of the cluster-specific FWERcorrected p-values using Bonferroni correction. In this order, we divided the default alpha corrected threshold of 0.05 by the number of statistical comparisons that were conducted in each experimental phase. For example, for fear acquisition, we compared the CS+>CS- contrast for both item stability and cue generalization, resulting in 2 comparisons and hence a corrected alpha threshold of 0.025. Only clusters that had a FWER-corrected p-value below the Bonferroni-adjusted threshold were deemed significant. All searchlight analyses were restricted within a gray matter mask.”

      The authors report fMRI results from line 96 onwards; all of these refer exclusively to mass-univariate fMRI which could be mentioned more transparently... The authors contrast "activation fMRI" with "RSA" (line 112). Again, I would suggest mentioning "mass-univariate fMRI", and contrasting this with "multivariate" fMRI, of which RSA is just one flavour. For example, there is some work that is clear and replicable, demonstrating human amygdala involvement in fear conditioning using SVM-based analysis of highresolution amygdala signals (one paper is currently cited in the discussion).

      Thank you for this important clarification. We have revised the manuscript to incorporate your suggestions. We now introduce our initial analyses as "mass-univariate" and contrast them with the "multivariate pattern analysis" (MVPA) approach of RSA.

      Proposed change in manuscript (Introduction): We revised the relevant paragraphs (lines 113-125): " While mass-univariate functional magnetic resonance imaging (fMRI) activation studies have been instrumental in identifying the brain regions involved in fear learning and extinction, they are insensitive to the patterns of neural activity that underlie the stimulus-specific representations of threat cues and contexts. Contrastingly, multivariate pattern analyses methods, such as representational similarity analysis (RSA; Kriegeskorte et al., 2008), have emerged as a powerful tool to investigate the content and structure of these representations (e.g., Hennings et al., 2022). This approach allows us to characterize the “representational geometry” of a set of items – i.e., the structure of similarities and dissimilarities between their associated neural activity patterns. This geometry reveals how the brain organizes information, for instance, by clustering items that are conceptually similar while separating those that are distinct.”

      Line 177: unclear how incomplete data was dealt with. If there are 30 subjects and 9 incomplete data sets, then how do they end up with 24 in the final sample?

      We apologize for the unclear wording in our original manuscript. We have clarified the participant exclusion pipeline in the Methods section.

      Proposed change in manuscript (Methods, Section "Participants"): Original text: "The number of participants with usable fMRI data for each phase was as follows: N = 30 for the first phase of day one, N = 29 for the second phase of day one, N = 27 for the first phase of day two, and N = 26 for the second phase of day two. Of the 30 participants who completed the first session, four did not return for the second day and thus had incomplete data across the four experimental phases. An additional two participants were excluded from the analysis due to excessive head movement (>2.5 mm in any direction). This resulted in a final sample of 24 participants (8 males) between 18 and 32 years of age (mean: 24.69 years, standard deviation: 3.6) with complete, low-motion fMRI data for all analyses." 

      Revised text: "The number of participants with usable fMRI data for each phase was as follows: N = 30 for the first phase of day one, N = 29 for the second phase of day one, N = 27 for the first phase of day two, and N = 26 for the second phase of day two. An additional two participants were excluded from the analysis due to excessive head movement (>2.5 mm in any direction). This resulted in a final sample of 24 participants (8 males) between 18 and 32 years of age (mean: 24.69 years, standard deviation: 3.6) with complete, low-motion fMRI data for all analyses."

      Typo in line 201.  

      Thank you for your comment. We have re-examined line 201 (“interval (Figure 1A). A total of eight CSs were presented during each phase and”) and the surrounding text but were unable to identify a clear typographical error in the provided quote. However, in the process of revising the manuscript for clarity, we have rephrased this section.

      it would be good to see all details of the US calibration procedure, and the physical details of the electric shock (e.g. duration, ...).

      Thank you for your comment. We have expanded the Methods section to include these important details.

      Proposed change in manuscript (Methods, Section "General procedure and stimuli"): We inserted the following text (lines 225-230): "Electrical stimulation was delivered via two Ag/AgCl electrodes attached to the distal phalanx of the index and middle fingers of the non-dominant hand. he intensity of the electrical stimulation was calibrated individually for each participant prior to the experiment. Using a stepping procedure, the voltage was gradually increased until the participant rated the sensation as 'unpleasant but not painful'.

      "beta series modelling" is a jargon term used in some neuroimaging software but not others. In essence, the authors use trial-by-trial BOLD response amplitude estimates in their model. Also, I don't think this requires justification - using the raw BOLD signal would seem outdated for at least 15 years.

      Thank you for this helpful suggestion. We have simplified the relevant sentences for improved clarity.

      Proposed change in manuscript (Methods, Section "RSA"): Original text: "...an approach known as beta-series modeling (Rissman et al., 2004; Turner et al., 2012)." 

      Revised text (lines 391-393): "...an approach that allows for the estimation of trial-by-trial BOLD response amplitudes, often referred to as beta-series modeling (Rissman et al., 2004). Specifically, we used a Least Square Separate (LSS) approach..."

      I found the use of "Pavlovian trace" a bit confusing. The authors are coming from memory research where "memory trace" is often used; however, in associative learning the term "trace conditioning" means something else. Perhaps this can be explained upon first occurrence, and "memory trace" instead of "Pavlovian trace" might be more common.

      We are grateful for this comment, as it highlights a critical point of potential confusion, especially given that we now acknowledge our paradigm uses a trace conditioning design. To eliminate this ambiguity, we have replaced all instances of "Pavlovian trace" with "lingering fear memory trace" throughout the manuscript (lines 542 and 599).

      I would suggest removing evaluative statements from the results (repeated use of "interesting").

      Thank you for this valuable suggestion. We have reviewed the Results section and removed subjective evaluative words to maintain a more objective tone. 

      Line 882: one of these references refers to a multivariate BOLD analysis using SVM, not explicitly using temporal information in the signal (although they do show session-by-session information).

      Thank you for this correction. We have re-examined the cited paper (Bach et al., 2011) and removed its inclusion in the text accordingly.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The study explores the use of Transport-based morphometry (TBM) to predict hematoma expansion and growth 24 hours post-event, leveraging Non-Contrast Computed Tomography (NCCT) scans combined with clinical and location-based information. The research holds significant clinical potential, as it could enable early intervention for patients at high risk of hematoma expansion, thereby improving outcomes. The study is well-structured, with detailed methodological descriptions and a clear presentation of results. However, the practical utility of the predictive tool requires further validation, as the current findings are based on retrospective data. Additionally, the impact of this tool on clinical decision-making and patient outcomes needs to be further investigated.

      Strengths:

      (1) Clinical Relevance: The study addresses a critical need in clinical practice by providing a tool that could enhance diagnostic accuracy and guide early interventions, potentially improving patient outcomes.

      (2) Feature Visualization: The visualization and interpretation of features associated with hematoma expansion risk are highly valuable for clinicians, aiding in the understanding of model-derived insights and facilitating clinical application.

      (3) Methodological Rigor: The study provides a thorough description of methods, results, and discussions, ensuring transparency and reproducibility.

      Weaknesses:

      (1) The limited sample size in this study raises concerns about potential model overfitting. While the reported AUCROC of 0.71 may be acceptable for clinical use, the robustness of the model could be further enhanced by employing techniques such as k-fold crossvalidation. This approach, which aggregates predictive results across multiple folds, mimics the consensus of diagnoses from multiple clinicians and could improve the model's reliability for clinical application. Additionally, in clinical practice, the utility of the model may depend on specific conditions, such as achieving high specificity to identify patients at risk of hematoma expansion, thereby enabling timely interventions. Consequently, while AUC is a commonly used metric, it may not fully capture the model's clinical applicability. The authors should consider discussing alternative performance metrics, such as specificity and sensitivity, which are more aligned with clinical needs. Furthermore, evaluating the model's performance in real-world clinical scenarios would provide valuable insights into its practical utility and potential impact on patient outcomes.

      We thank the reviewer for these thoughtful comments. We agree that k-fold cross validation is a valid approach to reduce bias associated with overfitting and account for variability in the dataset composition. During the training and optimization process, this was employed within the VISTA dataset where data were shuffled at random and separated into independent training (60%) and internal validation (40%) datasets. This process was repeated 1000 times, to generate 1000 different training and internal validation splits. Statistical analyses and data visualization were performed independently on each of the 1000 cross-validation samples, and the mean results with corresponding 95% confidence intervals are presented. The p-values were averaged using the Fisher’s method. We have included this information in the methods section. [Page 22; Paragraph 1, Lines 8-10]. External validation was performed on the ERICH dataset and analyzed only once. We chose not to perform k-fold cross validation with the test dataset in attempt to assess the model’s generalizability to unseen data from a different patient cohort. However, we agree that taking advantage of the full 1,066 ERICH cases for model validation would improve the strength of our conclusions regarding the model’s robustness. This has been included in the discussion. [Page 15; Paragraph 1; Lines 11-14].

      We agree that the AUC alone will not effectively describe the clinical applicability of the intended model. We have added the sensitivity and specificity metrics for the TBM’s performance in the external dataset to the discussion. The design of the present study was primarily a pre-clinical methodological study. However, we have suggested that future external validation studies should seek to identify ideal sensitivity and specificity thresholds when evaluating the model’s translatability to a clinical setting. [Page 11; Paragraph 2; Line 22 and Page 12; Paragraph 1; Lines 2-4]. We agree that future validation studies should also assess the model’s performance in a real-world clinical setting and have emphasized this point in the discussion. [Page 13; Paragraph 2; Lines 22-23 and Page 14; Paragraph 1; Lines 1-4].

      (2) The authors compared the performance of TBM with clinical and location-based information, as well as other machine learning methods. While this comparison highlights the relative strengths of TBM, the study would benefit from providing concrete evidence on how this tool could enhance clinicians' ability to assess hematoma expansion in practice. For instance, it remains unclear whether integrating the model's output with a clinician's own assessment would lead to improved diagnostic accuracy or decisionmaking. Investigating this aspect-such as through studies evaluating the combined performance of clinician judgment and model predictions-could significantly enhance the tool's practical value.

      We thank the reviewer for this suggestion. The present study intended to suggest potential advantages of the TBM method with comparison to alternate clinician-based and machine learning methods. While we agree that the TBM method warrants further evaluation in a realworld clinical setting to determine its practical utility, we propose that further optimization of TBM is first needed to improve its predictive accuracy. 

      In developing TBM, our eventual goal is to produce a prediction tool, which can identify patients at risk for hematoma expansion early in the disease course, who may benefit from intervention with surgical and/or medical therapies. Current clinician-based risk stratification methods are highly variable in accuracy, inefficient, and require subjective interpretation of the NCCT scan. Our eventual goal is to aid clinical decision making with an automated, accurate and efficient model. In follow up work, we will study how to combine information derived from imaging and TBM with other assessment tools and clinical data in order to best inform clinicians. This has been incorporated into the discussion. [Page 14; Paragraph 1; Lines 1-4].

      Reviewer #2 (Public review):

      Summary:

      The author presents a transport-based morphometry (TBM) approach for the discovery of noncontrast computed tomography (NCCT) markers of hematoma expansion risk in spontaneous intracerebral hemorrhage (ICH) patients. The findings demonstrate that TBM can quantify hematoma morphological features and outperforms existing clinical scoring systems in predicting 24-hour hematoma expansion. In addition, the inversion model can visualize features, which makes it interpretable. In conclusion, this research has clinical potential for ICH risk stratification, improving the precision of early interventions.

      Strengths:

      TBM quantifies hematoma morphological changes using the Wasserstein distance, which has a well-defined physical meaning. It identifies features that are difficult to detect through conventional visual inspection (such as peripheral density distribution and density heterogeneity), which provides evidence supporting the "avalanche effect" hypothesis in hematoma expansion pathophysiology.

      Weaknesses:

      (1) As a methodology-focused study, the description of the methods section somewhat lacks depth and focus, which may make it challenging for readers to fully grasp the overall structure and workflow of the approach. For instance, the manuscript lacks a systematic overview of the entire process, from NCCT image input to the final prediction output. A potential improvement would be to include a workflow figure at the beginning of the manuscript, summarizing the proposed method and subsequent analytical procedures. This would help readers better understand the mechanism of the model.

      We thank the reviewer for this suggestion. We have included a figure detailing the TBM workflow to improve reader understanding. [Figure 1, Page 5; Paragraph 2; Lines 19-20 and Page 30; Paragraph 1].

      (2) The description of the comparison algorithms could be more detailed. Since TBM directly utilizes NCCT images as input for prediction, while SVM and K-means are not inherently designed to process raw imaging data, it would be beneficial to clarify which specific features or input data were used for these comparison models. This would better highlight the effectiveness and advantages of the TBM method.

      We thank the reviewer for this suggestion. We have included additional details of the comparison with machine learning models in the methods section. While we used PCA on the extracted transport maps and raw image data for dimensionality reduction prior to classification, we agree that the machine learning methods described may not have been optimally tuned to examine the data in the format in which it was presented. Future studies should aim to compare TBM with optimized machine and deep learning methods to determine TBM’s potential as an automated clinical risk stratification tool. We have added this to the limitations section of the discussion. [Page 14; Paragraph 2; Lines 22-23 and Page 15; Paragraph 1; Lines 1-2].

      (3) The relatively small training and testing dataset may limit the model's performance and generalizability. Notably, while the study mentions that 1,066 patients from the ERICH dataset met the inclusion criteria, only 170 were randomly selected for the test set. Leveraging the full 1,066 ERICH cases for model training and internal validation might potentially enhance the model's robustness and performance.

      We thank the reviewer for this suggestion. As the reviewer highlights, the intention of the manuscript was to present a methodologically focused study which led to our small validation cohort of 170 patients from the ERICH dataset. It is our intention to further optimize and validate the TBM method in a future larger study which is underway, taking full advantage of the ERICH dataset. This has been incorporated into the discussion section. [Page 15; Paragraph 1; Lines 1114].

      (4) Some minor textual issues need to be checked and corrected, such as line 16 in the abstract "Incorporating these traits into a v achieved an AUROC of 0.71 ...".

      We thank the reviewer for this comment. The typographical error has been corrected. 

      (5) Some figures need to be reformatted (e.g., the x-axis in Figure 2 a is blocked).

      We thank the reviewer for this comment. This was intentional to demonstrate that the X-axis in Figure 2a and 2b are identical and thereby highlight image features corresponding to the regression line on the graph.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      While the authors have largely ruled out zebrin II as the key protein underlying PC vulnerability or resistance to age-related loss, the molecular basis of this phenomenon remains unidentified. This reviewer acknowledges the complexity of this investigation and considers it a minor issue, as the manuscript thoughtfully discusses the gap and highlights it as a future direction.

      We appreciate the reviewer’s acknowledgement of the complexity of determining the molecular basis of differential Purkinje cell vulnerability. Moreover, we acknowledge that zebrin II expression/identity is not the only factor in determining vulnerability; rather, the compartmentalized map as a whole may dictate these differences. We are eager to shed light on this issue through future study.

      In cases where no PC loss is observed in aged mice (Figure 1F), it is unclear whether these PCs undergo morphological degeneration, such as thickened axons and shrunken dendrites. Further characterization of these resilient PCs would help understand why the aged mice without PC loss still exhibit motor deficits (Figure 7).

      Thank you for the excellent idea of examining Purkinje cell morphology in aged mice without Purkinje cell loss. Upon looking for hallmarks of neurodegeneration, such as shrunken dendrites and axonal swellings, in aged mice without Purkinje cell loss, we observed minimal axonal pathology and no shrinkage of the molecular layer.  However, we note that while the features we examined are wellstudied hallmarks of degeneration, they are specific rather than exhaustive, and subtle morphological characteristics that are beyond our methods’ detection may change. We have added these new results to Figure 2C and added these notes to the manuscript.

      The histologic analysis is based on mice with different genetic backgrounds. For example, the PC-specific reporter mice include two strains: Pcp2-Cre; Ai32 and Pcp2-Cre; Ai40D. These genetic variations may contribute to the heterogeneity of PC loss (Figure 1). To improve clarity, please add the genetic background details to Table 1.

      We have added the genetic backgrounds of all mice used in the study to Table 1.

      Please indicate from which lobule in the anterior or posterior human cerebellum the images in Figure 8 were taken.

      Unfortunately, because of the limitations of human postmortem tissue collection (in some cases, we are provided with a very small block that was collected after the pathologist completed their primary duty for that individual), we cannot with full certainty distinguish the lobules from which the images were taken. However, we are grateful that, upon our request, the pathologists were able to collect tissue mainly from the vermis, which is where we wished to begin, knowing that the vermis in rodents and non-human primates typically has the clearest and most well-studied pattern. That said, this is an important issue that we are addressing for future studies.

      Reviewer #2 (Public review):

      (1) Limited strain diversity: The study exclusively uses C57BL/6J mice despite known genetic and motor differences even the closely related strains like C57BL/6N.

      Thank you for pointing out this limitation of our study. We chose to limit this initial study to C57BL/6J mice based on their widespread use as a background strain on many currently maintained lines. That said, our study intentionally included several different crosses to provide genetic variability, even though C57BL/6J is still the predominant genetic background. In addition to the motor differences in genetic strains, we are also particularly interested in the differences in cerebellar morphology across strains (Inouye and Oda, 1980; Sillitoe and Joyner, 2007). Our use of mice maintained on the C57BL/6J background leaves open an exciting future direction: investigating age-related Purkinje cell loss in mice of different inbred and outbred strains. Given the importance of the topic, we have included new text in the discussion to alert the reader to this limitation of our study and to highlight interesting differences across strains that will be important to disentangle in our future work.

      (2) No correlation quantified between the degree of PC loss, aging, and motor performance.

      We sought to conduct a broad overview of motor problems that might be caused by age-related Purkinje cell loss, rather than a comprehensive investigation of how motor behavior changes with advancing Purkinje cell loss. Therefore, we agree with the reviewer’s comment, and we have added text to indicate that stronger correlations between these domains would be best tackled with deeper behavioral phenotyping conducted over time to match the potentially cooccurring progressive changes in cerebellar morphology, with a focus on Purkinje cell degeneration and eventual loss.

      (3) It has not been demonstrated whether the neurodegenerative changes are indeed observed in zebrin-negative PCs.

      We have added Supplementary Figure 4, which includes an example of reduced dendritic density and loss of Purkinje cell somata in zebrin II-negative stripes in lobules II and III. Please also see Figure 4B for an example of reduced dendritic density in zebrin II-negative Purkinje cells in lobules III and IV.

      (4) The mechanisms of why only a subset of mice show PC loss remain unexplored and not discussed.

      We agree that our manuscript would benefit from discussion of why some aged mice are resistant to age-related Purkinje cell loss. We have elaborated upon possible reasons for this differential vulnerability in the discussion.

      (5) Linkages with normal human aging and cerebellar function are not well supported. While motor behavioral assays captured phenotypes that mimic aged people, correlation with PC loss is demonstrated to be absent in mice. It remains unclear whether this PC loss phenomenon is universal or specific to a particular individual; and whether specific to a human PC subtype.

      In our study, we sought to show that patterned age-related Purkinje cell loss presents a promising area for future research in humans. We agree that further study is needed to solidify a link between age-related Purkinje cell loss in mice and humans and the implications for motor function. The reviewer raises a fair criticism that reflects the current state of knowledge: studies that link cerebellar aging to  motor function and cognitive decline in humans are few, as are studies of the cellular-level morphological changes of cerebellar aging –there is a pressing need for deeper study of human tissue. To address the issue raised by the reviewer, we have included new text to the discussion of our manuscript indicating these gaps in knowledge. 

      (6) Analyses in the paraflocculus are currently not easy to understand. This lobule has heterogeneous PC subtypes, developmentally or molecularly. Zebrin-weak and Zebrinintense PCs are known to be arranged in stripes, which resembles the pattern of developmentally defined PC subsets (Fujita et al., 2014, Plos one; Fujita et al., 2012, J Neurosci). In the data presented, it is hard to appreciate whether the viewing angle is consistent relative to the angle of the paraflocculus. This may be a limitation of the analysis of the paraflocculus in general, that the orientation of this lobule is so susceptible to fixation and dissection. Discrepancy between PC loss stripe and zebrin pattern may be an overstatement, because appropriate analyses on the paraflocculus would require a rigorously standardized analytic method.

      Thank you for your valuable insights on the complexity of analyzing the paraflocculus. We have altered our language to more accurately reflect the nuanced zebrin II expression pattern of this region. We also agree with and very much appreciate your advice that “analyses on the paraflocculus would require a rigorously standardized analytic method.” We have added these arguments to the revised manuscript text.

      Reviewer #3 (Public review):

      (1) In Figure 3, the authors use Pcp2-CRE mice to drive GFP expression in Purkinje cells in order to avoid the confounding variable of loss of calbindin expression in aging Purkinje cells. The authors go on to say, "we argue that calbindin expression alone is not a reliable, sufficient indicator of Purkinje cell loss". However, in Figure 4, the authors return to calbindin staining alone to assess the correlation of Purkinje cell loss with zebrin-II expression. Could the authors comment on why zebrin-II co-staining experiments were not performed in GFP reporter mice to avoid potential confounds of calbindin expression? Without this experiment, should readers accept the data presented in Figure 4 as a "reliable, sufficient indicator of Purkinje cell loss", given the author's prior claim?

      This is a very good point, thank you. We agree that the data presented in Figure 4 alone would not be a sufficient indicator of Purkinje cell loss. However, we prefaced our calbindin and zebrin II co-staining with calbindin and GFP costaining (Figure 3), which showed that Purkinje cell-specific reporter expression revealed the same pattern of Purkinje cell loss as calbindin expression, and Neutral Red staining (Figure 2 and Supplementary Figure 3B), which confirmed the loss of Purkinje cells independent of immunofluorescence. For this reason, we feel confident that the data in Figure 4 is representative of the striped pattern of age-related Purkinje cell loss. Still, we see and agree with the reviewer’s comment, and therefore, to further show the correlation of Purkinje cell loss with zebrin II expression, we have added a new Supplementary Fig. 4, which shows co-staining of calbindin, GFP, and zebrin II.

      (2) Throughout the manuscript, there is a considerable reliance on the authors' interpretation of imaging data with no accompanying quantification (categorization of "striped" or "non-striped" PC loss, correlation of GFP/calbindin/zebrin-II staining, etc.). While this may be difficult to obtain, the results would be much stronger with a quantitative approach to support the stated categorizations/observations.

      Thank you for your suggestion. Quantifying stripe properties has been a challenging task for the field, given the regionalized features of stripe compartmentalization that make its complex architecture tricky to measure in its typical organization within the 3D anatomy of lobules and fissures and even harder to interpret when there are abnormalities. However, to quantitatively support our categorization of “striped” and “non-striped” Purkinje loss and the observed correlation between calbindin and GFP expression in aged mice, we have quantified the mediolateral pixel intensity across lobules II-IV, in which Purkinje cell loss reliably occurs in zebrin II-negative stripes. The results can be found in Supplementary Figure 1B and Supplementary Figure 3.

      Reviewer #1 (Recommendations for the authors):

      (1) In Figure 1, both staining artifacts and PC degeneration appear in light color. Please clarify how these two were differentiated.

      Thank you for your comment, which raises an important point about distinguishing staining artifacts from Purkinje cell degeneration. Cerebellar patterning is symmetrical across the midline, so asymmetrical abnormalities are one clue that differentiates staining artifacts from the degenerative pattern. Another indicator of a staining artifact seen in wholemount preparations is the gradual fading of the stain (seen in some hemispheres in Figure 1), which is caused by continuous rubbing of the cerebellum against the tube during the staining process. In some cases, such as in Figure 1F, the cerebellum was damaged during the dissection of the meninges after staining, and in such cases the accidental removal of cerebellar tissue (molecular layer) reveals unstained tissue beneath the surface of the cerebellum. This type of staining artifact can be identified by a missing chunk of tissue surrounded by stained Purkinje cells, compared to the smooth, unmarred tissue where PCs have degenerated. We have added new text to the results (the legends) to clarify these critical points for the reader.

      (2) In Figure 7C, please consider changing "Aged without stripes" to "Aged without PC loss" to be consistent with the labeling used in other panels.

      Thank you for pointing out this discrepancy. We have made the suggested changes.

      Reviewer #3 (Recommendations for the authors):

      Could the authors comment on why zebrin-II co-staining experiments were not performed in GFP reporter mice to avoid potential confounds of calbindin expression? Without this experiment, should readers accept the data presented in Figure 4 as a "reliable, sufficient indicator of Purkinje cell loss", given the author's prior claim?

      Thank you for this recommendation; we appreciate this advice. As we described above, our response to this suggestion reads:

      This is a very good point, thank you. We agree that the data presented in Figure 4 alone would not be a sufficient indicator of Purkinje cell loss. However, we prefaced our calbindin and zebrin II co-staining with calbindin and GFP costaining (Figure 3), which showed that Purkinje cell-specific reporter expression revealed the same pattern of Purkinje cell loss as calbindin expression, and Neutral Red staining (Figure 2 and Supplementary Figure 3B), which confirmed the loss of Purkinje cells independent of immunofluorescence. For this reason, we feel confident that the data in Figure 4 is representative of the striped pattern of age-related Purkinje cell loss. Still, we see and agree with the reviewer’s comment, and therefore to further show the correlation of Purkinje cell loss with zebrin II expression, we have added a new Supplementary Fig. 4, which shows co-staining of calbindin, GFP, and zebrin II.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewing Editor Comments:

      Recommendations for improvement:

      (1) Address data presentation, editing, and other issues of lack of clarity as pointed out by the reviewers.

      We have now addressed all comments from reviewers that identify editing errors and lack of clarity issues. Regarding data presentation we have made some changes, for example including a combined heatmap to show consistency between row names (Figure 2 - figure supplement 2), but also kept some stylistic features such as the balance between main and supplemental figures that we think fits more naturally with the story of the paper.

      (2) Inclusion of requested and critical details in the methodology section, an important component for broad applicability of a new methodology by other investigators.

      We have added the requested details to the methods section, specifically the RCA protocol.

      (3) More in-depth discussion of the limitations of the methodology and approach to capture important but more complex components of tissues of interest, for example, sexual dimorphism.

      We have now edited the ‘pitfalls of study’ section in the discussion to include further detail of the limitations of the number of genes that can be used to deeply profile transcriptomic types, including sexual dimorphism. Regarding its use in other tissues of interest, we have now included a reference in the discussion (Bintu et al., 2025) where a similar strategy has been used to profile cells in the olfactory epithelium and olfactory bulb. We have also used hamFISH in other brain areas (as commented in our public reviews responses) but as this is unpublished work we will refrain from mentioning it in the main text.

      Reviewer #1 (Recommendations for the authors):

      The manuscript by Edwards et al. would benefit from minor revisions. Here, we outline several points that could / should be addressed:

      (1) General balance of data presentation between main and supplementary figures

      (a) quantifications were often missing from main figures and only presented in the supplements

      Thank you for raising this point. We believe that the balance of panels between the main and supplemental figures matches our story and results section well with quantifications included in the main figures where appropriate.

      (b) more informative figure legends in supplements (e.g.: Supplementary Figure I - Figure 3)

      We have now revised the figure legends and added more description where appropriate.

      (c) missing subpanel in Figure 3; figure legend describes 3H, which is missing in the figure

      We thank the reviewer for pointing this out and have now amended the subpanel.

      stand-alone figure on inhibitory neuron cluster i3 cells

      We agree that this is an important characterisation of i3 cells but decided to place this figure in the supplement as it does not fall within the main storyline (defining transcriptomic characterisation of cell types in a multimodal fashion), but rather acts as accessory information for those specifically interested in these inhibitory cell types.

      statistical tests used (e.g.: Figure 1 C -, Supplementary Figure 3 - Figure 2)/ graphs shown (Supplementary Figure 1 - 1 D)

      The statistical tests used are described in the figure legends.

      t-SNE dimensionality reduction of positional parameters

      Explanations of the t-SNE dimensionality reduction of positional parameters can be found in the materials and methods.

      (d) heatmaps similarly informative and more convincing

      We have included an extra heatmap (Figure 2 - figure supplement 2) in response to Reviewer 3’s comment (see below) in order to more easily follow genes across all the different clusters. We hope this helps to make the heatmaps more convincing and informative.

      code availability

      Code availability is described in the methods section of the manuscript.

      page 6, 3rd paragraph wrong description of PMCo abbreviation

      We thank the reviewer for identifying the mistake and we have now amended it.

      Reviewer #2 (Recommendations for the authors):

      The pre-existing scRNA-seq dataset on which the manuscript is based is an older Drop-seq dataset for which minimal QC information is provided. The authors should include QC information (genes/cells and UMIs/cells) in the Methods. Moreover, the Seurat clustering of these cells and depiction of marker genes in feature plots are not shown.

      It is therefore difficult to determine how the authors selected their 31 genes for their hamFISH panel, or how selective they are to the original Drop-seq clusters.

      The QC information of this dataset can be found in the original publication (Chen et al., 2019) with our clustering methods described in the materials and methods section. We have not included individual gene names in our heatmap plots for presentation purposes (there are over 200 rows), but the data and cluster descriptions can be found in supplemental tables.

      Reviewer #3 (Recommendations for the authors):

      (1) The imaging modality is not entirely clear in the methods. The microscopy technique is referenced to prior work and involves taking z-stacks, but analysis appears to be done on maximum z-projections, which seems like it would introduce the risk of false attribution of gene expression to cells that are overlapping in "z".

      Thank you for pointing out the technical limitation of the microscopy. For imaging we used epifluorescence microscopy with 14x 500 nm z-steps to collect our raw data and generate a maximum intensity projection for further analysis. Because of the thin sections (10 um) used for the imaging, the overlap between cells in z is expected to be minimal. However, we cannot completely rule out misattribution raised in the comment. The method section contains this information.

      (2) Supplemental Figure 1 - Figure Supplement 2B: RCA looks significantly different when compared to v2 smFISH from the representative image, although it is written as comparable. Additionally, there is no information about RCA mentioned in the Materials and Methods section. Supplemental Figure 1 - Figure Supplement 2B: The figure label for RCA is missing.

      By comparable we are referring to the intensity rather than pattern as mentioned in the results section. We did not analyze the number of spots. It is true that the pattern of RCA signal is much sparser due to its inherent insensitivity compared with hamFISH. We thank the reviewer for identifying the lack of a methodological RCA description and have amended the manuscript to include this. We have also now amended the missing RCA label in the figure.

      (3) Figure 2C and associated supplement: The rows (each gene) are not consistent across the subpanels (i.e. they do not line up left-to-right), this makes it difficult for the reader to follow the patterns that distinguish the cell types in each subset.

      We have done this as we believe it makes for an easier interpretation of inhibitory vs excitatory clusters for the reader. However, we agree with the reviewer that one may wish to look at the dataset as a whole with a consistent gene order, and we have now provided this in the corresponding supplemental figure.  

      (4) "Consistent with previous work, most inhibitory classes are localized in the dorsal and ventral subdivisions of the MeA, whereas excitatory neurons occupy primarily the ventral MeA (Figure 2D, Figure 2 - Figure Supplement 2C, Figure 1D)". - The reference to Figure 1D seems to be an error.

      We thank the reviewer for identifying the mistake, and we have now amended it.

      (5) Supplemental Figure 2 - Figure Supplement 1, "published by Chen et al." - should have a proper reference number to be compatible with the rest of the manuscript. Also, the lack of gene info makes it difficult to understand Panel A. Finally, the text on Panel B refers to "hamMERFISH" which seems an error.

      We thank the reviewer for identifying the mistake on Panel B, it has now been amended. We have also changed the reference format. Regarding the lack of gene information in panel A, it is difficult to present all row names due to the large number of rows (>200), but this information can be found in supplemental table 2.

      (6) Supplemental Figure 2 - Figure Supplement 1: there are thin dividing lines drawn on each section, but these are not described or defined, making it difficult to understand what is being delineated.

      We thank the reviewer for identifying this omission and have now edited to figure legend to contain a description.

      (7) Page 4, "...we found 26 clusters in cells that are positive for Slc32a1 (inhibitory) or Slc17a6 (encoding Vglut2 and therefore excitatory) positive (Figure 2 - figure supplement 1A, Table S2)."

      This seems to be an error as Figure 2 - figure supplement 1A does not show this.

      We double-checked that this description describes the panel accurately.

      (8) "The clustering revealed that inhibitory and excitatory classes generally have different spatial properties (Figure 1E, left), although the salt-and-pepper, sparse nature of e10 (Nts+) cells is more similar to inhibitory cells than other excitatory classes".

      The references to Figure 1E's should be to Figure 2E.

      We thank the reviewer for identifying the mistake, and we have now amended it.

      (9) "Comparison of the proportion of all cells that are cluster X vs projection neurons labelled by CTB that are cluster X". Please explain cluster X in this context.

      We have now rephrased this sentence in the figure legend for clarity.

      (10) Figure 3 - figure supplement 3: There appears to be quite a bit of heterogeneity in the patterns of activity across clusters even within behavioral contexts (e.g. the bottom 2 animals paired with females). It might be worth commenting on (or quantifying) whether there were any evident differences in the social behaviors observed (e.g. mating or not?) in individuals demonstrating these patterns.

      We thank the reviewer for this observation. We unfortunately did not quantify the behaviors, but we agree that more work is needed to link the pattern of c-fos activity with incrementally measured behavioral variables. At least, we did not include animals that did not display the anticipated social behaviours (as described in the materials and methods) in the in situ transcriptomic profiling work.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      In the current article, Octavia Soegyono and colleagues study "The influence of nucleus accumbens shell D1 and D2 neurons on outcome-specific Pavlovian instrumental transfer", building on extensive findings from the same lab. While there is a consensus about the specific involvement of the Shell part of the Nucleus Accumbens (NAc) in specific stimulus-based actions in choice settings (and not in General Pavlovian instrumental transfer - gPIT, as opposed to the Core part of the NAc), mechanisms at the cellular and circuitry levels remain to be explored. In the present work, using sophisticated methods (rat Cre-transgenic lines from both sexes, optogenetics, and the well-established behavioral paradigm outcome-specific PIT-sPIT), Octavia Soegyono and colleagues decipher the diNerential contribution of dopamine receptors D1 and D2 expressing spiny projection neurons (SPNs). 

      After validating the viral strategy and the specificity of the targeting (immunochemistry and electrophysiology), the authors demonstrate that while both NAc Shell D1- and D2SPNs participate in mediating sPIT, NAc Shell D1-SPNs projections to the Ventral Pallidum (VP, previously demonstrated as crucial for sPIT), but not D2-SPNs, mediates sPIT. They also show that these eNects were specific to stimulus-based actions, as valuebased choices were left intact in all manipulations. 

      This is a well-designed study, and the results are well supported by the experimental evidence. The paper is extremely pleasant to read and adds to the current literature.

      We thank the Reviewer for their positive assessment. 

      Reviewer 2 (Public Review):

      Summary: 

      This manuscript by Soegyono et al. describes a series of experiments designed to probe the involvement of dopamine D1 and D2 neurons within the nucleus accumbens shell in outcome-specific Pavlovian-instrumental transfer (osPIT), a well-controlled assay of cueguided action selection based on congruent outcome associations. They used an optogenetic approach to phasically silence NAc shell D1 (D1-Cre mice) or D2 (A2a-Cre mice) neurons during a subset of osPIT trials. Both manipulations disrupted cue-guided action selection but had no eNects on negative control measures/tasks (concomitant approach behavior, separate valued guided choice task), nor were any osPIT impairments found in reporter-only control groups. Separate experiments revealed that selective inhibition of NAc shell D1 but not D2 inputs to ventral pallidum was required for osPIT expression, thereby advancing understanding of the basal ganglia circuitry underpinning this important aspect of decision making.

      Strengths: 

      The combinatorial viral and optogenetic approaches used here were convincingly validated through anatomical tract-tracing and ex vivo electrophysiology. The behavioral assays are sophisticated and well-controlled to parse cue and value-guided action selection. The inclusion of reporter-only control groups is rigorous and rules out nonspecific eNects of the light manipulation. The findings are novel and address a critical question in the literature. Prior work using less decisive methods had implicated NAc shell D1 neurons in osPIT but suggested that D2 neurons may not be involved. The optogenetic manipulations used in the current study provide a more direct test of their involvement and convincingly demonstrate that both populations play an important role. Prior work had also implicated NAc shell connections to ventral pallidum in osPIT, but the current study reveals the selective involvement of D1 but not D2 neurons in this circuit. The authors do a good job of discussing their findings, including their nuanced interpretation that NAc shell D2 neurons may contribute to osPIT through their local regulation of NAc shell microcircuitry. 

      We thank the Reviewer for their positive assessment. 

      Weaknesses: 

      The current study exclusively used an optogenetic approach to probe the function of D1 and D2 NAc shell neurons. Providing a complementary assessment with chemogenetics or other appropriate methods would strengthen conclusions, particularly the novel demonstration of D2 NAc shell involvement. Likewise, the null result of optically inhibiting D2 inputs to the ventral pallidum leaves open the possibility that a more complete or sustained disruption of this pathway may have impaired osPIT.

      We acknowledge the reviewer's valuable suggestion that demonstrating NAc-S D1- and D2-SPNs engagement in outcome-specific PIT through another technique would strengthen our optogenetic findings. Several approaches could provide this validation. Chemogenetic manipulation, as the reviewer suggested, represents one compelling option. Alternatively, immunohistochemical assessment of phosphorylated histone H3 at serine 10 (P-H3) oMers another promising avenue, given its established utility in reporting striatal SPNs plasticity in the dorsal striatum (Matamales et al., 2020). We hope to complete such an assessment in future work since it would address the limitations of previous work that relied solely on ERK1/2 phosphorylation measures in NAc-S SPNs (Laurent et al., 2014). The manuscript was modified to report these future avenues of research (page 12). 

      Regarding the null result from optical silencing of D2 terminals in the ventral pallidum, we agree with the reviewer's assessment. While we acknowledge this limitation in the current manuscript (page 13), we aim to address this gap in future studies to provide a more complete mechanistic understanding of the circuit.

      Reviewer 3 (Public Review):

      Summary:

      The authors present data demonstrating that optogenetic inhibition of either D1- or D2MSNs in the NAc Shell attenuates expression of sensory-specific PIT while largely sparing value-based decision on an instrumental task. They also provide evidence that SS-PIT depends on D1-MSN projections from the NAc-Shell to the VP, whereas projections from D2-MSNs to the VP do not contribute to SS-PIT.

      Strengths:

      This is clearly written. The evidence largely supports the authors' interpretations, and these eNects are somewhat novel, so they help advance our understanding of PIT and NAc-Shell function.

      We thank the Reviewer for their positive assessment. 

      Weaknesses:

      I think the interpretation of some of the eNects (specifically the claim that D1-MSNs do not contribute to value-based decision making) is not fully supported by the data presented.

      We appreciate the reviewer's comment regarding the marginal attenuation of valuebased choice observed following NAc-S D1-SPN silencing. While this manipulation did produce a slight reduction in choice performance, the behavior remained largely intact. We are hesitant to interpret this marginal eMect as evidence for a direct role of NAc-S D1SPNs in value-based decision-making, particularly given the substantial literature demonstrating that NAc-S manipulations typically preserve such choice behavior (Corbit et al., 2001; Corbit & Balleine, 2011; Laurent et al., 2012). Furthermore, previous work has shown that NAc-S D1 receptor blockade impairs outcome-specific PIT while leaving value-based choice unaMected (Laurent et al., 2014). We favor an alternative explanation for our observed marginal reduction. As documented in Supplemental Figure 1, viral transduction extended slightly into the nucleus accumbens core (NAc-C), a region established as critical for value-based decision-making (Corbit et al., 2001; Corbit & Balleine, 2011; Laurent et al., 2012; Parkes et al., 2015). The marginal impairment may therefore reflect inadvertent silencing of a small number of  NAc-C D1-SPNs rather than a functional contribution from NAc-S D1-SPNs. Future studies specifically targeting larger NAc-C D1-SPN populations would help clarify this possibility and provide definitive resolution of this question.

      Reviewer 1 (Recommendations for the Author):

      My main concerns and comments are listed below.

      (1) Could the authors provide the "raw" data of the PIT tests, such as PreSame vs Same vs PreDiNerent vs DiNerent? Could the authors clarify how the Net responding was calculated? Was it Same minus PreSame & DiNerent minus PreDiNerent, or was the average of PreSame and PreDiNerent used in this calculation?

      The raw data for PIT testing across all experiments are now included in the Supplemental Figures (Supplemental Figures S1E, S2E, S3E, and S4E). Baseline responding was quantified as the average number of lever presses per minute for both actions during the two-minute period (i.e., average of PreSame and PreDiMerent) preceding each stimulus presentation. This methodology has been clarified in the revised manuscript (page 7).

      (2) While both sexes are utilized in the current study, no statistical analysis is provided. Can the authors please comment on this point and provide these analyses (for both training and tests)?

      As noted in the original manuscript, the final sample sizes for female and male rats were insuMicient to provide adequate statistical power for sex-based analyses (page 15). To address this limitation, we have now cited a previous study from our laboratory (Burton et al., 2014) that conducted such analyses with suMicient power in identical behavioural tasks. That study identified only marginal sex diMerences in performance, with female rats exhibiting slightly higher magazine entry rates during Pavlovian conditioning. Importantly, no diMerences were observed in outcome-specific PIT or value-based choice performance between sexes.

      (3) Regarding Figure 1 - Anterograde tracing in D1-Cre and A2a-Cre rats (from line 976), I have one major and one minor question:

      (3.1) I do not understand the rationale of showing anterograde tracing from the Dorsal Striatum (DS) as this region is not studied in the current work. Moreover, sagittal micrographs of D1-Cre and A2a-Cre would be relevant here. Could the authors please provide these micrographs and explain the rationale for doing tracing in DS?

      We included dorsal striatum (DS) tracing data as a reference because the projection patterns of D1 and D2 SPNs in this region are well-established and extensively characterized, in contrast to the more limited literature on these cell types in the NAc-S. Regarding the comment about sagittal micrographs, we are uncertain of the specific concern as these images are presented in Figure 1B.

      If the reviewer is requesting sagittal micrographs for NAc-S anterograde tracing, we did not employ this approach because: (1) the NAc-S and ventral pallidum are anatomically adjacent regions and (2) the medial-lateral coordinates of the ventral pallidum and lateral hypothalamus do not align optimally with those of the NAc-S, limiting the utility of sagittal analysis for these projections.

      (3.2) There is no description about how the quantifications were done: manually? Automatically? What script or plugin was used? If automated, what were the thresholding conditions? How many brain sections along the anteroposterior axis? What was the density of these subpopulations? Can the authors include a methodological section to address this point?

      We apologize for the omission of quantification methods used to assess viral transduction specificity. This methodological description has now been added to the revised manuscript (page 22). Briefly, we employed a manual procedure in two sections per rat, and cell counts were completed in a defined region of interest located around the viral infusion site.

      (4) Lex A & Hauber (2008) Dopamine D1 and D2 receptors in the nucleus accumbens core and shell mediate Pavlovian-instrumental transfer. Learning & memory 15:483- 491, should be cited and discussed. It also seems that the contribution of the main dopaminergic source of the brain, the ventral tegmental area, is not cited, while it has been investigated in PIT in at least 3 studies regarding sPIT only, notably the VP-VTA pathway (Leung & Balleine 2015, accurately cited already).

      We did not include the Lex & Hauber (2008) study because its experimental design (single lever and single outcome) prevents diMerentiation between the eMects of Pavlovian stimuli on action performance (general PIT) versus action selection (outcome-specific PIT, as examined in the present study). Drawing connections between their findings and our results would require speculative interpretations regarding whether their observed eMects reflect general or outcome-specific PIT mechanisms, which could distract from the core findings reported in the article.

      Several studies examining the role of the VTA in outcome-specific PIT were referenced in the manuscript's introduction. Following the reviewer's recommendation, these references have also been incorporated into the discussion section (page 13). 

      (5) While not directly the focus of this study, it would be interesting to highlight the accumbens dissociation between General vs Specific PIT, and how the dopaminergic system (diNerentially?) influences both forms of PIT.

      We agree with the reviewer that the double dissociation between nucleus accumbens core/shell function and general/specific PIT is an interesting topic. However, the present manuscript does not examine this dissociation, the nucleus accumbens core, or general PIT. Similarly, our study does not directly investigate the dopaminergic system per se. We believe that discussing these topics would distract from our core findings and substantially increase manuscript length without contributing novel data directly relevant to these areas. 

      (6) While authors indicate that conditioned response to auditory stimuli (magazine visits) are persevered in all groups, suggesting intact sensitivity to the general motivational properties of reward-predictive stimuli (lines 344, 360), authors can't conclude about the specificity of this behavior i.e. does the subject use a mental representation of O1 when experiencing S1, leading to a magazine visits to retrieve O1 (and same for S2-O2), or not? Two food ports would be needed to address this question; also, authors should comment on the fact that competition between instrumental & pavlovian responses does not explain the deficits observed.

      We agree with the Reviewer that magazine entry data cannot be used to draw conclusions about specificity, and we do not make such claims in our manuscript. We are therefore unclear about the specific concern being raised. Following the Reviewer’s recommendation, we have commented on the fact that response competition could not explain the results obtained (page 11, see also supplemental discussion). 

      The minor comments are listed below.

      (7) A high number of rats were excluded (> 32 total), and the number of rats excluded for NAc-S D1-SPNs-VP is not indicated.

      We apologize for omitting the number of rats excluded from the experiment examining NAc-S D1-SPN projections to the ventral pallidum. This information has been added to the revised manuscript (page 22).

      (7.1) Can authors please comment on the elevated number of exclusions?

      A total of 133 rats were used across the reported experiments, with 40 rats excluded based on post-mortem analyses. This represents an attrition rate of approximately 30%, which we consider reasonable given that most animals received two separate viral infusions and two separate fiber-optic cannula implantations, and that the inclusion of both female and male rats contributed to some variability in coordinates and so targeting. 

      (7.2) Can authors please present the performance of these animals during the tasks (OFF conditions, and for control ones, both ON & OFF conditions)?

      Rats were excluded after assessing the spread of viral infusions, placement of fibre-optic cannulas and potential damage due to the surgical procedures (page 21). The requested data are presented below and plotted in the same manner as in Figures 3-6. The pattern of performance in excluded animals was highly variable. 

      Author response image 1.

       

      (8) For tracing, only males were used, and for electrophysiology, only females were used.

      (8.1) Can authors please comment on not using both sexes in these experiments? 

      We agree that equal allocation of female and male rats in the experiments presented in Figures 1-2 would have been preferable. Animal availability was the sole factor determining these allocations. Importantly, both female and male D1-Cre and A2A-Cre rats were used for the NAc-S tracing studies, and no sex diMerences were observed in the projection patterns. The article describing the two transgenic lines of rats did not report any sex diMerence (Pettibone et al., 2019). 

      (8.2) Is there evidence in the literature that the electrophysiological properties of female versus male SPNs could diNer?

      The literature indicates that there is no sex diMerence in the electrophysiological properties of NAc-S SPNs (Cao et al., 2018; Willett et al., 2016).  

      (8.3) It seems like there is a discrepancy between the number of animals used as presented in the Figure 2 legend versus what is described in the main text. In the Figure legend, I understand that 5 animals were used for D1-Cre/DIO-eNpHR3.0 validation, and 7 animals for A2a-Cre/DIO-eNpHR3.0; however, the main text indicates the use of a total of 8 animals instead of the 12 presented in the Figure legend. Can authors please address this mismatch or clarify?

      The number of rats reported in the main text and Figure 2 legend was correct. However, recordings sometimes involved multiple cells from the same animal, and this aspect of the data was incorrectly reported and generated confusion. We have clarified the numbers in both the main text and Figure 2 legend to distinguish between animal counts and cell counts. 

      (9) Overall, in the study, have the authors checked for outliers?

      Performance across all training and testing stages was inspected to identify potential behavioral outliers in each experiment. Abnormal performance during a single session within a multi-session stage was not considered suMicient grounds for outlier designation. Based on these criteria, no subjects remaining after post-mortem analyses exhibited performance patterns warranting exclusion through statistical outlier analysis. However, we have conducted the specific analyses requested by the Reviewer, as described below. 

      (9.1) In Figure 3, it seems that one female in the eYFP group, in the OFF situation, for the diNerent condition, has a higher level of responding than the others. Can authors please confirm or refute this visual observation with the appropriate statistical analysis?

      Statistical analysis (z-score) confirmed the reviewer's observation regarding responding of the diMerent action in the OFF condition for this subject (|z| = 2.58). Similar extreme responding was observed in the ON condition (|z| = 2.03). Analyzing responding on the diMerent action in isolation is not informative in the context of outcome-specific PIT. Additional analyses revealed |z| < 2 when examining the magnitude of choice discrimination in outcome-specific PIT (i.e., net same versus net diMerent responding) in both ON and OFF conditions. Furthermore, this subject showed |z| < 2 across all other experimental stages. Based on these analyses, we conclude that the subject should be kept in all analyses. 

      (9.2) In Figure 5, it seems that one male, in the ON situation, in the diNerent condition, has a quite higher level of responding - is this subject an outlier? If so, how does it aNect the statistical analysis after being removed? And who is this subject in the OFF condition?

      The reviewer has identified two diMerent male rats infused with the eNpHR3.0 virus and has asked closer examination of their performance.

      The first rat showed outlier-level responding on the diMerent action in the ON condition (|z| = 2.89) but normal responding for all other measures across LED conditions (|z| < 2). Additional analyses revealed |z| = 2.55 when examining choice discrimination magnitude in outcome-specific PIT during the ON condition but not during the OFF condition (|z| = 0.62). This subject exhibited |z| < 2 across all other experimental stages.

      The second rat showed outlier-level responding on the same action in the OFF condition (|z| = 2.02) but normal responding for all other measures across LED conditions (|z| < 2). Additional analyses revealed |z| = 2.12 when examining choice discrimination magnitude in outcome-specific PIT during the OFF condition but not during the ON condition (|z| = 0.67). This subject also exhibited |z| < 2 across all other experimental stages.

      We excluded these two subjects and conducted the same analyses as described in the original manuscript. Baseline responding did not diMer between groups (p = 0.14), allowing to look at the net eMect of the stimuli. Overall lever presses were greater in the eYFP rats (Group: F(1,16) = 6.08, p < 0.05; η<sup>2</sup> = 0.28) and were reduced by LED activation (LED: F(1,16) = 9.52, p < 0.01; η<sup>2</sup> = 0.44) and this reduction depended on the group considered (Group x LED: F(1,16) = 12.125, p < 0.001; η<sup>2</sup> = 0.43). Lever press rates were higher on the action earning the same outcome as the stimuli compared to the action earning the diMerent outcome (Lever: F(1,16)= 49.32; η<sup>2</sup> = 0.76; p < 0.001), regardless of group (Group x Lever: p = 0.14). There was a Lever by LED light condition interaction (Lever x LED: F(1,16)= 5.25; η<sup>2</sup> = 0.24; p < 0.05) but no an interaction between group, LED light condition, and Lever during the presentation of the predictive stimuli (p = 0.10). Given the significant Group x LED and Lever x LED interactions, additional analyses were conducted to determine the source of these interactions. In eYFP rats, LED activation had no eMect (LED: p = 0.70) and lever presses were greater on the same action (Lever: (F(1,9) = 23.94, p < 0.001; η<sup>2</sup> = 0.79) regardless of LED condition (LED x Lever: p = 0.72). By contrast, in eNpHR3.0 rats, lever presses were reduced by LED activation (LED: F(1,9) = 23.97, p < 0.001; η<sup>2</sup> = 0.73), were greater on the same action (Lever: F(1,9) = 16.920, p < 0.001; η<sup>2</sup> = 0.65) and the two factors interacted (LED x Lever: F(1,9) = 9.12, p < 0.01; η<sup>2</sup> = 0.50). These rats demonstrated outcome-specific PIT in the OFF condition (F(1,9) = 27.26, p < 0.001; η<sup>2</sup> = 0.75) but not in the ON condition (p = 0.08).

      Overall, excluding these two rats altered the statistical analyses, but both the original and revised analyses yielded the same outcome: silencing the NAc-S D1-SPN to VP pathway disrupted PIT. More importantly, we do not believe there are suMicient grounds to exclude the two rats identified by the reviewer. These animals did not display outlier-level responding across training stages or during the choice test. Their potential classification as outliers would be based on responding during only one LED condition and not the other, with notably opposite patterns between the two rats despite belonging to the same experimental group. 

      (10) I think it would be appreciable if in the cartoons from Figure 5.A and 6.A, the SPNs neurons were color-coded as in the results (test plots) and the supplementary figures (histological color-coding), such as D1- in blue & D2-SPNs in red.

      Our current color-coding system uses blue for D1-SPNs transduced with eNpHR3.0 and red for D2-SPNs transduced with eNpHR3.0. The D1-SPNs and D2-SPNs shown in Figures 5A and 6A represent cells transduced with either eYFP (control) or eNpHR3.0 virus and therefore cannot be assigned the blue or red color, which is reserved for eNpHR3.0transduced cells specifically. The micrographs in the Supplemental Figures maintain consistency with the color-coding established in the main figures.

      (11) As there are (relatively small) variations in the control performance in term of Net responding (from ~3 to ~7 responses per min), I wonder what would be the result of pooling eYFP groups from the two first experiments (Figures 3 & 4) and from the two last ones (Figures 5 & 6) - would the same statically results stand or vary (as eYFP vs D1-Cre vs A2a-Cre rats)? In particular for Figures 3 & 4, with and without the potential outlier, if it's indeed an outlier.

      We considered the Reviewer’s recommendation but do not believe the requested analysis is appropriate. The Reviewer is requesting the pooling of data from subjects of distinct transgenic strains (D1-Cre and A2A-Cre rats) that underwent surgical and behavioral procedures at diMerent time points, sometimes months apart. Each experiment was designed with necessary controls to enable adequate statistical analyses for testing our specific hypotheses. 

      (12) Presence of cameras in operant cages is mentioned in methods, but no data is presented regarding recordings, though authors mention that they allow for real-time observations of behavior. I suggest removing "to record" or adding a statement about the fact that no videos were recorded or used in the present study.

      We have removed “to record” from the manuscript (page 18). 

      (13) In all supplementary Figures, "F" is wrongly indicated as "E".

      We thank the Reviewer for reporting these errors, which have been corrected. 

      (14) While the authors acknowledge that the eNicacy of optogenetic inhibition of terminals is questionable, I think that more details are required to address this point in the discussion (existing literature?). Maybe, the combination of an anterograde tracer from SPNs to VP, to label VP neurons (to facilitate patching these neurons), and the Credependent inhibitory opsin in the NAc Shell, with optogenetic illumination at the level of the VP, along with electrophysiological recordings of VP neurons, could help address this question but may, reasonably, seem challenging technically.

      Our manuscript does not state that optogenetic inhibition of terminals is questionable. It acknowledges that we do not provide any evidence about the eMicacy of the approach. Regardless, we have provided additional details and suggestions to address this lack of evidence (page 13). 

      (15) A nice addition could be an illustration of the proposed model (from line 374), but it may be unnecessary.

      We have carefully considered the reviewer's recommendation. The proposed model is detailed in three published articles, including one that is freely accessible, which we have cited when presenting the model in our manuscript (page 14). This reference should provide interested readers with easy access to a comprehensive illustration of the model.

      Reviewer 2 (Recommendations for the Author):

      As noted in my public comments, this is a truly excellent and compelling study. I have only a few minor comments.

      (1) I could not find the coordinates/parameters for the dorsal striatal AAV injections for that component of the tract tracing experiment.

      We apologize for this omission, which has now been corrected (page 16). 

      (2) Please add the final group sizes to the figure captions.

      We followed the Reviewer’s recommendation and added group sizes in the main figure captions. 

      (3) The discussion of group exclusions (p 21 line 637) seems to accidentally omit (n = X) the number of NAc-S D1-SPNs-VP mice excluded.

      We apologize for this omission, which has now been corrected (page 22). 

      (4) There were some labeling issues in the supplementary figures (perhaps elsewhere, too). Specifically, panel E was listed twice (once for F) in captions.

      We apologize for this error, which has now been corrected.  

      (5) Inspection of the magazine entry data from PIT tests suggests that the optogenetic manipulations may have had some eNects on this behavior and would encourage the authors to probe further. There was a significant group diNerence for D1-SPN inhibition and a marginal group eNect for D2-SPNs. The fact that these eNects were in opposite directions is intriguing, although not easily interpreted based on the canonical D1/D2 model. Of course, the eNects are not specific to the light-on trials, but this could be due to carryover into light-oN trials. An analysis of trial-order eNects seems crucial for interpreting these eNects. One might also consider normalizing for pre-test baseline performance. Response rates during Pavlovian conditioning seem to suggest that D2eNpHR mice showed slightly higher conditioned responding during training, which contrasts with their low entry rates at test. I don't see any of this as problematic -- but more should be done to interpret these findings.

      We thank the reviewer for raising this interesting point regarding magazine entry rates. Since these data are presented in the Supplemental Figures, we have added a section in the Supplemental Material file that elaborates on these findings. This section does not address trial order eMects, as trial order was fully counterbalanced in our experiments and the relevant statistical analyses would lack adequate power. Baseline normalization was not conducted because the reviewer's suggestion was based on their assumption that eNpHR3.0 rats in the D2-SPNs experiment showed slightly higher magazine entries during Pavlovian training. However, this was not the case. In fact, like the eNpHR3.0 rats in the D1-SPNs experiment, they tended to display lower magazine entries during training. The added section therefore focuses on the potential role of response competition during outcome-specific PIT tests. Although we concluded that response competition cannot explain our findings, we believe it may complicate interpretation of magazine entry behavior. Thus, we recommend that future studies examine the role of NAc-S SPNs using purely Pavlovian tasks. It is worth nothing that we have recently completed experiments (unpublished) examining NAc-S D1- and D2-SPN silencing during stimulus presentation in a Pavlovian task identical to the one used here. Silencing of either SPN population had no eMect on magazine entry behavior.

      Reviewer 3 (Recommendations for the Author):

      Broad comments:

      Throughout the manuscript, the authors draw parallels between the eNect established via pharmacological manipulations and those shown here with optogenetic manipulation. I understand using the pharmacological data to launch this investigation, but these two procedures address very diNerent physiological questions. In the case of a pharmacological manipulation, the targets are receptors, wherever they are expressed, and in the case of D2 receptors, this means altering function in both pre-synaptically expressed autoreceptors and post-synaptically expressed D2 MSN receptors. In the case of an optogenetic approach, the target is a specific cell population with a high degree of temporal control. So I would just caution against comparing results from these types of studies too closely.

      Related to this point is the consideration of the physiological relevance of the manipulation. Under normal conditions, dopamine acts at D1-like receptors to increase the probability of cell firing via Ga signaling. In contrast, dopamine binding of D2-like receptors decreases the cell's firing probability (signaling via Gi/o). Thus, shunting D1MSN activation provides a clear impression of the role of these cells and, putatively, the role of dopamine acting on these cells. However, inhibiting D2-MSNs more closely mimics these cells' response to dopamine (though optogenetic manipulations are likely far more impactful than Gi signaling). All this is to say that when we consider the results presented here in Experiment 2, it might suggest that during PIT testing, normal performance may require a halting of DA release onto D2-MSNs. This is highly speculative, of course, just a thought worth considering.

      We agree with the comments made by the Reviewer, and the original manuscript included statements acknowledging that pharmacological approaches are limited in the capacity to inform about the function of NAc-S SPNs (pages 4 and 9). As noted by the Reviewer, these limitations are especially salient when considering NAc-S D2-SPNs. Based on the Reviewer’s comment, we have modified our discussion to further underscore these limitations (page 12). Finally, we agree with the suggestion that PIT may require a halting of DA release onto D2-SPNs. This is consistent with the model presented, whereby D2-SPNs function is required to trigger enkephalin release (page 13).     

      Section-Specific Comments and Questions:

      Results:

      Anterograde tracing and ex vivo cell recordings in D1 Cre and A2a Cre rats: Why are there no statistics reported for the e-phys data in this section? Was this merely a qualitative demonstration? I realize that the A2a-Cre condition only shows 3 recordings, so I appreciate the limitations in analyzing the data presented.

      The reviewer is correct that we initially intended to provide a qualitative demonstration. However, we have now included statistical analyses for the ex vivo recordings. It is important to note that there were at least 5 recordings per condition, though overlapping data points may give the impression of fewer recordings in certain conditions. We have provided the exact number of recordings in both the main text (page 5) and figure legend. 

      What does trial by trial analysis look like, because in addition to the eNects of extinction, do you know if the responsiveness of the opsin to light stimulation is altered after repeated exposures, or whether the cells themselves become compromised in any way with repeated light-inhibition, particularly given the relatively long 2m duration of the trial.

      The Reviewer raises an interesting point, and we provide complete trial-by-trial data for each experiment below. As identified by the Reviewer, there is some evidence for extinction, although it remained modest. Importantly, the data suggest that light stimulation did not aMect the physiology of the targeted cells. In eNpHR3.0 rats, performance across OFF trials remained stable (both for Same and DiMerent) even though they were preceded by ON trials, indicating no carryover eMects from optical stimulation.

      Author response image 2.

       

      The statistics for the choice test are not reported for eNpHR-D1-Cre rats, but do show a weakening of the instrumental devaluation eNect "Group x Lever x LED: F1,18 = 10.04, p < 0.01, = 0.36". The post hoc comparisons showed that all groups showed devaluation, but it is evident that there is a weakening of this eNect when the LED was on (η<sup>2</sup> = 0.41) vs oN (η<sup>2</sup> = 0.78), so I think the authors should soften the claim that NAcS-D1s are not involved in value-based decision-making. (Also, there is a typo in the legend in Figure S1, where the caption for panel "F" is listed as "E".) I also think that this could be potentially interesting in light of the fact that with circuit manipulation, this same weakening of the instrumental devaluation eNect was not observed. To me, this suggests that D1-NAcS that project to a diNerent region (not VP) contribute to value-based decision making.

      This comment overlaps with one made in the Public Review, for which we have already provided a response. Given its importance, we have added a section addressing this point in the supplemental discussion of the Supplementary Material file, which aligns with the location of the relevant data. The caption labelling error has been corrected.

      Materials and Methods:

      Subjects:

      Were these heterozygous or homozygous rats? If hetero, what rats were used for crossbreeding (sex, strain, and vendor)? Was genotyping done by the lab or outsourced to commercial services? If genotyping was done within the lab, please provide a brief description of the protocol used. How was food restriction established and maintained (i.e., how many days to bring weights down, and was maintenance achieved by rationing or by limiting ad lib access to food for some period in the day)?

      The information requested by the Reviewer have been added to the subjects section (pages 15-16).  

      Were rats pair/group housed after implantation of optic fibers?

      We have clarified that rats were group houses throughout (see subjects section; pages 15-16). 

      Behavioral Procedures:

      How long did each 0.2ml sucrose infusion take? For pellets, for each US delivery, was it a single pellet or two in quick succession?

      We have modified the method section to indicate that the sucrose was delivered across 2 seconds and that a single pellet was provided (page 17). 

      The CS to ITI duration ratio is quite low. Is there a reason such a short ratio was used in training?

      These parameters are those used in all our previous experiments on outcome-specific PIT. There is no specific reason for using such a ratio, except that it shortens the length of the training session. 

      Relative to the end of training, when were the optical implantation surgeries conducted, and how much recovery time was given before initiating reminder training and testing?

      Fibre-optic implantation was conducted 3-4 days after training and another 3-4 days were given for recovery. This has been clarified in the Materials and methods section (pages 15-16).

      I think a diagram or schematic showing the timeline for surgeries, training, and testing would be helpful to the audience.

      We opted for a text-based experimental timeline rather than a diagram due to slight temporal variations across experiments (page 15).

      On trials, when the LED was on, was light delivered continuously or pulsed? Do these opto-receptors 'bleach' within such a long window?

      We apologize for the lack of clarity; the light was delivered continuously. We have modified the manuscript (pages 6 and 19) and figure legend accordingly. The postmortem analysis did not provide evidence for photobleaching (Supplemental Figures) and as noted above, the behavioural results do not indicate any negative physiological impact on cell function.  

      Immunofluorescence: The blocking solution used during IHC is described as "NHS"; is this normal horse serum?

      The Reviewer is correct; NHS stands for normal horse serum. This has been added (page 21). 

      Microscopy and imaging:

      For the description of rats excluded due to placement or viral spread problems, an n=X is listed for the NAc S D1 SPNs --> VP silencing group. Is this a typo, or was that meant to read as n=0? Also, was there a major sex diNerence in the attrition rate? If so, I think reporting the sex of the lost subjects might be beneficial to the scientific community, as it might reflect a need for better guidance on sex-specific coordinates for targeting small nuclei.

      We apologize for the error regarding the number of excluded animals. This error has been corrected (page 23). There were no major sex diMerences in the attrition rate. The manuscript has been updated to provide information about the sex of excluded animals (page 23). 

      References

      Cao, J., Willett, J. A., Dorris, D. M., & Meitzen, J. (2018). Sex DiMerences in Medium Spiny Neuron Excitability and Glutamatergic Synaptic Input: Heterogeneity Across Striatal Regions and Evidence for Estradiol-Dependent Sexual DiMerentiation. Front Endocrinol (Lausanne), 9, 173. https://doi.org/10.3389/fendo.2018.00173

      Corbit, L. H., Muir, J. L., Balleine, B. W., & Balleine, B. W. (2001). The role of the nucleus accumbens in instrumental conditioning: Evidence of a functional dissociation between accumbens core and shell. J Neurosci, 21(9), 3251-3260. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=11312 310&retmode=ref&cmd=prlinks

      Corbit, L. H., & Balleine, B. W. (2011). The general and outcome-specific forms of Pavlovian-instrumental transfer are diMerentially mediated by the nucleus accumbens core and shell. J Neurosci, 31(33), 11786-11794. https://doi.org/10.1523/JNEUROSCI.2711-11.2011

      Laurent, V., Bertran-Gonzalez, J., Chieng, B. C., & Balleine, B. W. (2014). δ-Opioid and Dopaminergic Processes in Accumbens Shell Modulate the Cholinergic Control of Predictive Learning and Choice. J Neurosci, 34(4), 1358-1369. https://doi.org/10.1523/JNEUROSCI.4592-13.2014

      Laurent, V., Leung, B., Maidment, N., & Balleine, B. W. (2012). μ- and δ-opioid-related processes in the accumbens core and shell diMerentially mediate the influence of reward-guided and stimulus-guided decisions on choice. J Neurosci, 32(5), 1875-1883. https://doi.org/10.1523/JNEUROSCI.4688-11.2012

      Matamales, M., McGovern, A. E., Mi, J. D., Mazzone, S. B., Balleine, B. W., & BertranGonzalez, J. (2020). Local D2- to D1-neuron transmodulation updates goal-directed learning in the striatum. Science, 367(6477), 549-555. https://doi.org/10.1126/science.aaz5751

      Parkes, S. L., Bradfield, L. A., & Balleine, B. W. (2015). Interaction of insular cortex and ventral striatum mediates the eMect of incentive memory on choice between goaldirected actions. J Neurosci, 35(16), 6464-6471. https://doi.org/10.1523/JNEUROSCI.4153-14.2015

      Pettibone, J. R., Yu, J. Y., Derman, R. C., Faust, T. W., Hughes, E. D., Filipiak, W. E., Saunders, T. L., Ferrario, C. R., & Berke, J. D. (2019). Knock-In Rat Lines with Cre Recombinase at the Dopamine D1 and Adenosine 2a Receptor Loci. eNeuro, 6(5). https://doi.org/10.1523/ENEURO.0163-19.2019

      Willett, J. A., Will, T., Hauser, C. A., Dorris, D. M., Cao, J., & Meitzen, J. (2016). No Evidence for Sex DiMerences in the Electrophysiological Properties and Excitatory Synaptic Input onto Nucleus Accumbens Shell Medium Spiny Neurons. eNeuro, 3(1), ENEURO.0147-15.2016. https://doi.org/10.1523/ENEURO.0147-15.2016

    1. Author response:

      Reviewer #1 (Public review):

      In the current article, Octavia Soegyono and colleagues study "The influence of nucleus accumbens shell D1 and D2 neurons on outcome-specific Pavlovian instrumental transfer", building on extensive findings from the same lab. While there is a consensus about the specific involvement of the Shell part of the Nucleus Accumbens (NAc) in specific stimulus-based actions in choice settings (and not in General Pavlovian instrumental transfer - gPIT, as opposed to the Core part of the NAc), mechanisms at the cellular and circuitry levels remain to be explored. In the present work, using sophisticated methods (rat Cre-transgenic lines from both sexes, optogenetics, and the well-established behavioral paradigm outcome-specific PIT-sPIT), Octavia Soegyono and colleagues decipher the differential contribution of dopamine receptors D1 and D2 expressing spiny projection neurons (SPNs).

      After validating the viral strategy and the specificity of the targeting (immunochemistry and electrophysiology), the authors demonstrate that while both NAc Shell D1- and D2-SPNs participate in mediating sPIT, NAc Shell D1-SPNs projections to the Ventral Pallidum (VP, previously demonstrated as crucial for sPIT), but not D2-SPNs, mediates sPIT. They also show that these effects were specific to stimulus-based actions, as value-based choices were left intact in all manipulations.

      This is a well-designed study, and the results are well supported by the experimental evidence. The paper is extremely pleasant to read and adds to the current literature.

      We thank the Reviewer for their positive assessment.

      Reviewer #2 (Public review):

      Summary:

      This manuscript by Soegyono et al. describes a series of experiments designed to probe the involvement of dopamine D1 and D2 neurons within the nucleus accumbens shell in outcome-specific Pavlovian-instrumental transfer (osPIT), a well-controlled assay of cue-guided action selection based on congruent outcome associations. They used an optogenetic approach to phasically silence NAc shell D1 (D1-Cre mice) or D2 (A2a-Cre mice) neurons during a subset of osPIT trials. Both manipulations disrupted cue-guided action selection but had no effects on negative control measures/tasks (concomitant approach behavior, separate valued guided choice task), nor were any osPIT impairments found in reporter-only control groups. Separate experiments revealed that selective inhibition of NAc shell D1 but not D2 inputs to ventral pallidum was required for osPIT expression, thereby advancing understanding of the basal ganglia circuitry underpinning this important aspect of decision making.

      Strengths:

      The combinatorial viral and optogenetic approaches used here were convincingly validated through anatomical tract-tracing and ex vivo electrophysiology. The behavioral assays are sophisticated and well-controlled to parse cue and value-guided action selection. The inclusion of reporter-only control groups is rigorous and rules out nonspecific effects of the light manipulation. The findings are novel and address a critical question in the literature. Prior work using less decisive methods had implicated NAc shell D1 neurons in osPIT but suggested that D2 neurons may not be involved. The optogenetic manipulations used in the current study provide a more direct test of their involvement and convincingly demonstrate that both populations play an important role. Prior work had also implicated NAc shell connections to ventral pallidum in osPIT, but the current study reveals the selective involvement of D1 but not D2 neurons in this circuit. The authors do a good job of discussing their findings, including their nuanced interpretation that NAc shell D2 neurons may contribute to osPIT through their local regulation of NAc shell microcircuitry.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      The current study exclusively used an optogenetic approach to probe the function of D1 and D2 NAc shell neurons. Providing a complementary assessment with chemogenetics or other appropriate methods would strengthen conclusions, particularly the novel demonstration of D2 NAc shell involvement. Likewise, the null result of optically inhibiting D2 inputs to the ventral pallidum leaves open the possibility that a more complete or sustained disruption of this pathway may have impaired osPIT.

      We acknowledge the reviewer's valuable suggestion that demonstrating NAc-S D1- and D2-SPN engagement in outcome-specific PIT through another technique would strengthen our optogenetic findings. Several approaches could provide this validation. Chemogenetic manipulation, as the reviewer suggested, represents one compelling option. Alternatively, immunohistochemical assessment of phosphorylated histone H3 at serine 10 (P-H3) offers another promising avenue, given its established utility in reporting striatal SPN plasticity in the dorsal striatum (Matamales et al., 2020). We hope to complete such an assessment in future work since it would address the limitations of previous work that relied solely on ERK1/2 phosphorylation measures in NAc-S SPNs (Laurent et al., 2014).

      Regarding the null result from optical silencing of D2 terminals in the ventral pallidum, we agree with the reviewer's assessment. While we acknowledge this limitation in the current manuscript (see discussion), we aim to address this gap in future studies to provide a more complete mechanistic understanding of the circuit.

      Reviewer #3 (Public review):

      Summary:

      The authors present data demonstrating that optogenetic inhibition of either D1- or D2-MSNs in the NAc Shell attenuates expression of sensory-specific PIT while largely sparing value-based decision on an instrumental task. They also provide evidence that SS-PIT depends on D1-MSN projections from the NAc-Shell to the VP, whereas projections from D2-MSNs to the VP do not contribute to SS-PIT.

      Strengths:

      This is clearly written. The evidence largely supports the authors' interpretations, and these effects are somewhat novel, so they help advance our understanding of PIT and NAc-Shell function.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      I think the interpretation of some of the effects (specifically the claim that D1-MSNs do not contribute to value-based decision making) is not fully supported by the data presented.

      We appreciate the reviewer's comment regarding the marginal attenuation of value-based choice observed following NAc-S D1-SPN silencing. While this manipulation did produce a slight reduction in choice performance, the behavior remained largely intact. We are hesitant to interpret this marginal effect as evidence for a direct role of NAc-S D1-SPNs in value-based decision-making, particularly given the substantial literature demonstrating that NAc-S manipulations typically preserve such choice behavior (Corbit & Balleine, 2011; Corbit et al., 2001; Laurent et al., 2012). Notably, previous work has shown that NAc-S D1 receptor blockade impairs outcome-specific PIT while leaving value-based choice unaffected (Laurent et al., 2014). We favor an alternative explanation for our observed marginal reduction. As documented in Supplemental Figure 1, viral transduction extended slightly into the nucleus accumbens core (NAc-C), a region established as critical for value-based decision-making (Corbit & Balleine, 2011; Corbit et al., 2001; Laurent et al., 2012). The marginal impairment may therefore reflect inadvertent silencing of a small NAc-C D1-SPN population rather than a functional contribution from NAc-S D1-SPNs. Future studies specifically targeting larger NAc-C D1-SPN populations would help clarify this possibility and provide definitive resolution of this question.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The authors used high-density probe recordings in the medial prefrontal cortex (PFC) and hippocampus during a rodent spatial memory task to examine functional sub-populations of PFC neurons that are modulated vs. unmodulated by hippocampal sharp-wave ripples (SWRs), an important physiological biomarker that is thought to have a role in mediating information transfer across hippocampal-cortical networks for memory processes. SWRs are associated with the reactivation of representations of previous experiences, and associated reactivation in hippocampal and cortical regions has been proposed to have a role in memory formation, retrieval, planning, and memory-guided behavior. This study focuses on awake SWRs that are prevalent during immobility periods during pauses in behavior. Previous studies have reported strong modulation of a subset of prefrontal neurons during hippocampal SWRs, with some studies reporting prefrontal reactivation during SWRs that have a role in spatial memory processes. The study seeks to extend these findings by examining the activity of SWR-modulated vs. unmodulated neurons across PFC sub-regions, and whether there is a functional distinction between these two kinds of neuronal populations with respect to representing spatial information and supporting memory-guided decision-making.

      Strengths:

      The major strength of the study is the use of Neuropixels 1.0 probes to monitor activity throughout the dorsal-ventral extent of the rodent medial prefrontal cortex, permitting an investigation of functional distinction in neuronal populations across PFC sub-regions. They are able to show that SWR-unmodulated neurons, in addition to having stronger spatial tuning than SWR-modulated neurons as previously reported, also show stronger directional selectivity and theta-cycle skipping properties.

      Weaknesses:

      (1) While the study is able to extend previous findings that SWR-modulated PFC neurons have significantly lower spatial tuning that SWR-unmodulated neurons, the evidence presented does not support the main conclusion of the paper that only the unmodulated neurons are involved in spatial tuning and signaling upcoming choice, implying that SWR-modulated neurons are not involved in predicting upcoming choice, as stated in the abstract. This conclusion makes a categorical distinction between two neuronal populations, that SWR-modulated neurons are involved and SWR-unmodulated are not involved in predicting upcoming choice, which requires evidence that clearly shows this absolute distinction. However, in the analyses showing non-local population decoding in PFC for predicting upcoming choice, the results show that SWR-unmodulated neurons have higher firing rates than SWR-modulated neurons, which is not a categorical distinction. Higher firing rates do not imply that only SWR-unmodulated neurons are contributing to the non-local decoding. They may contribute more than SWR-modulated neurons, but there are no follow-up analyses to assess the contribution of the two sub-populations to non-local decoding.

      We agree with the reviewer that this is indeed not a categorical distinction, and do not wish to claim that the SWR-modulated neurons have absolutely no role in non-local decoding and signaling upcoming choice. We have adjusted this in the title, abstract and text to clarify this for the reader. Furthermore, we have performed additional analyses to elucidate the role of SWR-modulated neurons in non-local decoding by creating separate decoding models for SWR-modulated and unmodulated PFC neurons respectively. These analyses show that the SWR-unmodulated neurons are indeed encoding representations of the upcoming choice more often than the alternative choice, whereas the SWR-modulated neurons do not reliably differentiate the upcoming and alternative choices in non-local decoding at the choice point (see new Fig 4d).

      (2) Further, the results show that during non-local representations of the hippocampus of the upcoming options, SWR-excited PFC neurons were more active during hippocampal representations of the upcoming choice, and SWR-inhibited PFC neurons were less active during hippocampal representations of the alternative choice. This clearly suggests that SWR-modulated neurons are involved in signaling upcoming choice, at least during hippocampal non-local representations, which contradicts the main conclusion of the paper.

      This does not contradict the main conclusion of the paper, but in fact strengthens the hypothesis we are putting forward: that the SWR-modulated neurons are more linked to the hippocampal non-local representations, whereas the SWR-unmodulated neurons seem to have their own encoding of upcoming choice which is not linked to the signatures in the hippocampus (almost no time overlap with hippocampal representations, no phase locking to hippocampal theta, no time locking to hippocampal SWRs, no increased firing during hippocampal representations of upcoming choice).

      (3) Similarly, one of the analyses shows that PFC nonlocal representations show no preference for hippocampal SWRs or hippocampal theta phase. However, the examples shown for non-local representations clearly show that these decodes occur prior to the start of the trajectory, or during running on the central zone or start arm. The time period of immobility prior to the start arm running will have a higher prevalence of SWRs and that during running will have a higher prevalence of theta oscillations and theta sequences, so non-local decoded representations have to sub-divided according to these known local-field potential phenomena for this analysis, which is not followed.

      These analyses are in fact separated based on proximity to SWRs (only segments that occurred within 2 seconds of SWR onset were included, see Methods) and theta periods respectively (selected based on a running speed of more than 5 cm/s and the absence of SWRs in the hippocampus, see Methods). We have clarified this in the main text.

      (4) The primary phenomenon that the manuscript relies on is the modulation of PFC neurons by hippocampal SWRs, so it is necessary to perform the PFC population decoding analyses during SWRs (or examine non-local decoding that occurs specifically during SWRs), as reported in previous studies of PFC reactivation during SWRs, to see if there is any distinction between modulated and unmodulated neurons in this reactivation. Even in the case of independent PFC reactivation as reported by one study, this PFC reactivation was still reported to occur during hippocampal SWRs, therefore decoding during SWRs has to be examined. Similarly, the phenomenon of theta cycle skipping is related to theta sequence representations, so decoding during PFC and hippocampal theta sequences has to be examined before coming to any conclusions.

      The histograms shown in Figure 5a (see updated Fig 5a where we look at the closest SWR in time and compare the occurrence with shuffled data) show that there is no increased prevalence of decoding upcoming and alternative choices in the PFC during hippocampal SWRs. The lack of overlap of non-local decoding between the hippocampus and PFC further shows that these non-local representations occur at different timepoints in the PFC and hippocampus (see updated Fig 4e where we added a statistical test to show the likeliness of the overlap between the decoded segments in the PFC and hippocampus). Based on the reviewer's suggestion, we have additionally decoded the information in the PFC during hippocampal SWRs exclusively, and found that the direction on the maze could not be predicted based on the decoding of SWR time points in the PFC. See figure below. Similarly, we can see from the histograms in Figure 5c that there is no phase locking to the hippocampal theta phase for non-local representations in the PFC, and in contrast there is phase locking of the hippocampal encoding of upcoming choice to the rising phase of the theta cycle (Fig 6c), further highlighting the separation between these two regions in the non-local decoding.

      Reviewer #2 (Public review):

      Summary:

      This work by den Bakker and Kloosterman contributes to the vast body of research exploring the dynamics governing the communication between the hippocampus (HPC) and the medial prefrontal cortex (mPFC) during spatial learning and navigation. Previous research showed that population activity of mPFC neurons is replayed during HPC sharp-wave ripple events (SWRs), which may therefore correspond to privileged windows for the transfer of learned navigation information from the HPC, where initial learning occurs, to the mPFC, which is thought to store this information long term. Indeed, it was also previously shown that the activity of mPFC neurons contains task-related information that can inform about the location of an animal in a maze, which can predict the animals' navigational choices. Here, the authors aim to show that the mPFC neurons that are modulated by HPC activity (SWRs and theta rhythms) are distinct from those "encoding" spatial information. This result could suggest that the integration of spatial information originating from the HPC within the mPFC may require the cooperation of separate sets of neurons.

      This observation may be useful to further extend our understanding of the dynamics regulating the exchange of information between the HPC and mPFC during learning. However, my understanding is that this finding is mainly based upon a negative result, which cannot be statistically proven by the failure to reject the null hypothesis. Moreover, in my reading, the rest of the paper mainly replicates phenomena that have already been described, with the original reports not correctly cited. My opinion is that the novel elements should be precisely identified and discussed, while the current phrasing in the manuscript, in most cases, leads readers to think that these results are new. Detailed comments are provided below.

      Major concerns:

      (1) The main claim of the manuscript is that the neurons involved in predicting upcoming choices are not the neurons modulated by the HPC. This is based upon the evidence provided in Figure 5, which is a negative result that the authors employ to claim that predictive non-local representations in the mPFC are not linked to hippocampal SWRs and theta phase. However, it is important to remember that in a statistical test, the failure to reject the null hypothesis does not prove that the null hypothesis is true. Since this claim is so central in this work, the authors should use appropriate statistics to demonstrate that the null hypothesis is true. This can be accomplished by showing that there is no effect above some size that is so small that it would make the effect meaningless (see https://doi.org/10.1177/070674370304801108).

      We would like to highlight a few important points here. (1) We indeed do not intend to claim that the SWR-modulated neurons are not at all involved in predicting upcoming choice, just that the SWR-unmodulated neurons may play a larger role. We have rephrased the title and abstract to make this clearer. (2) The hypothesis that we put forward is based not only on a negative effect, but on the findings that: the SWR-unmodulated neurons show higher spatial tuning (Fig 3b), more directional selectivity (Fig 3d), more frequent encoding of the upcoming choice at the choice point (new analysis, added in Fig 4d), and higher spike rates during the representations of the upcoming choice (Fig 5b). This is further highlighted by the fact that the representations of upcoming choice in the PFC are not time locked to SWRs (whereas the hippocampal representations of upcoming choice are;  see Fig 5a and Fig 6a), and not time-locked to hippocampal theta phase (whereas the hippocampal representations are; see Fig 5c and Fig 6c). Finally, the representations of upcoming and alternative choices in the PFC do not show a large overlap in time with the representations in the hippocampus (see updated Fig 4e were we added a statistical test to show the likelihood of the overlap of decoded timepoints). All these results together lead us to hypothesize that SWR-modulation is not the driving factor behind non-local decoding in the PFC. (3) Based on the reviewers suggestion, we have added a statistical test to compare the phase-locking based of the non-local decoding to hippocampal SWRs and theta phase to shuffled posterior probabilities. Instead of looking at all SWRs in a -2 to 2 second window, we have now only selected the closest SWR in time within that window, and did the statistical comparison in the bin of 0-20 ms from SWR onset. With this new analysis we are looking more directly at the time-locking of the decoded segments to SWR onset (see updated Fig 5a and 6a).

      (2) The main claim of the work is also based on Figure 3, where the authors show that SWRs-unmodulated mPFC neurons have higher spatial tuning, and higher directional selectivity scores, and a higher percentage of these neurons show theta skipping. This is used to support the claim that SWRs-unmodulated cells encode spatial information. However, it must be noted that in this kind of task, it is not possible to disentangle space and specific task variables involving separate cognitive processes from processing spatial information such as decision-making, attention, motor control, etc., which always happen at specific locations of the maze. Therefore, the results shown in Figure 3 may relate to other specific processes rather than encoding of space and it cannot be unequivocally claimed that mPFC neurons "encode spatial information". This limitation is presented by Mashoori et al (2018), an article that appears to be a major inspiration for this work. Can the authors provide a control analysis/experiment that supports their claim? Otherwise, this claim should be tempered. Also, the authors say that Jadhav et al. (2016) showed that mPFC neurons unmodulated by SWRs are less tuned to space. How do they reconcile it with their results?

      The reviewer is right to assert caution when talking about claims such as spatial tuning where other factors may also be involved. Although we agree that there may be some other factors influencing what we are seeing as spatial tuning, it is very important to note that the behavioral task is executed on a symmetrical 4-armed maze, where two of the arms are always used for the start of the trajectory, and the other two arms (North and South) function as the goal (reward) arms. Therefore, if the PFC is encoding cognitive processes such as task phases related to decision-making and reward, we would not be able to differentiate between the two start arms and the two goal arms, as these represent the same task phases. Note also that the North and South arm are illuminated in a pseudo-random order between trials and during cue-based rule learning this is a direct indication of where the reward will be found. Even in this phase of the task, the PFC encodes where the animal will turn on a trial-to-trial basis (meaning the North and South arm are still differentiated correctly on each trial even though the illumination and associated reward are changing).

      Secondly, importantly, the reviewer mentions that we claimed that Jadhav et al. (2016) showed that mPFC neurons unmodulated by SWRs are less tuned to space, but this is incorrect. Jadhav et al. (2016) showed that SWR-unmodulated neurons had lower spatial coverage, meaning that they are more spatially selective (congruent with our results). We have rephrased this in the text to be clearer.

      (3) My reading is that the rest of the paper mainly consists of replications or incremental observations of already known phenomena with some not necessarily surprising new observations:

      (a) Figure 2 shows that a subset of mPFC neurons is modulated by HPC SWRs and theta (already known), that vmPFC neurons are more strongly modulated by SWRs (not surprising given anatomy), and that theta phase preference is different between vmPFC and dmPFC (not surprising given the fact that theta is a travelling wave).

      The finding that vmPFC neurons are more strongly modulated by SWRs than dmPFC indeed matches what we know from anatomy, but that does not make it a trivial finding. A lot remains unknown about the mPFC subregions and their interactions with the hippocampus, and not every finding will be directly linked to the anatomy. Therefore, in our view this is a significant finding which has not been studied before due to the technical complexity of large-scale recordings along the dorsal-ventral axis of the mPFC.

      Similarly, theta being a traveling wave (which in itself is still under debate), does not mean we should assume that the dorsal and ventral mPFC should follow this signature and be modulated by different phases of the theta cycle. Again, in our view this is not at all trivial, but an important finding which brings us closer to understanding the intricate interactions between the hippocampus and PFC in spatial learning and decision-making.

      (b) Figure 4 shows that non-local representations in mPFC are predictive of the animal's choice. This is mostly an increment to the work of Mashoori et al (2018). My understanding is that in addition to what had already been shown by Mashoori et al here it is shown how the upcoming choice can be predicted. The author may want to emphasize this novel aspect.

      In our view our manuscript focuses on a completely different aspect of learning and memory than the paper the reviewer is referring to (Mashoori et al. 2018). Importantly, the Mashoori et al. paper looked at choice evaluation at reward sites and shows that disappointing reinforcements are associated with reactivations in the ACC of the unselected target. This points to the role of the ACC in error detection and evaluation. Although this is an interesting result, it is in essence unrelated to what we are focusing on here, which is decision making and prediction of upcoming choices. The fact that the turning direction of the animal can be predicted on a trial-to-trial basis, and even precedes the behavioral change over the course of learning, sheds light on the role of the PFC in these important predictive cognitive processes (as opposed to post-choice reflective processes).

      (c) Figure 6 shows that prospective activity in the HPC is linked to SWRs and theta oscillations. This has been described in various forms since at least the works of Johnson and Redish in 2007, Pastalkova et al 2008, and Dragoi and Tonegawa (2011 and 2013), as well as in earlier literature on splitter cells. These foundational papers on this topic are not even cited in the current manuscript.

      We have added these citations to the introduction (line 37).

      Although some previous work is cited, the current narrative of the results section may lead the reader to think that these results are new, which I think is unfair. Previous evidence of the same phenomena should be cited all along the results and what is new and/or different from previous results should be clearly stated and discussed. Pure replications of previous works may actually just be supplementary figures. It is not fair that the titles of paragraphs and main figures correspond to notions that are well established in the literature (e.g., Figure 2, 2nd paragraph of results, etc.).

      We have changed the title of paragraph 2 and Figure 2 to highlight more clearly the novel result (the difference between the dorsal and ventral mPFC), and have improved clarity of the text throughout to highlight the novelty of our results better.

      (d) My opinion is that, overall, the paper gives the impression of being somewhat rushed and lacking attention to detail. Many figure panels are difficult to understand due to incomplete legends and visualizations with tiny, indistinguishable details. Moreover, some previous works are not correctly cited. I tried to make a list of everything I spotted below.

      We have addressed all the comments in the Recommendations for Authors.

      Reviewer #1 (Recommendations for the authors):

      (1) Expanding on the points above, one of the strengths of the study is expanding the previous result that SWR-unmodulated neurons are more spatially selective (Jadhav et al., 2016), across prefrontal sub-regions, and showing that these neurons are more directionally selective and show more theta cycle skipping. Theta cycle skipping is related to theta sequence representations and previous studies have established PFC theta sequences in parallel to hippocampal theta sequences (Tang et al., 2021; Hasz and Redish, 2020; Wang et al., 2024), and the theta cycle skipping result suggests that SWR-unmodulated neurons should show stronger participation than SWR-modulated neurons in PFC theta sequences that decode to upcoming or alternative location, which can be tested in this high-density PFC physiology data. This is still unlikely to make a categorical distinction that only SWR-unmodulated neurons participate in theta sequence decoding, but will be useful to examine.

      We thank the reviewer for their suggestion and have now included results based on separate decoding models that only use SWR-modulated or SWR-unmodulated mPFC neurons. From this analysis we see that indeed SWR-unmodulated neurons are not the only group contributing to theta sequence decoding, but they do distinguish more strongly between the upcoming and alternative arms at the choice point (see new Fig 4d).

      (2) Non-local decoding in 50ms windows on a theta timescale is a valid analysis, but ignoring potential variability in the internal state during running vs. immobility, and as indicated by LFPs by the presence of SWRs or theta oscillations, is incorrect especially when conclusions are being made about decoding during SWRs and theta oscillation phase, and in light of previous evidence that these are distinct states during behavior. There are multiple papers on PFC theta sequences (Tang et al., 2021; Hasz and Redish, 2020; Wang et al., 2024), and on PFC reactivation during SWRs (Shin et al., 2019; Kaefer et al., 2020; Jarovi et al., 2023), and this dataset of high-density prefrontal recordings using Neuropixels 1.0 provides an opportunity to investigate these phenomena in detail. Here, it should be noted that although Kaefer et al. reported independent prefrontal reactivation from hippocampal reactivation, these PFC reactivation events still occurred during hippocampal SWRs in their data, and were linked to memory performance.

      From our data we see that the time segments that represent upcoming or alternative choice in the prefrontal cortex are in fact not time-locked to hippocampal SWRs (updated Fig 5a where we look only at the closest SWR in time and compare this to shuffled data). In addition, these segments do not overlap much with the decoded segments in the hippocampus (see updated Fig 4e where we added a shuffling procedure to assess the likelihood of the overlap with hippocampal decoded segments). Importantly, we are not ignoring the variability during running and immobility, as theta segments were selected based on a running speed of more than 5 cm/s and the absence of SWRs in the hippocampus (see Methods), ensuring that the theta and SWR analyses were done on the two different behavioral states respectively. We have  clarified this in the main text.

      (3) The majority of rodent studies make the distinction between ACC, PrL, and IL, although as the authors noted, there are arguments that rodent mPFC is a continuum (Howland et al., 2022), or even that rodent mPFC is a unitary cingulate cortical region (van Heukelum et al., 2020). The authors choose to present the results as dorsal (ACC + dorsal PrL) vs. ventral mPFC (ventral PrL + IL), however, in my opinion, it will be more useful to the field to see results separately for ACC, PrL, and IL, given the vast literature on connectivity and functional differences in these regions.

      We appreciate the reviewer’s suggestion. Initially, we did perform all analyses separately for the ACC, PLC and ILC subregions. However, we observed that the differences between subregions (strength of SWR-modulation and the phase locking to theta) varied uniformly along the dorsal-ventral axis, i.e., the PLC showed a profile of SWR-modulation and theta phase locking that fell in between that of the ACC and the ILC. This is also highlighted in paragraph 3 of the introduction (lines 52-56). For that reason, and for the sake of reducing the number of variables, increasing statistical power, and improving readability, we focused on the dorsal-ventral distinction instead, as this is where the main differences were seen.

      (4) I suggest that the authors refrain from making categorical distinctions as in their title and abstract, such as "neurons that are involved in predicting upcoming choice are not the neurons that are modulated by hippocampal sharp-wave ripples" when the evidence presented can only support gradation of participation of the two neuronal sub-populations, not an absolute distinction. The division of SWR-modulated and SWR-unmodulated neurons itself is determined by the statistic chosen to divide the neurons into one or two sub-classes and will vary with the statistical threshold employed. Further, previous studies have suggested that SWR-excited and SWR-inhibited neurons comprise distinct functional sub-populations based on their activity properties (Jadhav et al., 2016; Tang et al., 2017), but it is not clear to what degree is SWR-modulated neurons a distinct and singular functional sub-population. In the absence of connectivity information and cross-correlation measures within and across sub-populations, it is prudent to be conservative about this interpretation of SWR-unmodulated neurons.

      We agree with the reviewer that the distinction is not categorical and have changed the wording in the title and abstract. We also do not intend to claim that the SWR-modulated neurons are a distinct and singular functional sub-population, and for that reason the firing rates from the SWR-excited and SWR-inhibited groups are reported separately throughout the paper.

      Reviewer #2 (Recommendations for the authors):

      Minor detailed remarks:

      (1) The authors should provide a statistical test, perhaps against shuffled data, for Figures 5a,c and 6a,c.

      We thank the reviewer for their suggestion and have added statistical tests in Figures 5a, 5c, 6a and 6c.

      (2) The behavioral task is explained only in the legend of Figure 1c, and the explanation is quite vague. In this type of article format, readers need to have a clear understanding of the task without having to refer to the methods section. A clear understanding of the task is crucial for interpreting all subsequent analyses. In my opinion, the word 'trial' in the figure is misleading, as these are sessions composed of many trials.

      We have added a more thorough description of the behavioral task, both in the main text and the Figure legend.

      (3) Figure 1d, legend of markers missing.

      We have added a legend for the markers.

      (4) When there are multiple bars and a single p-value is presented, it is unclear which group comparisons the p-value pertains to. For instance, Figures 2c-f and 3b, d, f (right parts), and 5b...

      For all p-values we have added lines to the figures that indicate the groups that were compared and have added descriptions of the statistical test to the figure legends to indicate what each p-value represents.

      (5) In Figure 3c, the legend does not explain what the colored lines represent, and the lines themselves are very small and almost indistinguishable.

      We have changed the colored lines to quadrants on the maze to clarify what each direction represents.

      (6) Figure 4a is too small, and the elements are so tiny that it is impossible to distinguish them and their respective colors. The term 'segment' has not been unequivocally explained in the text. All the different elements of the panel should be explicitly explained in the legend to make it easily understandable. What do the pictograms of the maze on the left represent? What does the dashed vertical line indicate?

      We have added the definition of a segment in the text (lines 283-286) and have improved the clarity and readability of Figure 4a.

      (7) In Figure 5, what do the red dots on the right part relate to? The legend should explicitly explain what is shown in the left and right parts, respectively. What comparisons do the p-values relate to?

      We have adjusted the legend to explain the left and right parts of the figure and we have added the statistical test that was used to get to the p-value (in addition to the text which already explained this).

      (8) Panels b of Figures 5 and 6 should have the same y-axis scale for comparison. The position of the p-values should also be consistent. With the current arrangement in Figure 6, it is unclear what the p-values relate to.

      We have adjusted the y-scale to be the same for Figures 5 and 6, and we have added a description of the statistical test to the legend.

      (9) Multiple studies have previously shown that mPFC activity contains spatial information (e.g., refs 24-27). It is important that, throughout the paper, the authors frame their results in relation to previous findings, highlighting what is novel in this work.

      We thank the reviewer for this valuable suggestion. In the revised manuscript, we have indicated more clearly which results replicate previous findings and highlighted novel results.

      (10) Please note that Peyrache et al. (2009) do not show trajectory replay, nor do they decode location. I am not familiar with all the cited literature, but this makes me think that the authors may want to double-check their citations to ensure they assign the correct claims to each past work.

      We have adjusted the reference to the work to exclude the word ‘trajectory’ and doublechecked our other citations.

      (11) The authors perform theta-skipping analysis, first described by Kay et al., but do not cite the original paper until the discussion.

      Thank you pointing out this oversight. We have now included this citation earlier in the paper (line 231).

      (12) Additionally, some parts of the text are difficult to grasp, and there are English vocabulary and syntax errors. I am happy to provide comments on the next version of the text, but please include page and line numbers in the PDF. The authors may also consider using AI to correct English mistakes and improve the fluency and readability of their text.

      We have carefully gone through the text to correct any errors.  We have now also included page and line numbers and we will be happy to address any specific issues the reviewer may spot in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews: 

      Reviewer #1 (Public review): 

      This study presents evidence that remote memory in the APP/PS1 mouse model of Alzheimer's disease (AD) is associated with PV interneuron hyperexcitability and increased inhibition of cortical engram cells. Its strength lies in the fact that it explores a neglected aspect of memory research - remote memory impairments related to AD (for which the primary research focus is usually on recent memory impairments) -which has received minimal attention to date. While the findings are intriguing, the weakness of the paper hovers around purely correlational types of evidence and superficial data analyses, which require substantial revisions as outlined below. 

      We thank the reviewer for their feedback, and we appreciate the recognition of the study’s novelty in addressing remote memory impairments in AD. We acknowledge the reviewer’s concerns and have implemented revisions to strengthen the manuscript.

      Major concerns: 

      (1) In light of previous work, including that by the authors themselves, the data in Figure 1 should be implemented by measurements of recent memory recall in order to assess whether remote memories are exclusively impaired or whether remote memory recall merely represents a continuation of recent memory impairments.

      We agree with the reviewer that is an important point. In line with their suggestion in minor comment 1, we now omitted the statement on recent memory in the results (previously on lines 109-111 and 117). Nonetheless, previous independent experiments from our group have repeatedly shown recent memory deficits in APP/PS1 mice at 12 weeks of age, including a recent article published in 2023. We refer the reviewer to figure 2c in Végh et al. (2014) and figure 2i in Kater et al. (2023). We have added a reference of the latter paper to our discussion section (line 458-459). Therefore, we are confident that the recent memory deficit at 12 weeks of age is a stable phenotype in our APP/PS1 mice.

      With these data in mind, we argue that the remote memory recall impairment is not a continuation of recent memory impairments. Recent memory deficits emerge already at 12 weeks of age, and when remote memory is assessed at 16 weeks (4 weeks after training at 12 weeks of age), APP/PS1 mice are still capable of forming and retrieving a remote memory. This suggests that remote memory retrieval can occur even when recent memory is compromised, arguing against the idea that the remote memory deficit observed at 20 weeks is a continuation of earlier recent memory impairments. We have clarified this point in the revised manuscript by adding the following sentence to the discussion section (line 462-465): 

      ‘This suggests that a remote memory can be formed even when recent memory expression is already compromised, indicating that the remote memory deficit in 20-week-old APP/PS1 mice is not a continuation of earlier recent memory impairments.’

      (2) Figure 2 shows electrophysiological properties of PV cells in the mPFC that correlate with the behavior shown in Figure 1. However, the mice used in Figure 2 are different than the mice used in Figure 1. Thus, the data are correlative at best, and the authors need to confirm that behavioral impairments in the APP/PS1 mice crossed to PV-Cre (and SST-Cre mice) used in Figure 2 are similar to those of the APP/PS1 mice used in Figure 1. Without that, no conclusions between behavioral impairments and electrophysiological as well as engram reactivation properties can be made, and the central claims of the paper cannot be upheld. 

      We thank the reviewer for raising this concern. Indeed, the remote memory impairment and PV hyperexcitability are correlative data, and therefore we do not make causal claims based on these data. However, please note that most of our key findings, including behavioural impairments, characterization of the engram ensemble and reactivation thereof, as well as inhibitory input measurements, were acquired using the same mouse line (APP/PS1), strengthening the coherence of our conclusions. Also, our electrophysiological findings in APP/PS1 (enhanced sIPSC frequency) and APP/PS1-PV-Cre-tdTomato (enhanced PV cell excitability) mice align well. Direct comparisons between the transgenic mouse lines APP/PS1 and APP/PS1 Parv-Cre were performed in our previous studies, confirming that these lines are similar in terms of behaviour and pathology. Specifically, we demonstrated that APP/PS1 mice display spatial memory impairments at 16 weeks of age, Fig 4a-d, consistent with the deficits observed in APP/PS1 Parv-Cre mice at 16 weeks of age, Fig 5a-c (Hijazi et al., 2020a). Additionally, Hijazi et al. (2020a) showed that soluble and insoluble Aβ levels do not differ between APP/PS1 Parv-Cre and APP/PS1 mice (sFig. 1), indicating comparable levels of pathology between these lines. While we do not have a similar characterization of the APP/PS1 SST-Cre line, we should mention that we also did not observe excitability differences in SST cells. We now acknowledge the limitation in the revised discussion section (line 480-487), and stress that our electrophysiology and behavioural findings are correlative in nature:

      ‘Although the excitability measurements were performed in APP/PS1-PV-Cre-tdTomato mice, and not in the APP/PS1 parental line, we previously found that these transgenic mouse lines exhibit comparable amyloid pathology (both soluble and insoluble amyloid beta levels) as well as similar spatial memory deficits (Hijazi et al., 2020a; Kater et al., 2023). Thus, our observations indicate that the APP/PS1 PV-Cre-tdTomato and APP/PS1 lines are similar in terms of pathology and behaviour. Nonetheless, further work is needed to identify a causal link between PV cell hyperexcitability and remote memory impairment.’ 

      (3) The reactivation data starting in Figure 3 should be analysed in much more depth: 

      a) The authors restrict their analysis to intra-animal comparisons, but additional ones should be performed, such as inter-animal (WT vs APP/PS1) as well as inter-age (12-16w vs 16-20w). In doing so, reactivation data should be normalized to chance levels per animal, to account for differences in labelling efficiency - this is standard in the field (see original Tonegawa papers and for a reference). This could highlight differences in total reactivation that are already apparent, such as for instance in WT vs APP/PS1 at 20w (Figure 3o) and highlight a decrease in reactivation in AD mice at this age, contrary to what is stated in lines 213-214. 

      We would like to thank the reviewer for the valuable input on the reactivation data in Figure 3. 

      We agree with the reviewer and now depict the data as normalized to chance levels (Figure 3). The original figures are now supplemental (sFig. 5). The reactivation data normalized to chance are similar to the original results, i.e. no difference was observed in the reactivation of the mPFC engram ensemble between genotypes. The reviewer may have overlooked that we did perform inter-animal (WT vs. APP/PS1) comparisons, however these were not significantly different. We have made this clearer in the main text, lines 277, 288-289, 294-295 and 303-304. Moreover, the reviewer recommended including inter-age group comparisons, which have now been added to the supplemental figures (sFig. 6). No genotype-dependent differences were observed. While a main effect of age group did emerge, indicating that there is a potential increased overlap between Fos+ and mCherry+ in animals aged 16-20 weeks, we caution against overinterpreting this finding. These experimental groups were processed in separate cohorts, with viral injection and 4TM-induced tagging performed at different moments in time, which may have contributed to the observed differences in overlap. We have addressed this point in the revised discussion (line 612-617):

      ‘Furthermore, we also observed an increase in the amount overlap between Fos+ and mCherry+ engram cells when comparing the 12-16w and 16-20w age groups. This finding should be interpreted with caution, as the experimental groups were processed in separate cohorts, with viral injections and 4TM-induced tagging performed at different moments in time. This may have contributed to the observed differences between ages.’

      b) Comparing the proportion of mcherry+ cells in PV- and PV+ is problematic, considering that the PV- population is not "pure" like the PV+, but rather likely to represent a mix of different pyramidal neurons (probably from several layers), other inhibitory neurons like SST and maybe even glial cells. Considering this, the statement on line 218 is misleading in saying that PVs are overrepresented. If anything, the same populations should be compared across ages or groups.  

      We thank the reviewer for their insightful comment and agree that the PV- population of cells is likely more heterogenous than the PV+ population. However, we would like to clarify that all quantified cells were selected based on Nissl immunoreactivity, and to exclude non-neuronal cells, stringent thresholding was applied in the script that was used to identify Nissl+ cells. The threshold information has now been added to the methods section (line 758-760). Thus, although heterogenous, the analysed PV- population reflects a neuronal subset. In response to the reviewer’s suggestion, we have now included overlap measurements relative to chance levels (Figure 3). These analyses did not reveal differences with the original analyses, i.e., there are no genotype specific differences. We have also incorporated the suggested inter-age group comparisons (sFig. 6) and found no differences between age groups. In light of the raised concerns, we have removed the statement that PV cells were overrepresented in the engram ensemble.

      c) A similar concern applies to the mcherry- population in Figure 4, which could represent different types of neurons that were never active, compared to the relatively homogeneous engram mcherry+ population. This could be elegantly fixed by restricting the comparison to mCherry+Fos+ vs mCherry+Fos- ensembles and could indicate engram reactivation-specific differences in perisomatic inhibition by PV cells. 

      The comparison the reviewer suggests, comparing mCherry+Fos+ to mCherry+Fos- is indeed conceptually interesting and could provide more insight into engram reactivation and PV input. However, there are practical limitations to performing this analysis, as neurons in close proximity need to be compared in a pairwise manner to account for local variability in staining intensity. As shown in Figure 3c+k and Figure 4a+b, d+e, PV immunostaining intensity varies to a certain extend within a given image. While pairwise comparisons of neighbouring neurons were feasible when analysing mCherry+ and mCherry- cells, they are unfortunately not feasible for the mCherry+Fos+ vs. mCherry+Fos- comparison. The occurrence of spatially adjacent mCherry+Fos+ and mCherry+Fos- neurons is too sparse for a pairwise comparison. This analysis would therefore result in substantial under-sampling and limit the reliability of the analysis. Nonetheless, we agree with the reviewer that the mCherry- population may be more heterogenous than the mCherry+ population, despite the fact that PV+ neurons and that non-neuronal cells were excluded from both populations in the analyses. We therefore added a statement to the discussion to acknowledge this limitation (line 536-539): 

      ‘Although PV+ cells were not included in this analysis and we excluded non-neuronal cells based on the area of the Nissl stain, the mCherry- population was potentially more heterogenous than the mCherry+ population, which may have contributed to the differences we observed.’

      (4) At several instances, there are some doubts about the statistical measures having been employed: 

      a) In Figure 4f, it is unclear why a repeated measurement ANOVA was used as opposed to a regular ANOVA. 

      b) In Supplementary Figure 2b, a Mann-Whitney test was used, supposedly because the data were not normally distributed. However, when looking at the individual data points, the data does seem to be normally distributed. Thus, the authors need to provide the test details as to how they measured the normalcy of distribution. 

      a) Based on the pairwise comparison of neighbouring neurons within animals, the data in Figure 4f was analysed with a repeated measure ANOVA. 

      b) We thank the author for their comment on Supplementary Figure 2b. The data is indeed normally distributed, and we have analysed it using a D’Agostino & Pearson test. We have corrected this in the supplemental figure. 

      Minor concerns: 

      (1) Line 117: The authors cite a recent memory impairment here, as shown by another paper. However, given the notorious difficulty in replicating behavioral findings, in particular in APP/PS1 mice (number of backcrossings, housing conditions, etc., might differ between laboratories), such a statement cannot be made. The authors should either show in their own hands that recent memory is indeed affected at 12 weeks of age, or they should omit this statement. 

      We thank the reviewer for this thoughtful comment. As noted in our response to major concern (1), we have addressed this concern by providing additional information and clarification in the discussion (line 462-465) regarding the possibility that remote memory impairments are a continuation of recent memory impairments. As mentioned in our response, we have added a reference to a more recent study from our lab (Kater et al. (2023). These findings are consistent with the earlier report from our lab (Végh et al. (2014), underscoring the reproducibility of this phenotype across independent cohorts and time. Notably, the experiments in the 2023 and present study were performed using the same housing and experimental conditions. Nevertheless, in light of the reviewer’s suggestion, and to avoid overstatement or speculation, we have now omitted the sentence referring to recent memory impairments at 12 weeks of age from the results section.

      (2) Pertaining to Figure 3, low-resolution images of the mPFC should be provided to assess the spread of injection and the overall degree of double-positive cells.  

      We agree with the reviewer and have added images of the mPFC as a supplemental figure (sFig. 3) that show the spread of the injection. Unfortunately, it is not possible to visualize the overall degree of double-positive cells at a lower magnification (or low-resolution). Representative examples of colocalization are presented in Figure 3.

      Reviewer #2 (Public review): 

      This study presents a comprehensive investigation of remote memory deficits in the APP/PS1 mouse model of Alzheimer's disease. The authors convincingly show that these deficits emerge progressively and are paralleled by selective hyperexcitability of PV interneurons in the mPFC. Using viral-TRAP labeling and patch-clamp electrophysiology, they demonstrate that inhibitory input onto labeled engram cells is selectively increased in APP/PS1 mice, despite unaltered engram size or reactivation. These findings support the idea that alterations in inhibitory microcircuits may contribute to cognitive decline in AD. 

      However, several aspects of the study merit further clarification. Most critically, the central paradox, i.e., increased inhibitory input without an apparent change in engram reactivation, remains unresolved. The authors propose possible mechanisms involving altered synchrony or impaired output of engram cells, but these hypotheses require further empirical support. Additionally, the study employs multiple crossed transgenic lines without reporting the progression of amyloid pathology in the mPFC, which is important for interpreting the relationship between circuit dysfunction and disease stage. Finally, the potential contribution of broader network dysfunction, such as spontaneous epileptiform activity reported in APP/PS1 mice, is also not addressed. 

      We thank the reviewer for their evaluation and appreciate the positive assessment of our study’s contributing to understanding remote memory deficits and the dysfunction of inhibitory microcircuits in AD. We also acknowledge the relevant points raised and have revised the manuscript to clarify our interpretations. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) Line 68: What are "APP23xPS45" mice? This is most likely a typo.

      This line is a previously reported double transgenic amyloid beta mouse model that was obtained by crossing APP23 (overexpressing human amyloid precursor protein with the Swedish double mutation at position 670/671) with PS45 (carrying a transgene for mutant Presenilin 1, G384A mutation) (Busche et al., 2008; Grienberger et al., 2012). 

      (2) Line 148: The authors should also briefly describe in the main text that APP/PS1 x SST-Cre mice were generated and used here.  

      We thank the reviewer for their comment and have added their suggestion to the main text (line 166-168):

      ‘To do this, APP/PS1 mice were crossed with SST-Cre mice to generate APP/PS1 SST-Cre mice. Following microinjection of AAV-hSyn::DIO-mCherry into the mPFC, recordings were obtained from SST neurons.’

      (3) The discussion should be condensed because of redundancies on several occasions. For example, memory allocation is discussed starting on line 371, then again on line 392. This should be combined. Likewise, how the correlative nature of the findings about PV interneurons could be further functionally addressed is discussed on lines 413 and 454, and should be condensed into one paragraph. 

      We thank the reviewer for this suggestion and have revised the discussion to remove the redundancies as proposed.  

      Reviewer #2 (Recommendations for the authors): 

      To strengthen the manuscript, the following points should be addressed: 

      (1) Quantify amyloid pathology: It is essential to assess amyloid-β levels (soluble and insoluble) in the mPFC of APP/PS1-PV-Cre-tdTomato mice at the studied ages. This would help determine whether the observed circuitlevel changes track with disease progression as seen in canonical APP/PS1 models. 

      We thank the reviewer for this valuable suggestion and agree that assessing Aβ levels in the mPFC is important to determine whether the observed circuit level alterations in APP/PS1 mice coincide with the progression of amyloid pathology. Therefore, we assessed the amyloid plaque load in the mPFC of APP/PS1 mice at 16 and 20 weeks of age (new supplemental figure sFig. 1) and observed no difference in plaque load between these two time points. This suggests that the increased excitability in the mPFC cannot be attributed to differences in plaque load (insoluble amyloid beta).

      In line with this, we previously studied both soluble and insoluble Aβ levels in the CA1 and reported that there are no differences between 12 and 16 weeks of age (Kater et al., 2023), while PV cell hyperexcitability is present at 16 weeks of age (Hijazi et al., 2020a). From 24 weeks onwards, the level of amyloid beta increases. Similarly, Végh et al. (2014) showed using immunoblotting that monomeric and low molecular weight oligomeric forms of soluble Aβ are already present as early as 6 weeks of age and become more prominent at 24 weeks of age. Although the soluble Aβ measurements were performed in the hippocampus, we think these findings can be extrapolated to cortical regions, as the APP and PS1 mutations in APP/PS1 mice are driven by a prion promotor, which should induce consistent expression across brain regions. Data from other research groups support this hypothesis (Kim et al., 2015; Zhang et al., 2011). Thus, large regional differences in soluble Aβ are not expected. The temporal progression suggests that increasing levels of soluble amyloid beta might contribute to the emergence of PV cell hyperexcitability. We have added this point to the manuscript (line 585-591):

      ‘Since amyloid beta plaque load in the mPFC remains comparable between 16- and 20-week-old APP/PS1 mice, the observed increased excitability is unlikely the result of changes in insoluble amyloid beta levels. Previous data from our lab show that soluble amyloid beta is already present as early as 6 weeks of age and becomes more prominent at 24 weeks of age (Kater et al., 2023; Végh et al., 2014). The progressive increase in soluble amyloid beta levels may contribute to the emergence of PV cell hyperexcitability.’

      Finally, we previously compared soluble and insoluble amyloid beta levels in APP/PS1 and APP/PS1 Parv Cre mice and show that these are similar (Hijazi et al., 2020a). While our current study shows the progression of amyloid beta accumulation in APP/PS1 mice, these mice also exhibit altered microcircuitry (enhanced sIPSC frequency on engram cells) at 20 weeks of age, the same age at which we observed PV cell hyperexcitability in APP/PS1 Parv Cre tdTomato mice. This further supports the generalizability of our findings across genotypes, between APP/PS1 and APP/PS1 Parv Cre tdTomato mice. 

      (2) Examine later disease stages: Since the current effects are modest, assessing memory performance, PV cell excitability, and engram inhibition at more advanced stages could clarify whether these alterations become more pronounced with disease progression. 

      We thank the reviewer for this thoughtful suggestion. Investigating advanced disease stages could indeed provide valuable insights into whether the observed alterations in memory performance, PV cell hyperexcitability and engram inhibition become more pronounced over time. Our previous work has shown that changes in pyramidal cell excitability emerge at a later stage than in PV cells, supporting the idea of progressive circuit dysfunction (Hijazi et al., 2020a). However, at these more advanced stages, additional pathological processes, such as an increased gliosis (Janota, Brites, Lemere, & Brito, 2015; Kater et al., 2023) and synaptic loss (Alonso-Nanclares, MerinoSerrais, Gonzalez, & DeFelipe, 2013; Bittner et al., 2012), will likely contribute to both electrophysiological and behavioural measurements. Furthermore, we would like to point out that the current changes observed in memory performance, PV hyperexcitability and increased inhibitory input on engram cells at 16-20 weeks of age are not modest, but already quite substantial. Our focus on these early time points in APP/PS1 mice were intentional, as it helps us understand the initial changes in Alzheimer’s disease at a circuit level and to identify therapeutic targets early intervention. What happens at later stages is certainly of interest, but beyond the scope of this study and should therefore be addressed in future studies. We have incorporated a discussion related to this point into the revised manuscript (line 602-606):

      ‘Moreover, it is relevant to investigate whether changes in PV and PYR cell excitability, as well as input onto engram cells in the mPFC, become more pronounced at later disease stages. Nonetheless, by focussing on early disease timepoints in the present study, we aimed to understand the initial circuit-level changes in AD and identify targets for early therapeutic intervention.’

      (3) Address network hyperexcitability: Spontaneous epileptiform activity has been reported in APP/PS1 mice from 4 months of age (Reyes-Marin & Nuñez, 2017). Including EEG data or discussing this point in relation to your findings would help contextualize the observed inhibitory remodeling within broader network dysfunction. 

      We thank the reviewer for this valuable input and for highlighting the study by Reyes-Marin and Nuñez (2017). In line with this, we recently reported longitudinal local field potential (LFP) recordings in freely behaving APP/PS1 Parv-Cre mice and wild type control animals between the ages of 3 to 12 months (van Heusden et al., 2023). Weekly recordings were performed in the home cage under awake mobile conditions. These data showed no indications of epileptiform activity during wakefulness, consistent with previous findings that epileptic discharges in APP/PS1 mice predominantly occur during sleep (Gureviciene et al., 2019). Recordings were obtained from the prefrontal cortex (PFC), parietal cortex and the hippocampus. In contrast, the study by Reyes-Marin and Nuñez (2017) recorded from the somatosensory cortex in anesthetized animals. Here, during spontaneous recordings, no differences were observed in delta, theta or alpha frequency bands between APP/PS1 and WT mice. Interestingly, we observed an early increase in absolute power, particularly in the hippocampus and parietal cortex from 12 to 24 weeks of age in APP/PS1 mice. In the PFC we found a shift in relative power from lower to higher frequencies and a reduction in theta power. Connectivity analyses revealed a progressive, age-dependent decline in theta/alpha coherence between the PFC and both the parietal cortex and hippocampus. Given the well-established role of PV interneurons network synchrony and coordinating theta and gamma oscillations critical for cognitive function (Sohal, Zhang, Yizhar, & Deisseroth, 2009; Xia et al., 2017), these findings support the idea of early circuit dysfunction in APP/PS1 mice. Our findings, i.e. hyperexcitability of PV cells, align with these LFP based networklevel observations. These data suggest an early shift in the E/I balance, contributing to altered oscillatory dynamics and impaired inter-regional connectivity, possibly leading to alterations in memory. However, whether the observed PV hyperexcitability in our study directly contributes to alterations in power and synchrony remains to be elucidated. Furthermore, it would be interesting to determine the individual contribution of PV cell hyperexcitability in the hippocampus versus the mPFC to network changes and concurrent memory deficits. We have added a statement on network hyperexcitability to the discussion (line 561-565). 

      ‘Interestingly, we recently found a progressive disruption of oscillatory network synchrony between the mPFC and hippocampus in APP/PS1 Parv-Cre mice (van Heusden et al., 2023). However, whether the observed PV cell hyperexcitability directly contributes to changes in inter-regional synchrony, and whether this leads to alterations at a network level, i.e. increased inhibitory input on engram cells, and consequently to memory deficits, remains to be elucidated in future studies.’ 

      (4) Mechanisms responsible for PV hyperexcitability: Related to the previous point, a discussion of the possible underlying mechanisms, e.g., direct effects of amyloid-β, inflammatory processes, or compensatory mechanisms, would strengthen the discussion. 

      We agree with the reviewer that this will strengthen the discussion. We have now added a comprehensive discussion in the revised manuscript to address potential mechanisms responsible for PV cell hyperexcitability (line 579-594).:

      ‘Prior studies have shown that neurons in the vicinity of amyloid beta plaques show increased excitability (Busche et al., 2008). We demonstrated that PV neurons in the CA1 are hyperexcitable and that treatment with a BACE1 inhibitors, i.e. reducing amyloid beta levels, rescues PV excitability (Hijazi et al., 2020a). In line with this, we also reported that addition of amyloid beta to hippocampal slices increases PV excitability, without altering pyramidal cell excitability (Hijazi et al., 2020a). Finally, applying amyloid beta to an induced mouse model of PV hyperexcitability further impairs PV function (Hijazi et al., 2020b). Since amyloid beta plaque load in the mPFC remains comparable between 16- and 20-week-old APP/PS1 mice, the observed increased excitability is unlikely the result of changes in insoluble amyloid beta levels. Previous data from our lab show that soluble amyloid beta is already present as early as 6 weeks of age and becomes more prominent at 24 weeks of age (Kater et al., 2023; Végh et al., 2014). The progressive increase in soluble amyloid beta levels may contribute to the emergence of PV cell hyperexcitability. We hypothesize that the hyperexcitability induced by amyloid beta may result from disrupted ion channel function, as PV neuron dysfunction can result from altered potassium (Olah et al., 2022) and sodium channel activity (Verret et al., 2012).’

      (5) Excitatory-inhibitory balance: While the main focus is on increased inhibition onto engram cells, the reported increase in sEPSC frequency (Figure 5g) across genotypes suggests the presence of excitatory remodelling as well. A brief discussion of how this may interact with increased inhibition would be valuable.  

      We thank the reviewer for this comment regarding the interaction between excitatory and inhibitory remodelling. We have now incorporated this discussion point into the revised manuscript (line 528-534):

      ‘Interestingly, both WT and APP/PS1 mice showed an increase in sEPSC frequency onto engram cells, suggesting that increased excitatory input is a consequence of memory retrieval and not affected by genotype. However, only in APP/PS1 mice, the augmented excitatory input coincided with an elevation of inhibitory input onto engram cells. The resulting imbalance between excitation and inhibition could therefore potentially disrupt the precise control of engram reactivation and contribute to the observed remote memory impairment.’

      References

      Alonso-Nanclares, L., Merino-Serrais, P., Gonzalez, S., & DeFelipe, J. (2013). Synaptic changes in the dentate gyrus of APP/PS1 transgenic mice revealed by electron microscopy. J Neuropathol Exp Neurol, 72(5), 386-395. doi:10.1097/NEN.0b013e31828d41ec

      Bittner, T., Burgold, S., Dorostkar, M. M., Fuhrmann, M., Wegenast-Braun, B. M., Schmidt, B., . . . Herms, J. (2012). Amyloid plaque formation precedes dendritic spine loss. Acta Neuropathologica, 124(6), 797807. doi:10.1007/s00401-012-1047-8

      Busche, M. A., Eichhoff, G., Adelsberger, H., Abramowski, D., Wiederhold, K. H., Haass, C., . . . Garaschuk, O. (2008). Clusters of hyperactive neurons near amyloid plaques in a mouse model of Alzheimer's disease. Science, 321(5896), 1686-1689. doi:10.1126/science.1162844

      Grienberger, C., Rochefort, N. L., Adelsberger, H., Henning, H. A., Hill, D. N., Reichwald, J., . . . Konnerth, A. (2012). Staged decline of neuronal function in vivo in an animal model of Alzheimer's disease. Nat Commun, 3, 774. doi:10.1038/ncomms1783

      Gureviciene, I., Ishchenko, I., Ziyatdinova, S., Jin, N., Lipponen, A., Gurevicius, K., & Tanila, H. (2019). Characterization of Epileptic Spiking Associated With Brain Amyloidosis in APP/PS1 Mice. Front Neurol, 10, 1151. doi:10.3389/fneur.2019.01151

      Hijazi, S., Heistek, T. S., Scheltens, P., Neumann, U., Shimshek, D. R., Mansvelder, H. D., . . . van Kesteren, R. E. (2020a). Early restoration of parvalbumin interneuron activity prevents memory loss and network hyperexcitability in a mouse model of Alzheimer's disease. Mol Psychiatry, 25(12), 3380-3398. doi:10.1038/s41380-019-0483-4

      Hijazi, S., Heistek, T. S., van der Loo, R., Mansvelder, H. D., Smit, A. B., & van Kesteren, R. E. (2020b). Hyperexcitable Parvalbumin Interneurons Render Hippocampal Circuitry Vulnerable to Amyloid Beta. iScience, 23(7), 101271. doi:10.1016/j.isci.2020.101271

      Janota, C. S., Brites, D., Lemere, C. A., & Brito, M. A. (2015). Glio-vascular changes during ageing in wild-type and Alzheimer's disease-like APP/PS1 mice. Brain Res, 1620, 153-168. doi:10.1016/j.brainres.2015.04.056

      Kater, M. S. J., Huffels, C. F. M., Oshima, T., Renckens, N. S., Middeldorp, J., Boddeke, E., . . . Verheijen, M. H. G. (2023). Prevention of microgliosis halts early memory loss in a mouse model of Alzheimer's disease. Brain Behav Immun, 107, 225-241. doi:10.1016/j.bbi.2022.10.009

      Kim, H. Y., Kim, H. V., Jo, S., Lee, C. J., Choi, S. Y., Kim, D. J., & Kim, Y. (2015). EPPS rescues hippocampus-dependent cognitive deficits in APP/PS1 mice by disaggregation of amyloid-β oligomers and plaques. ature Communications, 6(1), 8997. doi:10.1038/ncomms9997

      Olah, V. J., Goettemoeller, A. M., Rayaprolu, S., Dammer, E. B., Seyfried, N. T., Rangaraju, S., . . . Rowan, M. J. M. (2022). Biophysical Kv3 channel alterations dampen excitability of cortical PV interneurons and contribute to network hyperexcitability in early Alzheimer’s. Elife, 11, e75316. doi:10.7554/eLife.75316

      Reyes-Marin, K. E., & Nuñez, A. (2017). Seizure susceptibility in the APP/PS1 mouse model of Alzheimer's disease and relationship with amyloid β plaques. Brain Res, 1677, 93-100. doi:10.1016/j.brainres.2017.09.026

      Sohal, V. S., Zhang, F., Yizhar, O., & Deisseroth, K. (2009). Parvalbumin neurons and gamma rhythms enhance cortical circuit performance. Nature, 459(7247), 698-702. doi:10.1038/nature07991

      van Heusden, F. C., van Nifterick, A. M., Souza, B. C., França, A. S. C., Nauta, I. M., Stam, C. J., . . . van Kesteren, R. E. (2023). Neurophysiological alterations in mice and humans carrying mutations in APP and PSEN1 genes. Alzheimers Res Ther, 15(1), 142. doi:10.1186/s13195-023-01287-6

      Végh, M. J., Heldring, C. M., Kamphuis, W., Hijazi, S., Timmerman, A. J., Li, K. W., . . . van Kesteren, R. E. (2014). Reducing hippocampal extracellular matrix reverses early memory deficits in a mouse model of Alzheimer's disease. Acta Neuropathol Commun, 2, 76. doi:10.1186/s40478-014-0076-z

      Verret, L., Mann, E. O., Hang, G. B., Barth, A. M., Cobos, I., Ho, K., . . . Palop, J. J. (2012). Inhibitory interneuron deficit links altered network activity and cognitive dysfunction in Alzheimer model. Cell, 149(3), 708-721. doi:10.1016/j.cell.2012.02.046

      Xia, F., Richards, B. A., Tran, M. M., Josselyn, S. A., Takehara-Nishiuchi, K., & Frankland, P. W. (2017). Parvalbumin-positive interneurons mediate neocortical-hippocampal interactions that are necessary for memory consolidation. Elife, 6. doi:10.7554/eLife.27868

      Zhang, W., Hao, J., Liu, R., Zhang, Z., Lei, G., Su, C., . . . Li, Z. (2011). Soluble Aβ levels correlate with cognitive deficits in the 12-month-old APPswe/PS1dE9 mouse model of Alzheimer's disease. Behavioural Brain Research, 222(2), 342-350. doi:https://doi.org/10.1016/j.bbr.2011.03.072

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In this manuscript, Chang et al. investigated the cell type-specific role of the integrin activator Shv in activity-dependent synaptic remodeling. Using the Drosophila larval neuromuscular junction as a model, they show that glial-secreted Shv modulates synaptic plasticity by maintaining the extracellular balance of neuronal Shv proteins and regulating ambient extracellular glutamate concentrations, which in turn affects postsynaptic glutamate receptor abundance. Furthermore, they report that genetic perturbation of glial morphogenesis phenocopies the defects observed with the loss of glial Shv. Altogether, their findings propose a role for glia in activity-induced synaptic remodeling through Shv secretion. While the conclusions are intriguing, several issues related to experimental design and data interpretation merit further discussion.

      We appreciate the insightful and constructive comments. We have added new data and modified the text to address your concerns.  In doing so, the manuscript has been substantially strengthened.  Please see our detailed point-by-point response below. 

      Reviewer #2 (Public review):

      In this paper Chang et al follow up on their lab's previous findings about the secreted protein Shv and its role in activity-induced synaptic remodeling at the fly NMJ. Previously they reported that shv mutants have impaired synaptic plasticity. Normally a high stimulation paradigm should increase bouton size and GluR expression at synapses but this does not happen in shv mutants. The phenotypes relating to activity dependent plasticity were completely recapitulated when Shv was knocked down only in neurons and could be completely rescued by incubation in exogenously applied Shv protein. The authors also showed that Shv activation of integrin signaling on both the pre- and post- synapse was the molecular mechanism underlying its function. Here they extend their study to consider the role of Shv derived from glia in modulating synaptic features at baseline and remodeling conditions. This study is important to understand if and how glia contribute to these processes. Using cell-type specific knockdown of Shv only in glia causes abnormally high baseline GluR expression and prevents activity-dependent increases in bouton size or GluR expression post-stimulation. This does not appear to be a developmental defect as the authors show that knocking down Shv in glia after basic development has the same effects as lifelong knockdown, so Shv is acting in real time. Restoring Shv in ONLY glia in mutant animals is sufficient to completely rescue the plasticity phenotypes and baseline GluR expression, but glial-Shv does not appear to activate integrin signaling which was shown to be the mechanism for neuronally derived Shv to control plasticity. This led the authors to hypothesize that glial Shv works by controlling the levels of neuronal Shv and extracellular glutamate. They provide evidence that in the absence of glial Shv, synaptic levels of Shv go up overall, presumably indicating that neurons secrete more Shv. In this context which could then work via integrin signaling as described to control plasticity. They use a glutamate sensor and observe decreased signal (extracellular glutamate) from the sensor in glial Shv KD animals, however, this background has extremely high GluR levels at the synapse which may account for some or all of the decreases in sensor signal in this background. Additional controls to test if increased GluR density alone affects sensor readouts and/or independently modulating GluR levels in the glial KD background would help strengthen this data. In fact, glialspecific shv KD animals have baseline levels of GluR that are potentially high enough to have hit a ceiling of expression or detection that accounts for the inability for these levels to modulate any higher after strong stimulation and such a ceiling effect should be considered when interpreting the data and conclusions of this paper. Several outstanding questions remain-why can't glial derived Shv activate integrin pathways but exogenously applied recombinant Shv protein can? The effects of neuronal specific rescue of shv in a shv mutant are not provided vis-à-vis GluR levels and bouton size to compare to the glial only rescue. Inclusion of this data might provide more insight to outstanding questions of how and why the source of Shv seems to matter for some aspects of the phenotypes but not others despite the fact that exogenous Shv can rescue and in some experimental paradigms but not others.

      We appreciate your insightful comments. We have added new data and modified the text to address your concerns.  In doing so, the manuscript has been substantially strengthened.  Please also see the enclosed point-by-point response.

      To address the question of whether altered GluR density alone affects sensor readouts, we expressed GluR using a mhc promoter-driven GluRIIA fusion line, which increases total GluRIIA expression in muscle independently of the Gal4/UAS system. As shown in Figure 6 – figure supplement 1, mhc-GluRIIA animals exhibited elevated levels of not only GluRIIA but also the obligatory GluRIIC subunit. Despite this increase in GluR expression, we did not observe any change in extracellular glutamate levels, as measured by live imaging using the neuronal iGluSnFR sensor (updated Figure 6A). These results suggest that elevated GluR density alone does not alter iGluSnFR sensors dynamics and further support our conclusions.

      In regard to the question about ceiling effect, we do not think that the lack of GluR enhancement in repo>shv-RNAi is due to a saturated postsynaptic state. This is based on results in Figure 6, which shows that GluR levels can increase up to fourfold upon stimulation in the presence of glutamate, whereas repo>shv-RNAi results in only a ~2-fold increase in baseline GluR concentration. These results suggest that the synapse retains the capacity for further upregulation. 

      To address the question of why exogenously applied Shv activates integrin while glial derived Shv does not, we tested whether glia and neurons could differentially modify Shv. Based on Western blot analyses of adult heads and larval brains showing that Shv is present as a single band (Fig. 1A and Figure 2 – figure supplement 1B), the functional differences in neuronal or glial Shv is not likely due to the presence of different isoforms. Consistent with this, FlyBase also suggests that shv encodes a single isoform. However, while we did not detect obvious posttranslational modifications when Shv protein was expressed in neurons or glia (Figure 5 – figure supplement 1A), we cannot exclude the possibility that different cell types process Shv differently through post-transcriptional or post-translational mechanisms. Notably, shv is predicted to undergo A-to-I RNA editing, including an editing site in the coding region, which will result in a single amino acid change (St Laurent et al., 2013). Given that ADAR, the editing enzyme, is enriched in neurons and absent from glia (Jepson et al., 2011), such cell-specific editing could contribute to functional differences. It will be interesting to investigate this in the future. We have now included this in the Discussion section.

      Additionally, we have now included new data on neuronal Shv rescue of shv<sup>1</sup> mutants as suggested in the updated Figure 4. Consistent with previous findings that neuronal Shv rescues integrin signaling and electrophysiological phenotypes (Lee et al., 2017), we found that it also restores bouton size, GluR levels, and activity-induced synaptic remodeling. These results support the functional contribution of neuronal Shv. 

      Reviewer #3 (Public review):

      Summary:

      The manuscript by Chang and colleagues provides compelling evidence that glia-derived Shriveled (Shv) modulates activity-dependent synaptic plasticity at the Drosophila neuromuscular junction (NMJ). This mechanism differs from the previously reported function of neuronally released Shv, which activates integrin signaling. They further show that this requirement of Shv is acute and that glial Shv supports synaptic plasticity by modulating neuronal Shv release and the ambient glutamate levels. However, there are a number of conceptual and technical issues that need to be addressed.

      We appreciate the insightful and constructive comments. We have added new data and modified the text to address your concerns.  In doing so, the manuscript has been substantially strengthened.  Please see our detailed point-by-point response below.

      Major comments:

      (1) From the images provided for Fig 2B +RU486, the bouton size appears to be bigger in shv RNAi + stimulation, especially judging from the outline of GluR clusters.

      Thank you for pointing this out. We have selected another image to better represent the data.

      (2) The shv result needs to be replicated with a separate RNAi.

      We have used another independent RNAi line targeting shv to confirm our findings (BDSC 37507). This shv-RNAi<sup>37507</sup> line also showed the same phenotype, including increased GluR levels and impaired activity-induced synaptic remodeling line (new Figure 2 – figure supplement 1A).

      (3) The phenotype of shv mutant resembles that of neuronal shv RNAi - no increased GluR baseline. Any insights why that is the case?

      This is an interesting question. We speculate that neuronal Shv normally has a dominant role in maintaining GluR levels during development, mainly through its ability to activate integrin signaling. Consistent with this, we have shown that mutations in integrin leads to a drastic reduction in GluR levels at the NMJ (Lee et al., 2017). While we have shown that neuronal knockdown of shv elevates Shv from glia (Fig. 5E), glial Shv cannot activate integrin signaling (Fig. 5B, 5C). Additionally, high levels of glial Shv will elevate ambient glutamate concentrations (Figure 6A), which will likely reduce GluR abundance and impair synaptic remodeling (Augustin et al.  2007, Chen et al., 2009, and Figure 6B). Therefore, neuronal knockdown of Shv resulted in the same phenotype as shv<sup>1</sup> mutant. 

      (4) In Fig 3B, SPG shv RNAi has elevated GluR baseline, while PG shv RNAi has a lower baseline. In both cases, there is no activity induced GluR increase. What could explain the different phenotypes?

      SPG is the middle glial cell layer in the fly peripheral nervous system and may also influence the PG layer through signaling mechanisms (Lavery et al., 2007), therefore having a stronger effect. We have now mentioned this in the text. 

      (5) In Fig 4C, the rescue of PTP is only partial. Does that suggest neuronal shv is also needed to fully rescue the deficit of PTP in shv mutants?

      This is indeed a possibility. We have shown that neuronal and glial Shv each contribute to activity-induced synaptic remodeling through different mechanisms. It will be interesting test this in the future.

      (6) The observation in Fig 5D is interesting. While there is a reduction in Shv release from glia after stimulation, it is unclear what the mechanism could be. Is there a change in glial shv transcription, translation or the releasing machinery? It will be helpful to look at the full shv pool vs the released ones. 

      Thank you for the suggestion. To address this, we monitored the levels of intracellular Shv using a permeabilized preparation (we found that the addition of detergent to permeabilize the sample strips away extracellular Shv). Combined with the extracellular staining results, we can get an idea about the total amount of Shv. As shown in the updated Figure 5D, intracellular Shv levels (permeabilized) remained unchanged following stimulation, indicating that there is no intracellular accumulation and that the observed decrease in extracellular Shv is unlikely due to impaired release machinery.

      (7) In Fig 5E, what will happen after stimulation? Will the elevated glial Shv after neuronal shv RNAi be retained in the glia? 

      Thank you for the interesting question. We agree that examining Shv distribution following neuronal activity would be highly informative. While we plan to perform time-lapse experiments in future studies to address this, we feel that such analyses are beyond the scope of the current manuscript.

      (8) It would be interesting to see if the localization of shv differs based on if it is released by neuron or glia, which might be able to explain the difference in GluR baseline. For example, by using glia-Gal4>UAS-shv-HA and neuronal-QF>QUAS-shv-FLAG. It seems important to determine if they mix together after release? It is unclear if the two shv pools are processed differently.

      We agree that investigating whether neuronal and glial shv pools colocalize or are differentially processed is an important future direction. We hope to examine how each pool responds to stimulation in the shv<sup>1</sup> mutant background using LexA and Gal4 systems in the future

      (9) Alternatively, do neurons and glia express and release different Shv isoforms, which would bind different receptors?

      Thank you for the questions. We have now addressed this in the discussion and also enclosed below:

      Based on Western blot analyses of adult heads and larval brains showing that Shv is present as a single band (Fig. 1A and Figure 2 – figure supplement 1B), the functional differences in neuronal or glial Shv is not likely due to the presence of different isoforms. Consistent with this, FlyBase also suggests that shv encodes a single isoform (Ozturk-Colak et al., 2024). However, while we did not detect obvious post-translational modifications when Shv protein was expressed in neurons or glia (Figure 5 – figure supplement 1A), we cannot exclude the possibility that different cell types process Shv differently through posttranscriptional or post-translational mechanisms. Notably, shv is predicted to undergo A-to-I RNA editing, including an editing site in the coding region, which could result in a single amino acid change (St Laurent et al., 2013). Given that ADAR, the editing enzyme, is enriched in neurons and absent from glia (Jepson et al., 2011), such cell-specific editing could contribute to functional differences. It will be interesting to investigate this in the future.

      (10) It is claimed that Sup Fig 2 shows no observable change in gross glial morphology, further bolstering support that glial Shv does not activate integrin. This seems quite an overinterpretation. There is only one image for each condition without quantification. It is hard to judge if glia, which is labeled by GFP (presumably by UAS-eGFP?), is altered or not.

      Thank you for raising this concern. To strengthen our claim, we now include additional images (Figure 5, figure supplement 2). No obvious change in overall glial morphology was observed, with glia continuing to wrap the segmental nerves and extend processes that closely associate with proximal synaptic boutons (Figure 5, figure supplement 2). These observations suggest that glial  Shv is not essential for maintaining normal glial structure or survival, and is consistent with the idea that glial Shv does not activate integrin, as integrin signaling is required to maintain the integrity of peripheral glial layers. 

      (11) The hypothesis that glutamate regulates GluR level as a homeostatic mechanism makes sense. What is the explanation of the increased bouton size in the control after glutamate application in Fig 6?

      We speculate that it could be due to a retrograde signaling mechanism activated by elevated extracellular glutamate, allowing neurons to modulate bouton morphology in response to synaptic demand. It will be interesting to investigate this possibility in the future.  

      (12) What could be a mechanism that prevents elevated glial released Shv to activate integrin signaling after neuronal shv RNAi, as seen in Fig 5E?

      One potential mechanism is post-translational or post-transcriptional processing of Shv. Although our Western blots did not reveal differences in the molecular weight of glial vs. neuronal Shv, we cannot exclude the possibility that modifications not readily detectable by this method are responsible. Additionally, as mentioned in the Discussion section, post-transcriptional processing such as A-to-I RNA editing could introduce changes in the Shv protein, potentially altering its ability to interact with or activate integrin. 

      (13) Any speculation on how the released Shv pool is sensed?

      The same RNA editing modification mentioned earlier or post-translational modifications in Shv may also influence how it is sensed by target cells. 

      Reviewer #1 (Recommendations for the authors):

      Issues Regarding Cell Type-Specific Secretion and the Role of Shv:

      Extracellular Secretion of Shv:

      (1) The data in Figure 1 suggest that Shv is not secreted under resting conditions, challenging the proposed extracellular role of Shv. It remains unclear whether Shv secretion can be confirmed using Shv-eGFP (knock-in) following high K+ stimulation.

      We apologize for not being clear. In Figure 1, Shv signals we’ve shown are from permeabilized preparation, which preferentially labels intracellular Shv. We do observe secreted Shv-eGFP following stimulation (Figure 5E), consistent with our hypothesis. However, endogenous extracellular Shv-eGFP signal is very weak, and was therefore detected using the GFP antibody and amplified with a  fluorescent secondary antibody. We have now also included additional controls in Figure 5E to demonstrate the specificity of the staining.

      (2) In Figure 5D, total Shv staining should be included to evaluate potential presynaptic accumulation of intracellular Shv, which may lead to extracellular secretion upon stimulation. Additionally, the representative images of glial rescue do not seem to align with the quantification data; more extracellular Shv signals were observed after stimulation.

      Thank you for the comments. We monitored the levels of intracellular Shv using a permeabilized preparation (detergent treatment stripped away extracellular Shv signal). When combined with non-permeabilized extracellular staining, this approach provides insights into total Shv levels. We found no intracellular accumulation of Shv and the intracellular levels remained unchanged following stimulation (updated Figure 5D), suggesting that reduced extracellular Shv is not likely due to impaired release. Additionally, we have selected another image for glial rescue by avoiding the trachea region, which better represent the quantification data.

      (3) In Figure 5E, "extracellular" Shv staining in repo>shv-RNAi samples appears localized within synaptic boutons. This raises concerns about the staining protocol potentially labeling intracellular proteins. Control experiments using presynaptic cytosolic markers are needed to confirm staining specificity.

      Thank you for the thoughtful suggestion. To validate that our staining protocol is selective for extracellular proteins, we also stained for cysteine string protein (CSP), an intracellular synaptic vesicle protein predominantly located in the presynaptic terminals (Zinsmaier et al., 1990; Umbach et al., 1994), under the same conditions. CSP was detected only in the permeabilized condition (updated Figure 5E), suggesting that the non-permeabilizing protocol is selective for extracellular proteins. 

      (4) The study does not clarify why Shv knockdown in either perineurial glia or subperineurial glia abolishes stimulus-dependent synaptic remodeling. Does Shv secretion occur from PG, SPG, or both toward the synaptic bouton?

      Thank you for raising this point. SPG is the middle glial cell layer in the fly peripheral nervous system and may also influence the PG layer through signaling mechanisms (Lavery et al., 2007). Consistent with this, we observed a stronger effect on GluR levels when SPG was disrupted compared to PG. It will be interesting to distinguish whether Shv is released by PG or SPG in the future.

      (5) The possibility of an inter-glial role for Shv via integrin signaling in regulating glial morphogenesis is underexplored. The rough morphological characterization in Supplemental Figure 2 requires more detailed quantification and the use of sub-glial typespecific GAL4 drivers.

      We now include additional images (Figure 5, figure supplement 2) to examine the overall glial morphology. There was no obvious change in gross glial morphology, with glia continuing to wrap the segmental nerves and extend processes that closely associate with proximal synaptic boutons when shv is knocked down in glia (Figure 5, figure supplement 2). These observations suggest that glial  Shv is not essential for maintaining normal glial structure or survival, and is consistent with the idea that glial Shv does not activate integrin, as integrin signaling is required to maintain the integrity of peripheral glial layers (Xie and Auld, 2011; Hunter et al., 2020).

      (6) While repo>shv rescues stimulus-dependent bouton size and GluR increases in the shv mutant (Figure 5), the interaction between neuronal and glial Shv remains unclear. Does neuronal Shv influence the expression or distribution of glial Shv?

      We agree that investigating whether neuronal and glial shv pools influence each other’s expression or distribution is an important future direction. We hope to investigate this in more detail in the future using LexA-LexOp and GAL4/UAS dual expression systems.

      Issues Regarding the Regulation of GluR and Perisynaptic Glutamate by Glial Shv:

      (7) The methodology for iGluSnFR measurement (Figure 6A) is inadequately described. If anti-HRP staining was used to normalize signals, it suggests the experiment may have involved fixed tissue. However, iGluSnFR typically measures glutamate levels in live cells, raising concerns about the validity of this approach in fixed samples.

      We apologize for not being clear about the method used to measure iGluSnFR. The original figure was generated from imaging iGluSnFR signals immediately following fixation. To address the reviewer’s concern and validate these results, we have now performed live imaging experiments using a water dipping objective to measure iGluSnFR intensity in unfixed preparations (new Figure 6A). To label synaptic boutons, we co-expressed mtdTomato using the neuronal driver, nSybGAL4. The results from the live imaging experiments confirmed our original observations that glial Shv required to control ambient extracellular glutamate levels (see updated Fig. 6A and text). Additionally, to ascertain that the decrease in iGluSnFR signal reflects a decrease in ambient extracellular glutamate levels rather than glutamate depletion caused by high levels of GluR, we upregulated GluR levels using mhc-GluRIIA, which drives GluRIIA expression in muscles (Petersen et al., 1997). We found mhc-GluRIIA animals exhibited elevated levels of not only GluRIIA but also the obligatory GluRIIC subunit. However, iGluSnFR signals at the synapse remained unchanged (Figure 6A), suggesting that elevated GluR density alone does not reduce signals. Taken together, these results suggest that glial Shv plays a critical role in controlling ambient extracellular glutamate levels. 

      (8) As shown in Figure 2, repo>shv-RNAi increases GluR levels before high K+ stimulation, potentially saturating postsynaptic GluR expression and precluding further increases upon stimulation.

      Our data in Figure 6 show that GluR levels can increase up to four-fold upon stimulation in the presence of glutamate, whereas repo>shv-RNAi results in only a ~2-fold increase in baseline GluR concentration. These results suggest that the synapse retains the capacity for further upregulation. Thus, we do not think that the lack of GluR enhancement in repo>shv-RNAi is due to a saturated postsynaptic state, but rather reflects a requirement for glial Shv in activity-dependent modulation.

      (9) Despite glial shv knockdown lowering extracellular glutamate levels, GluR levels unexpectedly increase (Figure 6B). This contradicts the known requirement for high ambient glutamate concentrations to promote GluR clustering and membrane expression (Chen et al., 2009). Furthermore, adding 2 mM glutamate reverses these increases, suggesting additional complexity in the regulation of Shv synaptic remodeling.

      Thank you for the comment and the opportunity to clarify this point. While it may seem counterintuitive at first glance, our observations are in line with previous reports that showed low ambient glutamate levels significantly elevated GluR intensity at the Drosophila NMJ (Chen et al., 2009), but such increase can be reversed by glutamate supplementation (Augustin et al., 2007; Chen et al., 2009). We have revised the text to more clearly reflect this connection.

      (10) If glial Shv promotes GluR expression, why does the increased extracellular Shv from neuronal shv knockdown (elav>shv-RNAi, Figure 5E) fail to elicit stimulus-dependent GluR elevation?

      We speculate that this is because glial Shv does not activate integrin signaling (Figure 5B, C), and elevated glial Shv increases ambient glutamate concentration (Figure 6A), thereby reducing GluR expression (Augustin et al., 2007; Chen et al., 2009). This is indeed what we observed when shv is knocked down in neurons. 

      Additional Issues:

      (11) The type of bouton used for quantification (e.g., Ib or Is boutons) is not specified, which is critical for interpreting the results.

      We apologize for not being clear. We analyzed type Ib boutons as done previously (Lee et al., 2017 and Chang et al., 2024), and have now included this information in the Methods section.  

      (12) The extent of Shv protein depletion in the repo-GeneSwitch system needs validation to confirm the efficacy of the knockdown.

      Thank you for the suggestion. We confirmed the efficiency of acute shv knockdown by the repo-GeneSwitch system by performing Western blot analysis of dissected larval brains (Figure 2 – figure supplement 1B). Acute glial knockdown using the repo-GeneSwitch driver resulted in a 30% reduction in Shv levels, similar to the decrease observed with the repo-GAL4 driver, suggesting that the GeneSwitch driver is functional. Furthermore, knockdown of shv by the ubiquitous tubulin-GAL4 driver completely eliminated Shv protein, indicating that the RNAi construct is effective.  

      Reviewer #2 (Recommendations for the authors):

      (1) General comment on statistics/data presentation: The authors employ an unusual method of using both one-way ANOVA and multiple t-test stats for the same data. Would a 2-way ANOVA be the more appropriate solution to this problem (to analyze across genotype and stimulation condition)? Also a chart in the supplementals showing all comparisons rather than just the fraction explicitly reported in the graphs would be helpful (it is not clear if no indication on significance indicates no difference or just not reported between some of the baseline levels, especially since everything is presented as ratios and in some cases this could help with data interpretation of which baseline levels are different and how they compare to other baselines and other post-stim levels). Further, there are no sample sizes given for any experiment, nor are any values of means, SD, etc ever explicitly given.

      We appreciate the thoughtful suggestion. While a two-way ANOVA could be used to examine interaction effects between genotype and stimulation condition, our analysis was designed to address a specific biological question: whether each genotype, independent of baseline levels, is capable of undergoing activitydependent synaptic remodeling. To this end, we used t-tests to directly compare unstimulated vs. stimulated conditions within each genotype, allowing us to determine whether stimulation produces a significant effect in an all-or-none manner. In parallel, we applied one-way ANOVA with post hoc tests to analyze differences among baseline (unstimulated) conditions across genotypes. This approach is justified by the fact that stimulation was applied acutely and separately, and therefore the baseline values should not be influenced by the stimulated condition. Because we were not aiming to compare the extent of synaptic remodeling between genotypes, we did not use a two-way ANOVA to analyze interaction effects across all conditions.

      In response to the reviewer’s suggestion, we have now added the sample number in the graphs. Additionally, in the Methods section, we include information that each sample represents biological repeats, and that data are presented as fold-change relative to unstimulated controls from the same experimental batch. This normalization is necessary, as absolute GluR intensities can vary depending on microscope settings and staining conditions.

      (2) To clarify distinct roles of Shv coming from neurons vs glia it would help if the authors could include more data on the rescue of shv mutants with UAS-Shv in neurons alone. This data is never shown in the manuscript and data on what effect this rescue has on the pertinent phenotypes in this paper (bouton size and GluR staining) is not reported in the referred to 2017 paper. What this does and does not do for these phenotypes has important implications for how to interpret the glia-only rescue findings.

      Thank you for the suggestion. We have now included new data on neuronal Shv rescue in shv<sup>1</sup> mutants as suggested (updated Figure 4A). Consistent with previous findings that neuronal Shv rescues integrin signaling and electrophysiological phenotypes (Lee et al., 2017), we found that it also restores bouton size, GluR levels, and activity-induced synaptic remodeling. These results support the functional contribution of neuronal Shv. 

      (3) Figure 1C: Where are the images in the periphery taken? The morphology of the glia is odd in that "blobs" of glial membrane seemingly unattached to anything else are floating about? Perhaps these are a thin stack projection and so the connection to the main glia "stalks" are just cut off? Could a specific individual synapse be shown? Also consider HRP shown on its own so that where the actual boutons are could be more clear. It seems like both the Tomato and HRP channels are really overexposed making visualizing the morphology quite confusing. Also why not use the antibody against Shv to directly visualize expression which is more direct than a knock-in tagged version?

      Figure 1C shows a single optical slice of the NMJ at muscle segment 2, selected to clearly highlight Shv-eGFP localization at a branch in close contact with the glial membrane. The glial stalk is not visible in this image because it lies in a different focal plane from the branch of interest. We have now specified this information in the figure legend. In the original figure, the HRP signal (405 channel) was oversaturated, which interfered with visual clarity. In the updated Figure 1C, we reduced the intensity of overexposed channels to better reveal the weak ShveGFP signal and fine glial processes. While we have generated an antibody against Shv, the amount is extremely limited, and hence the Shv-eGFP fusion serves as a valuable tool for visualizing subcellular localization.

      (4) Do glutamate levels really rise in glia Shv KD? Although iGluSnFR signal changes could it be the high level of GluR at the synapse acting as sponges to sequester glutamate so that it can't stimulate the sensor as well? One way to test this would be to overexpress or KD GluRs in muscle in wildtype (or in the repo>Shv RNAi background) to see if that alone can modulate iGluSnfR signals?

      Thank you for suggesting this important control. To address the question of whether high level GluR density alone could influence neuronal iGluSnFR sensor readouts, we expressed GluR using a mhc promoter-driven GluRIIA fusion line, which increases total GluRIIA expression in muscle independently of the Gal4/UAS system. As shown in Figure 6 – figure supplement 1, mhc-GluRIIA animals exhibited elevated levels of not only GluRIIA but also the obligatory GluRIIC subunit. Despite this increase in GluR expression, we did not observe any change in extracellular glutamate levels, as measured by live imaging using the neuronal iGluSnFR sensor (updated Figure 6A). These results suggest that elevated GluR density alone does not alter iGluSnFR sensors  dynamics and further support our conclusions.

      (5) The authors have some Shv constructs that can't be secreted or can't bind to integrins. Performing cell type specific rescues with these constructs might also help distinguish how source matters for each proposed sub-function of Shv though this may be outside the scope of this study. 

      Thank you for noticing the Shv constructs we have. We hope to further test subfunctions of Shv in the future.

      (6) At one point the authors discuss experiments that measure how much Shv is released by glia during neuronal stimulation. Then state that "These data indicate that glial Shv does not directly inhibit integrin signaling." But how this experiment relates to integrin signaling is not explained and unclear.

      We apologize for the confusion. We have now updated the text to better explain our logic: “This activity-induced decrease in glial Shv levels, along with reduced integrin activation (Fig. 5B), suggest that glial Shv does not act by directly inhibiting integrin signaling.”

      Reviewer #3 (Recommendations for the authors):

      Minor comments

      (1) Readers are left wondering what causes the increased baseline of GluR after glial shv RNAi at Fig 1, which is addressed much later. It would be helpful to preemptively mention this.

      Thank you for the suggestion. To maintain a logical flow, we chose to first present the phenotypic data in Figures 1 and 2 and then return to the mechanistic explanation once we introduced ambient glutamate measurements. 

      (2) Be consistent with eGFP vs EGFP.

      Thank you, we have corrected the inconsistencies.  

      (3) Scale bar for Fig 1B is missing in the low-magnification panel.

      Thank you for pointing out. We’ve put in the scale bar for Figure 1B.   

      (4) Fig 1C, it would be helpful to elaborate on the anatomy. For example, what NMJ/abdominal segment is this? Why only some axons are surrounded by glia?

      Figure 1C presents a single optical slice of the NMJ at muscle segment 2, chosen to highlight Shv-eGFP localization at a branch closely juxtaposed to the glial membrane. The glial stalk is not shown in this image because it resides in a different focal plane than the branch being visualized. We have now included this information in the figure legend.

      (5) For Fig 3B, while it is stated that "we observed normal synaptic remodeling using alrmGAL4," the effect size is smaller. There seems to be a decrease in the amount of synaptic remodeling occurring?

      Thank you for pointing this out. Our primary goal was to determine whether each genotype, regardless of baseline GluR levels, is capable of undergoing activitydependent synaptic remodeling in response to stimulation. For this reason, we focused on detecting the presence or absence of remodeling rather than comparing the extent of remodeling across genotypes. While a smaller effect on activity-induced bouton size was observed with alrm-GAL4, the change was still statistically significant, indicating that remodeling does occur in this genotype. Currently, we do not have a clear biological interpretation for differences in the magnitude of remodeling, and therefore chose not to emphasize cross-genotype comparisons.

    1. Author response:

      Reviewer #1 (Public review):

      Major Concerns:

      (1) Lack of Direct Evidence for RadD-NKp46 Interaction

      The central claim that RadD interacts with NKp46 is not formally demonstrated. A direct binding assay (e.g., Biacore, ELISA, or pull-down with purified proteins) is essential to support this assertion. The absence of this fundamental experiment weakens the mechanistic conclusions of the study.

      The reviewer is correct. Direct assays are currently quite impossible because RadD is huge protein and it will take years to purify it. Instead, we used immunoprecipitation assays using NKp46-Ig (Author response images 1 and 2). Fusobacteria were lysed using RIPA buffer, and the lysates were centrifuged twice to separate the supernatant from the pellet (which contains the bacterial membranes). The resulting lysates were incubated overnight with 2.5 µg of purified NKp46 and protein G-beads. After thorough washing, the bound proteins were placed in sample buffer and heated at 95 °C for 8 minutes. The eluates were run on a 10% acrylamide gel and visualized by Coomassie blue staining. As can be seen the NKp46-Ig was able to precipitate protein band around 350Kd in both F. polymorphum ATCC10953 (Author response image 1) and in F. nucleatum ATCC23726 (Author response image 2).

      Author response image 1. NKp46 immunoprecipitation with Fusobacterium polymorphum (ATCC 10953) lysates. The resulting lysates of supernatant and pellet of Fusobacterium were immunoprecipitated (IP) with 2.5 μg of control fusion protein (RBD-Ig) or with NKp46-Ig. A 2.5 μg of purified fusion proteins were also run on gel.

      Author response image 2. NKp46 immunoprecipitation with Fusobacterium nucleatum (ATCC 23726) lysates. The resulting lysates of supernatant and pellet of Fusobacterium were immunoprecipitated (IP) with 2.5 μg of Control fusion protein (RBD-Ig) or with NKp46-Ig. 2.5 μg of purified fusion proteins were also run on gel.

      (2) Figure 2: Binding Specificity and Bacterial Strains

      A CEACAM1-Ig control should be included in all binding experiments to distinguish between specific and non-specific Ig interactions. There is differential Ig binding between strains ATCC 23726 and 10953. The authors should quantify RadD expression in each strain to determine if the difference in binding is due to variation in RadD levels.

      No significant difference in mCEACAM-1-Ig binding was observed across multiple independent experiments. Author response image 3 shows a representative histogram showing mCEACAM-1-Ig binding to F. nucleatum ATCC 23726 and F. polymorphum ATCC 10953. Comparable binding levels were detected in both bacterial species (upper histogram). Similarly, NKp46-Ig and Ncr1-Ig fusion proteins exhibited comparable binding patterns (lower histogram). It is currently not possible to quantify RadD expression directly, as no anti-RadD antibody is available.

      Author response image 3. CEACAM-1 Ig binding to Fusobacterium ATCC 23726 and ATCC 10953. Upper histograms show staining with secondary antibody alone (gray) compared to CEACAM-1 Ig (black line). Lower histograms show binding of NKp46 and Ncr1 fusion proteins to the two Fusobacterium strains. Gray represent secondary antibody controls.

      (3) Figure 3: Flow Cytometry Inconsistencies and Missing Controls

      What do the FITC-negative, Ig-negative events represent? The authors should clarify whether these are background signals, bacterial aggregates, or debris.

      We now present the gating strategy used in these experiments (Author response image 4). Fusion negative Ig samples were the bacterial samples stained only with the secondary antibody APC (anti-human AF647). The TITC-negative represent unlabeled bacteria.

      Author response image 4. Gating strategy for FITC-labeled Fusobacterium stained with fusion proteins. Bacteria were first gated as shown in the left panel. The gated population was then further analyzed in the right plot: the lower-left quadrant represents bacterial debris, the upper-left quadrant corresponds to FITC-stained bacteria only, and the upper-right quadrant shows bacteria double-positive for FITC and APC, indicating binding of the fusion proteins.

      Panel B, CEACAM1-Ig binding appears markedly increased compared to WT bacteria. The reason for this enhancement should be discussed-does it reflect upregulation of the bacterial ligand or an artifact of overexpression? Fluorescence compensation should be carefully reviewed for the NKp46/NCR1-Ig binding assays to ensure that the signals are not due to spectral overlap or nonspecific binding. Importantly, binding experiments using the FadI/RadD double knockout strain are missing and should be included. This control is essential.

      We don’t know why expression of CEACAM1-Ig binding is increased. Indeed, it will be nice to have the FadI/RadD double knockout strain which we currently don’t have.

      In Panel E, the basis for calculating fold-change in MFI is unclear. Please indicate the reference condition to which the change is normalized.

      The mean fluorescence intensity (MFI) fold change was calculated by dividing the MFI obtained from staining with the fusion proteins by the MFI of the corresponding secondary antibody control (bacteria incubated without fusion proteins).

      (4) Figure 4: Binding Inhibition and Receptor Sensitivity

      Panel A lacks representative FACS plots and is currently difficult to interpret.

      Fusobacteria binding to CEACAM-1, NKp46, and NCR1 fusion proteins was tested in the presence of 5 and 10 mM L-arginine (Author response image 5). L-arginine inhibited the binding of NKp46-Ig and NCR1-Ig, whereas no effect was observed on CEACAM-1-Ig binding.

      Author response image 5. Fusobacterium binding inhibition by L-Arginine. The figure shows the binding of CEACAM1-Ig (left panel), NKp46-Ig (middle panel), and Ncr1-Ig (right panel) in the presence of 0 mM (black), 5 mM (red), and 10 mM (blue) L-arginine.

      Differences in the sensitivity of human vs. mouse NKp46 to arginine inhibition should be discussed, given species differences in receptor-ligand interactions.

      Ncr1, the murine orthologue of human NKp46, shares approximately 58% sequence identity with its human counterpart (1). The observed differences in arginine-mediated inhibition of bacterial binding between mouse and human NKp46 might stem from structural differences or distinct posttranslational modifications, such as glycosylation. Indeed, prediction algorithms combined with high-performance liquid chromatography analysis revealed that Ncr1 possesses two putative novel O-glycosylation sites, of which only one is conserved in humans (2).

      References

      (1) Biassoni R., Pessino A., Bottino C., Pende D., Moretta L., Moretta A. The murine homologue of the human NKp46, a triggering receptor involved in the induction of natural cytotoxicity. Eur J Immunol. 1999 Mar; 29(3).

      (2) Glasner A., Roth Z., Varvak A., Miletic A., Isaacson B., Bar-On Y., Jonjić S., Khalaila I., Mandelboim O. Identification of putative novel O-glycosylations in the NK killer receptor Ncr1 essential for its activity. Cell Discov. 2015 Dec 22; 1:15036.

      What are the inhibition results using F. nucleatum strains deficient in FadI?

      The inhibition pattern observed in the F. nucleatum ΔFadI mutant was comparable to that of the wild-type strain (Author response image 6). When cultured under identical conditions and exposed to increasing concentrations of arginine (0, 5, and 10 mM), the F. nucleatum ΔFadI strain also demonstrated a dose-dependent reduction in binding to NKp46 and Ncr1.

      Author response image 6. Arginine inhibition of NKp46-Ig and Ncr1-Ig binding in F. nucleatum ΔFadI. Histograms show NKp46-Ig (A, C) and Ncr1-Ig (B, D) binding to F. nucleatum ATCC10953 ΔFadI (A and B) and to F. nucleatum ATCC23726 ΔFadI (A and B) following exposure to 5 mM and 10 mM L-Arginine. Panels (E) and (F) display the mean fluorescence intensity (MFI) quantification corresponding to (A and B) and (C and D), respectively.

      In Panel B, CEACAM1-Ig and RadD-deficient bacteria must be included as negative controls for binding specificity upon anti-NKp46 blocking.

      We appreciate the request to include CEACAM1-Ig and RadD-deficient bacteria as negative controls for specificity under anti-NKp46 blocking. We don’t not think it is necessary since the 02 antibody is specific for NKp46, we used other anti0NKp46 antibodies that did not block the interaction and an irrelevant antibofy, we showed that arginine produced a dose-dependent reduction in NKp46/Ncr1 binding, consistent with an arginine-inhibitable RadD interaction already shown in our manuscript (Fig. 4A). The ΔRadD strains we used already demonstrate loss of NKp46/Ncr1 binding and loss of NK-boosting activity (Figs. 3, 5). Collectively, these data establish that NKp46/Ncr1 recognition of a high-molecular-weight ligand consistent with RadD is specific and functionally relevant.

      Figure 5: Functional NK Activation and Tumor Killing

      In Panels B and C, the key control condition (NK cells + anti-NKp46, without bacteria) is missing. This is needed to evaluate if NKp46 recognition is involved in tumor killing. The authors should explicitly test whether pre-incubation of NK cells with bacteria enhances their anti-tumor activity.

      No significant difference in NK cell cytotoxicity was observed between untreated NK cells and NK cells incubated with anti-NKp46 antibody in the absence of bacteria. Therefore, the NK + anti-NKp46 (O2) group was included as an additional control alongside the other experimental conditions shown in Figures 5b and 5c, and is presented in Author response image 7 below.

      Author response image 7. NK cytotoxicity against breast cancer cell lines. NK cell cytotoxicity against T47D (left) and MCF7 (right) breast cancer cell lines. This experiment follows the format of Figure 5b and 5c, with the addition of the NK cells + O2 antibody group. No significant differences were observed when values were normalized to NK cells alone.

      Could bacteria induce stress signals in tumor cells that sensitize them to NK killing? This distinction is critical.

      It remains unclear whether the bacteria induce stress-related signals in tumor cells that render them more susceptible to NK cell–mediated cytotoxicity.

      (6) Figure 5D: Mechanism of Peripheral Activation

      It is suggested that contact between bacteria and NK cells in the periphery leads to their activation. Can the authors confirm whether this pre-activation leads to enhanced killing of tumor targets, or if bacteria-tumor co-localization is required? The literature indicates that F. nucleatum localizes intracellularly within tumor cells. If so, how is RadD accessible to NKp46 on infiltrating NK cells?

      We do not expect that pre-activation of NK cells with bacteria would enhance their tumor-killing capacity. In fact, when NK cells were co-incubated with bacteria, we occasionally observed NK cell death. Although F. nucleatum can reside intracellularly, bacterial entry requires prior adhesion to tumor cells. At this stage—before internalization—the bacteria are accessible for recognition and binding by NK cells.

      (8) Figure 5E and In Vivo Relevance

      Surprisingly, F. nucleatum infection is associated with increased tumor burden. Does this reflect an immunosuppressive effect? Are NK cells inhibited or exhausted in infected mice (TGIT, SIGLEC7...)? If NK cell activation leads to reduced tumor control in the infected context, the role of RadD-induced activation needs further explanation. RadD-deficient bacteria, which do not activate NK cells, result in even poorer tumor control. This paradox needs to be addressed: how can NK activation impair tumor control while its absence also reduces tumor control?

      Siglec-7 lacks a direct orthologue in mice, and neither mouse TIGIT nor CEACAM1 bind F. nucleatum. The increased tumor burden observed in infected mice may therefore result from bacterial interference with immune cell infiltration and accumulation within the tumor microenvironment (Parhi, L., Alon-Maimon, T., Sol, A. et al. Breast cancer colonization by Fusobacterium nucleatum accelerates tumor growth and metastatic progression. Nat Commun 11, 3259 (2020)). Consequently, the NK cells that do reach the tumor site can recognize and kill F. nucleatum–bearing tumor cells through RadD–NKp46 interactions. In the absence of RadD, this recognition is impaired, leading to reduced NK-mediated cytotoxicity and increased tumor growth.

      (9) NKp46-Deficient Mice: Inconsistencies

      In Ncr1⁻/⁻ mice, infection with WT or RadD-deficient F. nucleatum has no impact on tumor burden. This suggests that NKp46 is dispensable in this context and casts doubt on the physiological relevance of the proposed mechanism. This contradiction should be discussed more thoroughly.

      Ncr1 is also directly involved in mediating NK cell–dependent killing of tumor cells, even in the absence of bacterial infection. Therefore, in Ncr1-deficient mice, F. nucleatum has no additional effect on tumor progression (Glasner, A., Ghadially, H., Gur, C., Stanietsky, N., Tsukerman, P., Enk, J., Mandelboim, O. Recognition and prevention of tumor metastasis by the NK receptor NKp46/NCR1. J Immunol. 2012).

      Reviewer #2 (Public review):

      Weaknesses:

      (1) A previous study by this group (PMID: 38952680) demonstrated that RadD of F. nucleatum binds to NK cells via Siglec-7, thereby diminishing their cytotoxic potential. They further proposed that the RadD-Siglec-7 interaction could act as an immune evasion mechanism exploited by tumor cells. In contrast, the present study reports that RadD of F. nucleatum can also bind to the activating receptor NKp46 on NK cells, thereby enhancing their cytotoxic function.

      Siglec-7 lacks a direct orthologue in mice, and neither mouse TIGIT nor CEACAM1 bind F. nucleatum. In contrast, NKp46 and its murine homologue, Ncr1, both recognize and bind the bacterium.

      While F. nucleatum-mediated tumor progression has been documented in breast and colon cancers, the current study proposes an NK-activating role for F. nucleatum in HNSC. However, it remains unclear whether tumor-infiltrating NK cells in HNSC exhibit differential expression of NKp46 compared to Siglec-7. Furthermore, heterogeneity within the NK cell compartment, particularly in the relative abundance of NKp46⁺ versus Siglec-7⁺ subsets, may differ substantially among breast, colon, and HNSC tumors. Such differences could have been readily investigated using publicly available single-cell datasets. A deeper understanding of this subset heterogeneity in NK cells would better explain why F. nucleatum is passively associated with a favorable prognosis in HNSC but correlates with poor outcomes in breast and colon cancers.

      Currently, there are no publicly available single-cell datasets suitable for characterizing NK cell heterogeneity in the context of F. nucleatum infection—particularly regarding the expression of Siglec-7, NKp46, or CEACAM1 and their potential association with poor clinical outcomes in breast, head and neck squamous cell carcinoma (HNSC), or colorectal cancer (CRC). Furthermore, no RNA-seq datasets are available for breast cancer cases specifically associated with F. nucleatum infection and poor prognosis. Therefore, we analyzed bulk RNA expression datasets for Siglec-7 and CEACAM1 and evaluated their associations with HNSC and CRC using the same patient databases utilized in our manuscript (Author response image 8). No significant differences in Siglec-7 expression were detected between HNSC and CRC samples (Author response image 8A). Although CEACAM1 mRNA levels did not differ between F. nucleatum–positive and –negative cases within either cancer type, its overall expression was higher in CRC compared to HNSC (Author response image 8B).

      Author response image 8. Siglec7 and Ceacam1 expression and the prognostic effect of F. nucleatum in a tumor-type-specific manner. Comparison of Siglec7 (A) and Ceacam1 (B) expression across HNSC and CRC tumors. Log₂ expression levels of NKp46 mRNA were compared across HNSC and CRC cohorts, stratified by F. nucleatum positive and negative. Results were analyzed by one-way ANOVA with Bonferroni post hoc correction.

      (2) The in vivo tumor data (Figure 5D-F) appear to contradict the authors' claims. Specifically, Figure 5E suggests that WT mice engrafted with AT3 breast tumors and inoculated with WT F. nucleatum exhibited an even greater tumor burden compared to mice not inoculated with F. nucleatum, indicating a tumor-promoting effect. This finding conflicts with the interpretation presented in both the results and discussion sections.

      Siglec-7 lacks a direct orthologue in mice, and neither mouse TIGIT nor CEACAM1 bind F. nucleatum. The increased tumor burden observed in infected mice may therefore result from bacterial interference with immune cell infiltration and accumulation within the tumor microenvironment (Parhi, L., Alon-Maimon, T., Sol, A. et al. Breast cancer colonization by Fusobacterium nucleatum accelerates tumor growth and metastatic progression. Nat Commun 11, 3259 (2020)). Consequently, the NK cells that do reach the tumor site can recognize and kill F. nucleatum–bearing tumor cells through RadD–NKp46 interactions. In the absence of RadD, this recognition is impaired, leading to reduced NK-mediated cytotoxicity and increased tumor growth.

      (3) Although the authors acknowledge that F. nucleatum may have tumor context-specific roles in regulating NK cell responses, it is unclear why they chose a breast cancer model in which F. nucleatum has been reported to promote tumor growth. A more appropriate choice would have been the well-established preclinical oral cancer model, such as the 4-nitroquinoline 1-oxide (4NQO)-induced oral cancer model in C57BL/6 mice, which would more directly relate to HNSC biology.

      The tumor model we employed is, to date, the only model in which F. nucleatum has been shown to exert a measurable effect, which is why we selected it for our study (Parhi, L., Alon-Maimon, T., Sol, A. et al. Breast cancer colonization by Fusobacterium nucleatum accelerates tumor growth and metastatic progression. Nat Commun. 2020; 11: 3259). We have not tested the 4-nitroquinoline-1-oxide (4NQO)–induced oral cancer model, and we are uncertain whether its use would be ethically justified.

      (4) Since RadD of F. nucleatum can bind to both Siglec-7 and NKp46 on NK cells, exerting opposing functional effects, the expression profiles of both receptors on intratumoral NK cells should be evaluated. This would clarify the balance between activating and inhibitory signals in the tumor microenvironment and provide a more mechanistic explanation for the observed tumor context-dependent outcomes.

      This question was answered in Author response image 8 above.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      This is an interesting study on the role of FGF signaling in the induction of primitive streak-like cells (PS-LC) in human 2D-gastruloids. The authors use a previously characterized standard culture that generates a ring of PSLCs (TBXT+) and correlate this with pERK staining. A requirement for FGF signaling in TBXT induction is demonstrated via pharmacological inhibition of MEK and FGFR activity. A second set of culture conditions (with no exogenous FGFs) suggests that endogenous FGFs are required for pERK and TBXT induction. The authors then characterize, via scRNA-seq, various components of the FGF pathway (genes for ligands, receptors, ERK regulators, and HSPG regulation). They go on to characterize the pFGFR1, receptor isoforms, and polarized localization of this receptor. Finally, they perform FGF4 inhibition and use a cell line with a limited FGF17 inactivation (heterozygous null) and show that loss of these FGFs reduces PS-LC and derivative cell types. 

      Strengths: 

      (1) As the authors point out, the role of FGF signaling in gastrulation is less well understood than other signaling pathways. Hence this is a valuable contribution to that field. 

      (2) The FGF4 and FGF17 loss-of-function experiments in Figure 5 are very intriguing. This is especially so given the intriguing observation that these FGFs appear to be dominating in this model of human gastrulation, in contrast to what FGFs dominate in mice, chicks, and frogs. 

      (3) In general this paper is valuable as a further development of the Human gastruloid system and the role of FGF signaling in the induction of PS-CLs. The wide net that the authors cast in characterizing the FGF ligand gene, receptor isoforms, and downstream components provides a foundation for future work. As the authors write near the beginning of the Discussion "Many questions remain." 

      We thank the reviewer for these positive comments.

      Weaknesses: 

      (1) FGFs are cell survival factors in various aspects of development. The authors fail to address cell death due to loss of FGF signaling in their experiments. For example, in Figure 1E (which requires statistical analysis) and 1G (the bottom FGFRi row), there appears to be a significant amount of cell loss. Is this due to cell death? The authors should address the question of whether the role of FGF/ERK signaling is to keep the cells alive. 

      Indeed, FGF also strongly affects cell survival and it is an interesting question to what extent this depends on ERK. Our manuscript focuses instead on the role of FGF/ERK signaling in cell fate patterning. As mentioned in our discussion, figure 1de show that doxycycline induced pERK leads to more TBXT+ cells than the control without restoring cell number, suggesting the role of FGF in controlling cell number is independent of the requirement for FGF/ERK in PS-LC differrentiation. To further support this, we have added data showing low doses of MEKi are sufficient to inhibit differentiation without affecting cell number (Supp. Fig. 1i).

      To address the reviewers question regarding the cause of cell loss, we now stained for BrdU and cleaved Cas3 to assess proliferation and apoptosis in the presence and absence of MEK and FGFR inhibition (new Supp. Fig.

      1ef). This shows that the effect of these inhibitors on cell number is primarily due to a reduction in proliferation. We have also included statistical analysis in Fig.1e. 

      (2) Regarding the sparse cells in 1G, is there a reduction in cell number only with FGFRi and not MEKi? Is this reproducible? Gattiglio et al (Development, 2023, PMID: 37530863) present data supporting a "community effect" in the FGF-induced mesoderm differentiation of mouse embryonic stem cells. Could a community effect be at play in this human system (especially given the images in the bottom row of 1G)? If the authors don't address this experimentally they should at least address the ideas in Gattoglio et al. 

      Indeed, FGFRi reproducibly affects cell number more than MEKi, in line with the fact that pathways other than MAPK/ERK downstream of FGF (e.g. PI3K) play important roles in cell survival and growth. However, we think the lack of differentiation in MEKi and FGFRi in Fig.1g cannot be attributed to a loss of cells combined with a community effect. This is because without FGFRi or MEKi cells efficiently differentiate to primitive streak at much lower densities than those originally shown, consistent with the data we discuss in response to (1) arguing against a primarily indirect effect of FGF on PS-LC differentiation through cell density. In the context of directed differentiation (rather than 2D gastruloids), we have now shown in a controlled manner that the effect of MEKi and FGFRi does not depend on a community effect by repeating the experiment in Fig.1g while adjusting cell seeding densities to obtain similar final cell densities in all three conditions (new Fig.1g, new Supp Fig.1g). Furthermore we have included new data showing extremely sparse cells without MEKi or FGFRi still differentiate without problems (new Supp Fig 1h). We have also include Gattoglio et al in our revised discussion.

      (3) Do the FGF4 and FGF17 LOF experiments in Figure 5 affect cell numbers like FGFRi in Figure 1? 

      We did not observe major changes in cell number in the FGF4 and FGF17 loss of function experiments. This is in line with our observation that low levels of ERK signaling are sufficient to maintain proliferation (new Supp. Fig. 1i), and the fact that low levels of ERK signaling are maintained in the absence of FGF4 and FGF17 (Fig.5), likely by FGF2 (Fig. 2). In contrast, FGFRi treatment in Fig.1 leads to a nearly complete loss of FGF signaling (ERK and other pathways) that has a dramatic effect on cell number.

      Why examine PS-LC induction only in FGF17 heterozygous cells and not homozygous FGF17 nulls? 

      We were unable to obtain homozygous FGF17 nulls, it is not clear if there is a reason for this. In the absence of homozygous nulls, we have now further corroborated our findings with additional knockdown data (described in response to other comments below).

      (4) The idea that FGF8 plays a dominant role during gastrulation of other species but not humans is so intriguing it warrants deeper testing. The authors dismiss FGF8 because its mRNA "...levels always remained low." (line 363) as well as the data published in Zhai et al (PMID: 36517595) and Tyser et al (PMID: 34789876). But there are cases in mouse development where a gene was expressed at levels so low, that it might be dismissed, and yet LOF experiments revealed it played a role or even was required in a developmental process. The authors should consider FGF8 inhibition or inactivation to explore its potential role, despite its low levels of expression. 

      We thank the reviewer for this suggestion. We have now analyzed the role of FGF8 using FISH to visualize its expression and siRNA to understand its function (Fig.5d,f,h; Supp.Fig.5e,g,6e). We found that FGF8 expression is higher earlier in differentiation, preceding most expression of TBXT. Our scRNA-seq only analyzed samples at 42h so did not capture this. Furthermore, FGF8 expression localized inside the PS-like ring rather than coinciding with it like FGF4. Surprisingly, FGF8 knockdown led to an increase in primitive streak-like differentiation, suggesting it may counteract FGF4. The results are shown in the revised Fig. 5 and Supplemental Fig. 5. While this certainly merits further investigation, understanding the role of FGF8 in more detail is beyond the scope of the current work. 

      (5) Redundancy is a common feature in FGF genetics. What is the effect of inhibiting FGF4 in FGF17 LOF cells? 

      Further siRNA and shRNA experiments showed that FGF17 knockdown had a much smaller effect than FGF4 knockdown on expression of primitive streak markers (Fig.5i, Supp.Fig.6f-i) but that FGF17 knockdown did lead to a complete loss of the mesoderm marker TBX6 (Fig.5j, Supp.Fig.6j). A double knockdown of FGF4+FGF17 looked similar to FGF4 alone (Supp.Fig.6k). Thus, we now think the more likely scenario is that FGF17 is downstream of FGF4-dependent PS-differentiation and although this may have a positive feedback effect whereby this FGF17 can then enhance further PS-differentiation, which we previously interpreted as partial redundancy, the primary role of FGF17 may be later, in mesoderm differentiation.

      (6) I suggest stating that the authors take more caution in describing FGF gradients. For example, in one Results heading they write "Endogenous FGF4 and FGF17 gradients underly the ERK activity pattern.", implying an FGF protein gradient. However, they only present data for FGF mRNA , not protein. This issue would be clarified if they used proper nomenclature for gene, mRNA (italics), and protein (no italics) throughout the paper. 

      Thank you for the suggestion. We have edited the paper to more clearly distinguish protein and mRNA. We do think our data provide substantial indirect evidence for a protein gradient which is what the results heading is meant to convey. Receptor activation is high where ERK activity is high (Fig.3), and receptor activation is limited by ligands, since creating a scratch to let exogenous FGF reach the basal side of cells in the center leads to receptor activation (Fig.4). This strongly suggests ERK activity reflects an FGF protein gradient. 

      Reviewer #2 (Public review): 

      Summary: 

      The role of FGFs in embryonic development and stem cell differentiation has remained unclear due to its complexity. In this study, the authors utilized a 2D human stem cell-based gastrulation model to investigate the functions of FGFs. They discovered that FGF-dependent ERK activity is closely linked to the emergence of primitive streak cells. Importantly, this 2D model effectively illustrates the spatial distribution of key signaling effectors and receptors by correlating these markers with cell fate markers, such as T and ISL1. Through inhibition and loss-of-function studies, they further corroborated the needs of FGF ligands. Their data shows that FGFR1 is the primary receptor, and FGF2/4/17 are the key ligands for primitive streak development, which aligns with observations in primate embryos. Additional experiments revealed that the reduction of FGF4 and FGF17 decreases ERK activity. 

      Strengths: 

      This study provides comprehensive data and improves our understanding of the role of FGF signaling in primate

      primitive streak formation. The authors provide new insights related to the spatial localization of the key components of FGF signaling and attempt to reveal the temporal dynamics of the signal propagation and cell fate decision, which has been challenging. 

      Weaknesses: 

      Given the solid data, the work only partially clarifies the complex picture of FGF signaling, so details remain somewhat elusive. The findings lack a strong punchline, which may limit their broader impact. 

      We thank this reviewer for their valuable feedback and compliment on the solidity of our data. The punchline of our work is that FGF4 and FGF17-dependent ERK signaling plays a key role in differentiation of human PS-like cells and mesoderm, and that these are different FGFs than those thought to drive mouse gastrulation. A second key point is that like BMP and TGFβ signaling, FGF signaling is restricted to the basolateral sides of pluripotent stem cell colonies due to polarized receptor expression, which is crucial for understanding the response to exogenous ligands added to the cell medium. Indeed, many facets of FGF signaling remain to be investigated in the future, such as how FGF regulates and is regulated by other signals, which we will dedicate a different manuscript to. 

      Reviewer #3 (Public review): 

      Jo and colleagues set out to investigate the origins and functions of localized FGF/ERK signaling for the differentiation and spatial patterning of primitive streak fates of human embryonic stem cells in a well-established micropattern system. They demonstrate that endogenous FGF signaling is required for ERK activation in a ringdomain in the micropatterns, and that this localized signaling is directly required for differentiation and spatial patterning of specific cell types. Through high-resolution microscopy and transwell assays, they show that cells receive FGF signals through basally localized receptors. Finally, the authors find that there is a requirement for exogenous FGF2 to initiate primitive streak-like differentiation, but endogenous FGFs, especially FGF4 and FGF17, fully take over at later stages. 

      Even though some of the authors' findings - such as the localized expression of FGF ligands during gastrulation and the importance of FGF/ERK signaling for cell differentiation in the primitive streak - have been reported in model organisms before, this is one of the first studies to investigate the role of FGF signaling during primitive streak-like differentiation of human cells. In doing so, the paper reports a number of interesting and valuable observations, namely the basal localization of FGF receptors which mirrors that of BMP and Nodal receptors, as well as the existence of a positive feedback loop centered on FGF signaling that drives primitive-streak differentiation. The authors also perform a comparison of the role of different FGFs across species and try to assign specific functions to individual FGFs. In the absence of clean genetic loss-of-function cell lines, this part of the work remains less strong. 

      We thank the reviewer for emphasizing the value of our findings in a human model for gastrulation. We agree more loss-of-function experiments would provide further insight into the role of different FGFs. While we did not manage to create knockout cell lines, we have now performed both siRNA and shRNA knock-down of all FGF4, and FGF17 in two different hPSC lines, performed siRNA knockdown of FGF8, and also made a FGF4+FGF17 shRNA double knockdown cell lines to more completely test the functions of the individual FGFs (Fig.5, Supp.Fig.5,6). Our data suggest FGF17 may be downstream of FGF4 and primarily required for mesoderm differentiation while FGF8 appears to counteract FGF4. In doing this we have added a large amount of new data to the manuscript and we have removed the heterozygous knockout data in the first version of the manuscript which we felt added little to the new data. Further experiments are still needed to solidify our interpretation but those are beyond the scope of the current work.   

      Reviewer #1 (Recommendations for the authors): 

      (1) FGF2 is added to culture experiments (e.g. Figure 4), but the commercial source is not mentioned in Methods. For example, it could be added to "Supplementary Table 1: Cell signaling reagents." 

      We apologize for this oversight and have now added the information to Supplementary Table 1.

      (2) Line 117-118: "For example, by controlling the expression of Wnt or Nodal which are both required for PS-like differentiation". It is clear what the authors mean, but this is not a complete sentence. 

      We edited this for clarity, it now reads: “First, is FGF/ERK signaling required directly for PS-like differentiation, or does it act indirectly? These possibilities are not mutually exclusive. For example, FGF/ERK could be required directly but also act indirectly by controlling Wnt or Nodal expression, as both Wnt and Nodal signaling are required for PS-like differentiation.”

      (3) Line 246 "...found its spatial pattern to strongly resembles that of pERK..." either remove "to" or change "resembles" to "resemble" 

      Thank you for catching this. We removed “to”.

      (4) Lines 391- 393 seem to be missing a word in the last phrase: "...with FGF17 more important continued differentiation to mesoderm and endoderm." Maybe "during" after the word "important"? 

      Thank you for catching this, indeed the word “during” was missing and we have now added it.

      (5) Please define acronyms in Figure 3D (PS-LC was defined previously, but not others). 

      We apologize for the oversight, we have now defined the acronyms.

      (6) The three blue lines in Figure 5B (right) are hard to discern (and I'm not colorblind). I suggest also using a variety of dotted lines in a subset of these FGFs. 

      Thanks you for the suggestion. We have now given all the FGFs colors that are more clearly distinct and made the TBXT and TBX6 lines dashed.  

      Reviewer #2 (Recommendations for the authors): 

      (1) The reviewer acknowledges that FGF signaling is complex, particularly when dynamics and its correlation with cell fates are considered. To improve the clarity of the findings, the authors are encouraged to provide an additional schematic figure that clearly delineates the main findings of this study.  

      Thank you for the suggestion. We have now added a summary figure (Fig.6) to our discussion, which we hope helps present our findings more clearly.

      (2) The data suggest that FGF signaling may function differently in mice compared to primates, and their stem cell model aligns more closely with the latter. While the authors discuss this in the contents only based on sequencing data, it would be valuable to conduct some experiments with mouse embryos to validate the key differences. 

      It is unclear to us which experiments the reviewer has in mind. There is ample data on FGF expression in the mouse literature, as are many knockout phenotypes. Furthermore, verifying loss of function phenotypes (e.g. FGF17 knockout) in mouse is beyond our expertise.

      (3) Heparan sulfate proteoglycan (HSPG) is mentioned as an important component of FGF signaling; however, the only data related to HSPG is single-cell sequencing results. The authors should consider performing immunostaining or other assays to validate HSPG expression and spatial distribution, similar to the approach they used for other signaling components. 

      Our scratch experiments in Fig. 4 strongly argue against HSPGs as being responsible for the spatial pattern of FGF receptor activation: after a scratch across the colony the response is strong all along the scratch as expected if presence of FGF (an FGF gradient) controls the level of activity. If HSPGs were limiting, FGF flowing in from the media show not be able to uniformly activate receptors around the scratch.

      In addtion, we have now included an immunostain for HS in a newly added Supp. Fig. 4 which does not explain the observed pattern of ERK signaling.

      (4) In the scratch experiment, particularly high PERK expression is observed at the edge of the scratch. The authors should provide an explanation for why this expression is significantly higher compared to the edges of the colony. Additionally, it would be interesting to investigate the fate of the cells with super high PERK expression.  

      We have now determined that adaptive response to FGF is the reason that the response around the scratch is initially much higher than in the ERK activity ring that overlaps with the primitive streak-like cells. We have added figures showing that although the intial response to FGF exposure after scratching is very high, the response around the scratch adapts to levels similar in those in the ERK ring over the course of 6 hours (Fig.4ij). 

      (5) For some of the key experiments, multiple cell lines should be used to ensure that the findings are reproducible and applicable across different human stem cell lines.

      We have now checked FISH stainings and knockdown phenotypes for different FGFs in two different cell lines: ESI17 (hESC, XX) and PGP1 (hiPSC, XY). These results are shown in Supplementary Figures 6. We found all results to be consistent.

      (6) Where applicable, the meaning of error bars needs to be more clearly presented, including details on the number of independent experiments or samples used. 

      Thank you for pointing this out. Where error bar definitions were missing we have now added them to the figure captions.

      Reviewer #3 (Recommendations for the authors): 

      (1) The authors only analyze the ppERK ring in micropatterns of a single size. What was the motivation for the choice of this size? Can the authors how the ppERK ring is expected to depend on colony size? 

      Much smaller patterns lose the interior pluripotent regions while much larger patters have a much larger pluripotent region, which requires larger tilings to image without providing additional insight. The colony sizedependence of cell fate patterning was described in the paper that established the 2D gastruloids model (Warmflash Nat Methods 2014) and we later showed this due to a fixed length scale of the BMP and Nodal signaling gradients from the colony edge (Jo et al Elife 2022). We have now included data showing that the ERK patterns behaves similarly, with a fixed length scale of the pattern implying that in smaller colonies the ERK ring becomes a disc and the entire center of the colony has high ERK signaling (Supp Fig 1a).

      (2) The scRNAseq is somewhat confusing - why do the two datasets not overlap in the PHATE representation? This is unexpected, because the two samples have been treated similarly, and the authors have integrated their data to iron out possible batch effects. This discrepancy should be discussed. The authors should also specify from which reference exactly the first dataset comes from.  

      The two datasets do overlap nicely, the same fates are well mixed in the same place and the gene expresison profiles for the integrated data (e.g., Fig.2e) look smooth, so we believe the integration is good, but different cell fates are represented to different degrees. In particular, sample 2 shows much more mesoderm differentiation making the mesoderm branch mostly orange. Occassionally samples differentiate faster or slower than average which we see here, and these samples were collected far apart in time. We do not believe this affects our conclusions, if anything, we think performing the analysis on two samples that differ this much should make the conclusions more robust.  

      (3) If find it intriguing that exogenous FGF2 is important early on for primitive streak-like differentiation, although the authors show that it does not reach the center of the colony. The authors may want to discuss this conundrum. Does the FGF2 effect propagate from the outside to the inside, or does it act at an early stage when the cells have not yet formed a tight epithelium on the micropattern? 

      The cells in the experiment in Fig. 5a were given 24h to epithelialize, so we we do believe it acts from the edge. We believe this may be due to FGF2 modulating the early BMP response on the edge and are working on a manuscript that further explores this pathway crosstalk.

      (4) The authors' statement that FGF4 and FGF17 have partially redundant functions is not very strong, mainly because the study lacks a full FGF17 loss-of-function cell line. If the authors wanted to improve on this point, they could knock down FGF4 in the FGF17 heterozygous line, or produce a homozygous FGF17 KO line. If there are specific reasons why FGF17 homozygous lines cannot be produced, this could be interesting to discuss, too. Finally, I noticed that the methods list experiments with an FGF17 siRNA, but these are not shown in the manuscript. 

      We agree our evidence was previously not as strong as it could be. While there is no reason we know of why homozygous knockout lines cannot be produced, we failed to produce on. To strengthen our evidence we have therefore included substantial new knockdown data.  We have now performed both siRNA and shRNA knockdown of all FGF4, and FGF17 in two different hPSC lines, performed siRNA knockdown of FGF8, and also made a FGF4+FGF17 shRNA double knockdown cell lines to more completely test the functions of the individual FGFs (Fig.5, Supp.Fig.5,6). These experiments showed that FGF17 knockdown had a much smaller effect than FGF4 knockdown on expression of primitive streak markers (Fig.5i, Supp.Fig.6f-i) but that FGF17 knockdown did lead to a complete loss of the mesoderm marker TBX6 (Fig.5j, Supp.Fig.6j). A double knockdown of FGF4+FGF17 looked similar to FGF4 alone (Supp.Fig.6k). Thus, we now think the more likely scenario is that FGF17 is downstream of FGF4-dependent PS-differentiation and although this may have a positive feedback effect whereby this FGF17 can then enhance further PS-differentiation, which we previously interpreted as partial redundancy, the primary role of FGF17 may be later, in mesoderm differentiation. Furthermore, our new data suggests FGF8 may counteract FGF4 and limit PS-like differentiation. 

      Minor 

      (5) Line 63: Reference(s) appear to be missing. 

      This whole paragraph summarizes the results of the references given on line 55, we have now repeated the relevant references where the reviewer indicated.

      (6) Supplementary Figure 1a,b does not show ppERK, unlike stated in lines 102 - 104. 

      Indeed, the data described in lines 102-104 is shown in Fig.1a and we have removed the original Supplementary Figure 1ab since it did not provide relevant information.

      (7) Line 201: It is not clear whether this is a new sequencing dataset, or if existing datasets have been reanalyzed. 

      We agree our description was unclear. We have edited the text, which now explicitly states that our analysis is based on one dataset we collected previously and a replicate that was newly collected and deposited on GEO for this manuscript.

      (8) Figure 2f; Supplementary Figure 2b, c: The colors need to be explained in scale bars. How has this data been normalized to allow for comparison between very different sample types? 

      We have now added color bars indicating the scale for each of these figure panels. As the caption stated, the interspecies comparison was normalized within each species, so the highest FGF level for any FGF at any time within each species is normalized to one. We are thus comparing between species the relative expression of different FGFs within each species. Indeed there is no good way to compare absolute expression between species. For extra clarity we have expanded our description of the interspecies comparison analysis and normalization in the methods section.

      (9) Line 232: Where is the expression of SEF shown? 

      It is shown in Fig. 2i, under the official gene name IL17RD.

      (10) Supplementary Figure 4 seems to be missing. 

      Thank you for pointing this out. We have now added a supplementary Fig.4.

      (11) Line 437: Citation needed. 

      We have included citations now.

      (12) Line 439: A similar feedback loop has been proposed to operate during mesoderm differentiation in mouse ESC (pmid: 37530863 ). The authors may consider citing this work. 

      Thank you for the suggestion, we have now included this work in the discussion. The feedback loop proposed in that work involves FGF8, while we were trying to explain why FGF4 and not FGF8 appears to be conserved across species by invoking an FGF4 feedback loop. Thus, it becomes even harder to explain differences in FGF4 and FGF8 expression between human and mouse gastrulation.

      (13) Supplementary Figure 6 is not described in the main text. 

      We have removed the original Supplementary Figure 6 and corresponding heterozygous knockout data in the main figure which we felt added little to the extensive knockdown data we now present. We did create a new Supplementary Figure 6 showing additional knockdown data which is described in the main tekst.

      (14) Submission of sequencing data to GEO needs to be updated. 

      We have now made the GEO data public.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      All experiments are convincing, clearly visualized and quantified. 

      The strength of the paper is that it clearly indicates that there are temporal controlled feedback systems which is important for endothelial collective cell behavior. 

      A limitation of the study is that the inhibitory studies in vivo may include off-target effects as well. Future endeavors, including specific knockout models, optogenetics and/or transgenic zebrafish lines that visualize endothelial cell properties (proliferation and migration) will be informative to track individual endothelial cell responses upon feedback signals.

      We agree with the reviewer and are currently conducting experiments with optogenetic tools, knockout models, and transgenic zebrafish lines to dissect the feedback loop dynamics at the cellular scale.    

      Reviewer #2 (Public review):

      Major strengths: The combination of in vitro and in vivo assessment provides evidence for timing in physiologically relevant contexts, and rigorous quantification of outputs is provided. The idea of defining temporal aspects of the system is quite interesting. New RNA profiling supports the model. 

      Weaknesses: Actinomycin D blocks most transcription so exposure for hours likely leads to secondary and tertiary effects and perhaps effects on viability.

      We agree with the reviewer that “off-target” effects are a limitation of the pharmacologic approach. We have also previously shown that long-term treatment with actinomycin D reduces ECFC survival (Mason et al., 2019). 

      Reviewer #3 (Public review):

      Strengths: The authors conduct ASPM assay to find the time scale of feedback when ECFCs attach to three different matrics. They observe the common time scale of feedback. Thus, under very specific conditions they use, the reproducibility is validated by their ASPM assay. The feedback loop mediated by inhibition of gene expression by Actinomycin D is similar to that obtained from YAP/TAZ-depleted cells, suggesting the mechanotranduction might be involved in the feedback loop. The time scale representing infection point might be interesting when considering the continuous motility in cultured endothelial cells, although it might not account for the migration of endothelial cells that is controlled by a wide variety of extracellular cues. In vivo, stiffness of extracellular matrix is merely one of the cues. 

      Weaknesses: ASPM assay is based on attachment-dependent phenomenon. The time scale including the inflection point determined by ASPM experiments using cultured cells and the mechanotransduction-based theory do not seem to fit in vivo ISV elongation. Although it is challenging to find the conserved theory of continuous cell motility of endothelial cells, the data is preliminary and does not support the authors' claim. There is no evidence that mechanotransduction solely determines the feedback loop during elongation of ISVs. The points to be addressed are listed in recommendations for the authors.

      The ASPM assay enabled us to define temporal dynamics of YAP/TAZ mechanotransduction. We then used those insights to design ISV washout experiments that tested if the characteristic time scales were conserved in vivo. However, we agree with the limitations identified by the reviewer. Cells behave and respond to mechanical cues differently in 2D vs 3D environments, and the microenvironment in vivo is much more complex. Future work with optogenetic tools will be useful to dissect the temporal kinetics in vivo during ISV elongation.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #3 (Public review): 

      Summary: 

      The manuscript explores behavioral responses of C. elegans to hydrogen sulfide, which is known to exert remarkable effects on animal physiology in a range of contexts. The possibility of genetic and precise neuronal dissection of responses to H2S motivates the study of responses in C. elegans. The revised manuscript does not seem to have significantly addressed what was lacking in the initial version. 

      The authors have added further characterization of possible ASJ sensing of H2S by calcium imaging but ASJ does not appear to be directly involved. Genetic and parallel analysis of O2 and CO2 responsive pathways do not reveal further insights regarding potential mechanisms underlying H2S sensing. Gene expression analysis extends prior work. Finally, the authors have examined how H2S-evoked locomotory behavioral responses are affected in mutants with altered stress and detoxification response to H2S, most notably hif-1 and egl-9. These data, while examining locomotion, are more suggestive that observed effects on animal locomotion are secondary to altered organismal toxicity as opposed to specific behavioral responedse 

      Overall, the manuscript provides a wide range of intriguing observations, but mechanistic insight or a synthesis of disparate data is lacking. 

      We thank the reviewer for the valuable feedback. We agree that while our investigation provides broad coverage, it does not fully resolve the mechanisms of H<sub>2</sub>S perception. As both reviewers noted, the avoidance response to high levels of H<sub>2</sub>S is most likely driven by its toxicity, particularly at the level of mitochondria, rather than by direct perception of H<sub>2</sub>S. We also favor this model and have revised the results and discussion to highlight this interpretation, while acknowledging that other mechanisms cannot be excluded (main changes lines 387-402 and 535-547).

      Building on this view, our observations point toward mitochondrial ROS transients as the trigger for H<sub>2</sub>S avoidance. First, toxic levels of H<sub>2</sub>S are known to promote ROS production (1). Second, similar to acute H<sub>2</sub>S, brief exposure to rotenone, an ETC complex I inhibitor that rapidly generates mitochondrial ROS, triggers locomotory responses (Figure 7E) (Lines 393-396). Third, regardless of duration, rotenone exposure inhibits H<sub>2</sub>S-evoked avoidance (Figure 7E) (Lines 389-391), likely by preventing or dampening H<sub>2</sub>S-evoked mitochondrial ROS bursts when ETC function is impaired and ROS is already high. Notably, animals subjected to prolonged rotenone exposure, ETC mutants, and quintuple sod mutants, each experiencing chronically high ROS levels, fail to respond to H<sub>2</sub>S and display reduced locomotory activity, presumably due to ROS toxicity and/or activation of stress-adaptive mechanisms (Figure 7).

      Consistent with the activation of stress-responsive pathways, H<sub>2</sub>S exposure alters expression of genes controlled by SKN-1 and HIF-1 signaling. Both pathways are ROS-sensitive and promote adaptation to chronic ROS production (2-4). Their activation, as in egl-9, render these animals insensitive to H<sub>2</sub>S-evoked ROS transients (Figure 5B) (Lines 303-305). Conversely, mutants defective in these adaptive pathways, such as hif-1, still show initial locomotory responses to H<sub>2</sub>S, but rapidly lose activity during prolonged H<sub>2</sub>S exposure (Figure 5D) (Lines 318-319). These observations suggest that HIF-1 pathway is dispensable for initiating the response to H<sub>2</sub>S evoked ROS transients, but essential for protecting against ROS toxicity.

      In this context, the neural circuit we examined, such as ASJ neurons, is not directly involved in H<sub>2</sub>S perception (Line 165-169 and 448-457). Instead, it likely modulates a circuit that is responsive to ROS toxicity. This circuit is also influenced by ambient O<sub>2</sub> levels, the state of O<sub>2</sub> sensing circuit, and nutrient status, in a manner reminiscent of the CO<sub>2</sub> responses (5, 6).

      Reviewer #4 (Public review): 

      Summary: 

      The authors establish a behavioral paradigm for avoidance of H2S and conduct a large candidate screen to identify genetic requirements. They follow up by genetically dissecting a large number of implicated pathways - insulin, TGF-beta, oxygen/HIF-1, and mitochondrial ROS, which have varied effects on H2S avoidance. They additionally assay whole-animal gene expression changes induced by varying concentrations and durations of H2S exposure. 

      Strengths: 

      The implicated pathways are tested extensively through mutants of multiple pathway molecules. The authors address previous reviewer concerns by directly testing the ability of ASJ to respond to H2S via calcium imaging. This allows the authors to revise their previous conclusion and determine that ASJ does not directly respond to H2S and likely does not initiate the behavioral response. 

      We thank the reviewer for the supportive comments.

      Weaknesses: 

      Despite the authors focus on acute perception of H2S, I don't think the experiments tell us much about perception. I think they indicate pathways that modulate the behavior when disrupted, especially because most manipulations used broadly affect physiology on long timescales. For instance, genetic manipulation of ASJ signaling, oxygen sensing, HIF-1 signaling, mitochondrial function, as well as starvation are all expected to constitutively alter animal physiology, which could indirectly modulate responses to H2S. The authors rule out effects on general locomotion in some cases, but other physiological changes could relatively specifically modulate the H2S response without being involved in its perception. 

      I am actually not convinced that H2S is directly perceived by the C. elegans nervous system at all. As far as I can tell, the avoidance behavior could be a response to H2S-induced tissue damage rather than the gas itself. 

      We thank the reviewer for the valuable insights, and fully agree that the H<sub>2</sub>S may not be directly perceived by C. elegans. Please see detailed responses below.

      Reviewer #4 (Recommendations for the authors): 

      The clarity of the paper is improved in this version. My main issue has to do with "perception" of H2S. At times the authors suggest that hydrogen sulfide should be perceived by a neural circuit ("we did not specifically identify the neural circuit mediating H2S signaling"), while at other times they discuss the possibility that it is not directly perceived neuronally ("Supporting the idea that acute mitochondrial ROS generation initiates avoidance of high H2S levels,"). The authors should clearly state their model for H2S perception. Do they think there is a receptor and sensory neuron for H2S (not identified in this paper)? If not, what does it mean for there to be a neural circuit mediating the response? To me, it looks more like what is being "perceived" by a neural circuit is ROS-induced toxicity, not H2S itself. 

      To drill down on direct modulation of acute perception, are any of the pathway manipulations used in this paper performed on the timescale of perception? Rotenone for 10 mins is close to that timescale, and in fact it increases speed independently of H2S, consistent with ROSinduced toxicity, not H2S being the signal that induces the behavior. Optogenetic activation of RMG could also be on the acute timescale. Can the authors clarify for how long blue light was on the worms before the start of the assay? Or was it turned on at the same time as video acquisition commenced? This could be evidence that RMG acutely modulates this behavioral response. 

      I feel that the ASJ calcium imaging data should be in the main figure given its importance in revising the original model. 

      We thank the reviewer for the valuable advice.

      As suggested, ASJ calcium imaging data are displayed in the main figure (Figure 2I) (Line 167).

      As both reviewers noted, our initial presentation was not sufficiently clear regarding the mechanism underlying H<sub>2</sub>S avoidance. We agree with the reviewer that H<sub>2</sub>S avoidance is unlikely mediated by direct perception via a H<sub>2</sub>S-specific receptor, but likely arises from acute mitochondrial dysfunction and ROS generation. 

      ROS

      In line with the reviewer’s perspective, our observations point toward mitochondrial ROS transients as the trigger for H<sub>2</sub>S avoidance. First, toxic levels of H<sub>2</sub>S are known to promote ROS production (1). Second, similar to acute H<sub>2</sub>S, brief exposure to rotenone, an ETC complex I inhibitor that rapidly generates mitochondrial ROS, triggers locomotory responses (Figure 7E) (Lines 393-396). Third, regardless of duration, rotenone exposure inhibits H<sub>2</sub>S-evoked avoidance (Figure 7E) (Lines 389-391), likely by preventing or dampening H<sub>2</sub>S-evoked mitochondrial ROS bursts when ETC function is impaired and ROS is already high. Notably, animals subjected to prolonged rotenone exposure, ETC mutants, and quintuple sod mutants, each experiencing chronically high ROS levels, fail to respond to H<sub>2</sub>S and display reduced locomotory activity, presumably due to ROS toxicity and/or activation of stress-adaptive mechanisms (Figure 7). We revised the Results and Discussion to present the model more consistently (main changes lines 387-402 and 535-547).

      Consistent with the activation of stress-responsive pathways, H<sub>2</sub>S exposure alters expression of genes controlled by SKN-1 and HIF-1 signaling. Both pathways are ROS-sensitive and promote adaptation to chronic ROS production (2-4). Their activation, as in egl-9, render these animals insensitive to H<sub>2</sub>S-evoked ROS transients (Figure 5B) (Lines 303-305). Conversely, mutants defective in these adaptive pathways, such as hif-1, still show initial locomotory responses to H<sub>2</sub>S, but rapidly lose activity during prolonged H<sub>2</sub>S exposure (Figure 5D) (Lines 318-319). These observations suggest that HIF-1 pathway is dispensable for initiating the response to H<sub>2</sub> Sevoked ROS transients, but essential for protecting against ROS toxicity.

      ASJ neurons

      ASJ neurons and DAF-11 signaling are required for H<sub>2</sub>S-evoked behavioral responses. However, ASJ does not exhibit an H<sub>2</sub>S-evoked calcium transient. It suggests that ASJ neurons do not directly detect H<sub>2</sub>S (Line 165-169 and 448-457), but likely modulate the circuit responsive to ROS toxicity. This circuit can also be modulated by ambient O<sub>2</sub> levels, the state of O<sub>2</sub> sensing circuit, and nutrient status, in a manner reminiscent of the CO<sub>2</sub> responses (5, 6). 

      O<sub>2</sub> sensing circuit

      Consistent with the reviewer’s view, we favor the model that H<sub>2</sub>S avoidance is likely induced by ROS transients. We believe that the state of O<sub>2</sub> sensing circuit, similar to ASJ neurons, modulates the neural circuit that is responsive to H<sub>2</sub>S-evoked ROS toxicity. This circuit is inhibited as long as O<sub>2</sub> sensing circuit is active. In the RMG optogenetic experiment, channelrhodopsin was photo-stimulated as soon as the assay was initiated at 7% O<sub>2</sub> (Methods Lines 633-634 and Figure legend Lines 1177-1178), therefore RMG remained active throughout the assay including at 7% O<sub>2</sub>. Our interpretation is that RMG activation inhibits this ROSresponsive circuit and H<sub>2</sub>S avoidance. However, these observations do not resolve if H<sub>2</sub>S is acutely and directly perceived. The modulation of H<sub>2</sub>S response by O<sub>2</sub> circuit was discussed between Lines 437-447.

      References

      (1) J. Jia et al., SQR mediates therapeutic effects of H(2)S by targeting mitochondrial electron transport to induce mitochondrial uncoupling. Sci Adv 6, eaaz5752 (2020).

      (2) S. J. Lee, A. B. Hwang, C. Kenyon, Inhibition of Respiration Extends C. elegans Life Span via Reactive Oxygen Species that Increase HIF-1 Activity. Current Biology 20, 2131-2136 (2010).

      (3) C. Lennicke, H. M. Cocheme, Redox metabolism: ROS as specific molecular regulators of cell signaling and function. Mol Cell 81, 3691-3707 (2021).

      (4) D. A. Patten, M. Germain, M. A. Kelly, R. S. Slack, Reactive oxygen species: stuck in the middle of neurodegeneration. J Alzheimers Dis 20 Suppl 2, S357-367 (2010).

      (5) A. J. Bretscher, K. E. Busch, M. de Bono, A carbon dioxide avoidance behavior is integrated with responses to ambient oxygen and food in Caenorhabditis elegans. Proc Natl Acad Sci U S A 105, 8044-8049 (2008).

      (6) E. A. Hallem, P. W. Sternberg, Acute carbon dioxide avoidance in Caenorhabditis elegans. Proc Natl Acad Sci U S A 105, 8038-8043 (2008).

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      Okazaki et al. showed flickering stimuli to patients with unilateral spatial neglect (USN) and measured EEG responses. They compared this with another patient group (post-stroke, but no USN) and healthy controls. The author's rationale was to entrain intrinsic brain rhythms using the flicker of different frequencies (3-30 Hz). Effects found unique to the 9-Hz stimulation condition differentiate USN patients from the other groups, leading them to conclude that USN can be characterized by increased hemispheric alpha asymmetry, driven by a relatively increased response in the intact hemisphere.

      Strengths:

      This study is principled empirical work that benefits from access to special patient groups of considerable size (about 60 stroke patients in total, and 20 USN). The authors use state-of-the-art established methods to (1) deliver and (2) quantify the responses to the flicker stimulation in the EEG recordings. In addition, they use phase-coupling measures to investigate cross-frequency coupling (here: alpha-gamma) and a measure of directed connectivity between brain areas, transfer entropy. The results are supported by means of simulations using a coupled-oscillators model.

      Weaknesses:

      In my eyes, the major conceptual weakness of the study is that the authors make the a priori assumption that the flicker stimulation entrains intrinsic brain rhythms, especially alpha (9 Hz). To date, there is no direct (and only equivocal indirect) evidence that alpha rhythms can be entrained with periodic visual stimulation. In the present study, the assumption of alpha entrainment permeates some analytical decisions - where it would be possible to separate stimulus-driven from intrinsic rhythms more strongly than is currently the case, potentially yielding deeper insights into the oscillopathy of USN - and, ultimately, the interpretation of the results. Another potential issue to consider here is the analysis of gamma rhythms in EEG data, absent a control of miniature eye movements, a known problem (Yuval-Greenberg et al., 2008, https://doi.org/10.1016/j.neuron.2008.03.027) that may be exacerbated here, given that USN patients could show different auxiliary gaze behaviour.

      Reviewer #1 expressed concern that alpha entrainment is assumed a priori; however, our interpretation is based on the empirical observation of frequency-specific (9 Hz) hemispheric asymmetry, not on a prior assumption. This 9 Hz specificity is difficult to explain by a simple summation of stimulus-evoked responses and is more appropriately interpreted as a resonance phenomenon in the alpha band, which is close to the intrinsic resonance frequency of the visual system [1, 2]. In the revision, we will strengthen the conceptual distinction between stimulus-driven and intrinsic components and clarify that entrainment is a conclusion supported by our data and modeling.

      Gamma contamination by eye movements is a valid theoretical concern. However, it is unlikely that saccadic spike potentials explain our α-γ coupling findings, due to several factors including timing constraints and spectral properties. In the revision, we will add explicit discussion of this limitation while explaining why our coupling patterns are more consistent with physiological neural coupling than with artifacts.

      Reviewer #2 (Public review):

      This study investigates how altered neural oscillations may contribute to unilateral spatial neglect (USN) following right-hemisphere stroke. By combining steady-state visual evoked potentials (SSVEPs), phase-amplitude coupling (PAC), transfer entropy (TE), and computational modeling, the authors aim to show that USN arises from disrupted hemispheric synchronization dynamics rather than simply from lesion extent. The integration of empirical EEG data with a mechanistic model is a major strength and offers a valuable new perspective on how frequency-specific neural dynamics relate to clinical symptoms.

      The work has several notable strengths. The combination of experimental and modeling approaches is innovative and powerful, and the findings provide a coherent mechanistic framework linking abnormal neural entrainment to attentional deficits. The study also provides concrete evidence to support the potential for frequency-specific neuromodulatory interventions, which could have translational relevance At the same time, there are areas where the evidence could be clarified or contextualized further. The manuscript would benefit from more detailed characterization of lesions, since differences in lesion topography (white vs. gray matter, occipital vs. parietal areas) could greatly improve our understanding of the physiopathology causing unilateral spatial neglect and the altered neural oscillations reported. Methodological choices, such as focusing analyses on occipital electrodes rather than parietal sites, and the potential influence of volume conduction in transfer entropy analyses, also need clearer justification/elaboration. In addition, while the authors report several neural metrics, it is not always clear why SSVEP power was chosen as the primary correlate of clinical severity over other measures. More broadly, the manuscript would be strengthened by clearer definitions of dependent variables and reporting of software and toolboxes used.

      Overall, the study makes a significant contribution by demonstrating that USN can be conceptualized as a disorder of disrupted oscillatory dynamics. With some clarifications and expansions, the paper will provide readers with a clearer understanding of both the strengths and the limitations of the evidence, and it will stand as a valuable reference for future work on oscillatory mechanisms in stroke and attention.

      We agree that further lesion characterization would be generally useful. However, as shown in Supplementary Figure 1, lesions in our USN cohort involved both cortical and subcortical regions, and cortical damage often extended into adjacent white matter. Therefore, a strict gray-versus-white-matter classification was not feasible. This anatomical diversity suggests that the frequency-specific hemispheric asymmetry observed here cannot be fully explained by lesion location or size alone, but rather may reflect altered network dynamics following right-hemisphere damage. We will clarify this point in the revised Discussion.

      Regarding transfer entropy (TE) and volume conduction, TE is theoretically insensitive to zero-lag correlations and quantifies temporally directed information transfer. Furthermore, we used amplitude envelopes rather than raw oscillations as input, which should greatly reduce the risk of spurious causal estimation due to sinusoidal autocorrelation structure. Moreover, if such spurious connectivity due to autocorrelation had occurred, it would have been expected to appear equally in both feedforward and feedback directions. Therefore, the feedforward-limited (visual→frontal) asymmetry observed in our study cannot be explained by volume conduction or autocorrelation effects. We will maintain this position clearly in the revision.

      Regarding other methodological points: we focused on occipital electrodes (O1/O2) because visual stimuli primarily drive the visual system (we also analyzed parietal sites but found no significant hemispheric differences; Figure 4). We chose SSVEP power for clinical correlation because it was the primary phenomenon distinguishing USN from non-USN patients. In the revision, we will clarify these points and include software and toolbox information.

      We believe these revisions will substantially strengthen the manuscript and clarify the conceptual and methodological contributions of our study.

      References

      (1) Rosanova, M., Casali, A., Bellina, V., Resta, F., Mariotti, M., and Massimini, M. (2009). Natural frequencies of human corticothalamic circuits. J Neurosci 29, 7679-7685.

      (2) Okazaki, Y.O., Nakagawa, Y., Mizuno, Y., Hanakawa, T., and Kitajo, K. (2021). Frequency- and Area-Specific Phase Entrainment of Intrinsic Cortical Oscillations by Repetitive Transcranial Magnetic Stimulation. Front Hum Neurosci 15, 608947.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      The manuscript characterizes a functional peptidergic system in the echinoderm Apostichopus japonicus that is related to the widely conserved family of calcitonin/diuretic hormone 31 (CT/DH31) peptides in bilaterian animals. In vitro analysis of receptor-ligand interactions, using multiple receptor activation assays, identifies three cognate receptors for two CT-like peptides in the sea cucumber, which stimulate cAMP, calcium, and ERK signaling. Only one of these receptors clusters within the family of calcitonin and calcitonin-like receptors (CTR/CLR) in bilaterian animals, whereas two other receptors cluster with invertebrate pigment dispersing factor receptors (PDFRs). In addition, this study sheds light on the expression and in vivo functions of CT-like peptides in A. japonicus, by quantitative real-time PCR, immunohistochemistry, pharmacological experiments on body wall muscle and intestine preparations, and peptide injection and RNAi knockdown experiments. This reveals a conserved function of CT-like peptides as muscle relaxants and growth regulators in A. japonicus.

      Strengths:

      This work combines both in vitro and in vivo functional assays to identify a CT-like peptidergic system in an economically relevant echinoderm species, the sea cucumber A. japonicus. A major strength of the study is that it identifies three G protein-coupled receptors for AjCT-like peptides, one related to the CTR/CLR family and two related to the PDFR family. A similar finding was previously reported for the CT-related peptide DH31 in Drosophila melanogaster that activates both CT-type and PDF-type receptors. Here, the authors expand this observation to a deuterostomian animal, which suggests that receptor promiscuity is a more general feature of the CT/DH31 peptide family and that CT/DH31-like peptides may activate both CT-type and PDF-type receptors in other animals as well.

      Besides the identification of receptor-ligand pairs, the downstream signaling pathways of AjCT receptors have been characterized, revealing broad and in some cases receptor-specific effects on cAMP, calcium, and ERK signaling.

      Functional characterization of the CT-related peptide system in heterologous cells is complemented with ex vivo and in vivo experiments. First, peptide injection and RNAi knockdown experiments establish transcriptional regulation of all three identified receptors in response to changing AjCT peptide levels. Second, ex vivo experiments reveal a conserved role for the two CT-like peptides as muscle relaxants, which have differential effects on body wall muscle and intestine preparations. Finally, peptide injection and knockdown experiments uncover a growth-promoting role for one CT-like peptide (AjCT2). Injection of AjCT2 at high concentration, or long-term knockdown of the AjCT precursor, affects diverse growth-related parameters including weight gain rate, specific growth rate, and transcript levels of growth-regulating transcription factors. The authors also reveal a growth-promoting function for the PDFR-like receptor AjPDFR2, suggesting that this receptor mediates the effects of AjCT2 on growth.

      Weaknesses:

      The authors present a more detailed phylogenetic analysis in the revised version, including a larger number of species. But some clusters in the analysis are not well supported because they have only low bootstrap values. This makes it difficult to interpret the clustering in some parts of the tree.

      Thank you for the reviewer’s comments. In response, we have produced a new phylogenetic analysis using the maximum likelihood method. This was done by Nayeli Escudero Castelán and Kite Jones in the Elphick group at QMUL and therefore they have been added as co-authors of this paper. The new phylogenetic tree (Figure 2, line 206) includes broad taxonomic sampling of CT-type receptors and PDF-type receptors. CRH-type receptors, which are also members of the secretin-type GPCR sub-family, have been included as an outgroup to root the tree. In the previous version the much more distantly related vasopressin/oxytocin-type receptors, which are rhodopsin-type GPCRs, were included as an outgroup. Furthermore, VIP-type receptors were also included in the previous tree but these have been omitted from the new tree because VIP receptor orthologs only occur in vertebrates and therefore they are not representative of a bilaterian GPCR family. The new tree shows high bootstrap support for key clades, notably achieving a bootstrap value of 100 for a clade comprising both deuterostomian and protostomian PDF receptors. This provides important evidence that the A. japonicus PDF-type receptors characterised in this study (AjPDFR1, AjPDFR2) are co-orthologs of the PDF-type receptor that has been characterised previously in Drosophila. Similarly, there is strong bootstrap support (100) for a clade comprising CT/DH31-type receptors and, importantly, the CT-type receptor characterised in this study (AjCTR) is positioned in a branch of this clade that comprises deuterostomian CT-type receptors (with bootstrap support of 100). Details of methods employed to produce the new receptor tree are included in lines 727-739. The new phylogenetic tree is shown below and has been incorporated into the revised manuscript (Figure 2, line 206). The description of new phylogenetic tree has also been modified accordingly in the revised manuscript (line 169-183).

      References:

      Bauknecht P, Jékely G. Large-Scale Combinatorial Deorphanization of Platynereis Neuropeptide GPCRs. Cell reports, 2015, 12(4), 684–693. doi:  10.1016/j.celrep.2015.06.052.

      Beets I, Zels S, Vandewyer E, Demeulemeester J, et al. System-wide mapping of peptide-GPCR interactions in C. elegans. Cell reports, 2023, 42(9), 113058. doi: 10.1016/j.celrep.2023.113058.

      Cardoso J C, Mc Shane J C, Li Z, et al. Revisiting the evolution of Family B1 GPCRs and ligands: Insights from mollusca. Molecular and cellular endocrinology, 2024, 586, 112192. doi: 10.1016/j.mce.2024.112192.

      Gorn A H, Lin H Y, Yamin M, et al. Cloning, characterization, and expression of a human calcitonin receptor from an ovarian carcinoma cell line. The Journal of clinical investigation, 1992, 90(5), 1726–1735. doi: 10.1172/JCI116046.

      Huang T, Su J, Wang X, et al. Functional Analysis and Tissue-Specific Expression of Calcitonin and CGRP with RAMP-Modulated Receptors CTR and CLR in Chickens. Animals: an open access journal from MDPI, 2024, 14(7), 1058. doi: 10.3390/ani14071058.

      Johnson E C, Shafer O T, Trigg J S, et al. A novel diuretic hormone receptor in Drosophila: evidence for conservation of CGRP signaling. Journal of Experimental Biology, 2005, 208(7): 1239-1246. doi: 10.1242/jeb.01529.

      McLatchie L M, Fraser N J, Main M J, et al. RAMPs regulate the transport and ligand specificity of the calcitonin-receptor-like receptor. Nature, 1998, 393(6683): 333-339. doi: 10.1038/30666.

      Schwartz J, Réalis-Doyelle E, Dubos M P, et al. Characterization of an evolutionarily conserved calcitonin signaling system in a lophotrochozoan, the Pacific oyster (Crassostrea gigas). Journal of Experimental Biology, 2019, 222(13): jeb201319. doi: 10.1242/jeb.201319.

      Sekiguchi T, Kuwasako K, Ogasawara M, et al. Evidence for conservation of the calcitonin superfamily and activity-regulating mechanisms in the basal chordate Branchiostoma floridae: insights into the molecular and functional evolution in chordates. Journal of Biological Chemistry, 2016, 291(5): 2345-2356. doi: 10.1074/jbc.M115.664003.

      Expression of CT-like peptides was investigated both at transcript and protein level, but insight into the expression of the three peptide receptors is limited. This makes it difficult to understand the mechanism underlying the (different) functions of the two CT-like peptides in vivo. The authors identify differences in signal transduction cascades activated by each peptide, which might underpin distinct functions, but these differences were established only in heterologous cells.

      We appreciate the reviewer's insightful comments. Regarding expression of CT-like peptide receptors, we have quantitatively analyzed the mRNA expression levels of the three receptors in key tissues using qRT-PCR (Figure 6, line 319) and receptor expression exhibits significant tissue-specific differences. Combined with the heterologous expression assays and In vivo functional validation, we believe our findings have provided clear mechanistic insights into the functional divergence of the two CT-like peptides. Investigation of the expression of the three receptor proteins in A. japonicus would require generation of specific antibodies, which was beyond the scope of this study. Furthermore, immunohistochemical visualization of neuropeptide receptor expression in other invertebrates has not been reported widely, which likely reflects technical difficulties in generation of antibodies that can be used to specifically detect receptor proteins that are typically expressed a low level in comparison to the neuropeptides that act as their ligands. 

      We acknowledge that investigating signal transduction cascades in heterologous cells (rather than native A. japonicus cells) is a limitation. However, as a non-model organism, A. japonicus currently lacks established cell lines for such research. Therefore, using heterologous cells was the most feasible approach to examine the differential signaling cascades activated by the peptides through the three receptors. Importantly, our in vivo experiments demonstrated that long-term knockdown of either the AjCT precursor or AjPDFR2 resulted in similar and significant growth defects. The phenotypic consistency strongly suggests that AjCT2 and AjPDFR2 function within the same signaling pathway, with AjPDFR2 serving as the key receptor functionally activated by AjCT2.

      The authors show overlapping phenotypes for a long-term knockdown of the AjCT precursor and the AjPDFR2 receptor, suggesting that the growth-regulating functions of AjCT2 are mediated by this receptor pathway. However, it remains unclear whether this mechanism underpins the growth-regulating function of AjCT2, until further in vivo evidence for this ligand-receptor interaction is presented. For example, the authors could investigate whether knockdown of AjPDFR2 attenuates the effects of AjCT2 peptide injection. In addition, a functional PDF system in this species remains uncharacterized, and a potential role of PDF-like peptides in growth regulation has not yet been investigated in A. japonicus. Therefore, it also remains unclear whether the ability of CT-like peptides to activate PDFRs is an evolutionary ancient property of this peptide family or whether this is an example of convergent evolution in some protostomian (Drosophila) and deuterostomian (sea cucumber) species.

      Thank you for the reviewer’s insightful comments and constructive questions. We acknowledge the request for more direct evidence to demonstrate how AjCT2 functions in vivo through AjPDFR2. However, long-term knockdown of the AjCT precursor and AjPDFR2 both resulted in identical and significant growth defect phenotypes. The high phenotypic consistency, combined with the activation effect of AjCT2 on AjPDFR2 in heterologous cells, strongly suggests that they function within the same signaling pathway, with AjPDFR2 serving as the key receptor functionally activated by AjCT2. While exogenous peptide injection combined with receptor knockdown is a classic method for verifying receptor activation, phenotypic overlap itself is widely accepted in genetic research as robust evidence for pathway association (Shafer and Taghert, 2009; Van Sinay et al., 2017). A. japonicus is a non-model organism with a 3-month aestivation period in summer followed shortly by winter hibernation. During these periods, we are unable to conduct in vivo experiments. Any single experimental suggestion from reviewers could potentially require one more year of research and we have already conducted an additional year of research, in response to reviewer feedback, since submitting the original manuscript. We hope therefore that these challenges associated with working with aquatic invertebrate non-model organisms is recognized by the reviewers.

      We fully agree that the functional PDF/PDFR system in A. japonicus and its potential role in growth regulation remain uncharacterized. Currently, the precursors of the PDF-type neuropeptide in echinoderms remain unidentified, which precludes clear pharmacological characterization of the two receptors. While further exploration of echinoderm PDF-type neuropeptides is still needed, our phylogenetic analysis-conducted using the maximum likelihood method with optimized parameters and rigorous sequence curation-demonstrates that the deuterostomian PDFRs (including AjPDFR1 and AjPDFR2) are positioned in a clade with the well-characterized protostomian PDFR clades with extremely high bootstrap support (value=100). Therefore, these two receptors in A. japonicus clearly belong to the PDF receptor family and our findings clearly indicate that the ability of CT-like peptides to activate PDFRs is either an evolutionarily ancient and conserved property or has arisen independently in different lineages. Details of methods employed to produce the new receptor tree are included in line 727-739. The new phylogenetic tree is shown below and has been incorporated into the revised manuscript (Figure 2, line 206). The description of new phylogenetic tree has also been modified accordingly in the revised manuscript (line 169-183).

      References:

      Bauknecht P, Jékely G. Large-Scale Combinatorial Deorphanization of Platynereis Neuropeptide GPCRs. Cell reports, 2015, 12(4), 684–693. doi:  10.1016/j.celrep.2015.06.052.

      Beets I, Zels S, Vandewyer E, Demeulemeester J, et al. System-wide mapping of peptide-GPCR interactions in C. elegans. Cell reports, 2023, 42(9), 113058. doi: 10.1016/j.celrep.2023.113058.

      Cardoso J C, Mc Shane J C, Li Z, et al. Revisiting the evolution of Family B1 GPCRs and ligands: Insights from mollusca. Molecular and cellular endocrinology, 2024, 586, 112192. doi: 10.1016/j.mce.2024.112192.

      Gorn A H, Lin H Y, Yamin M, et al. Cloning, characterization, and expression of a human calcitonin receptor from an ovarian carcinoma cell line. The Journal of clinical investigation, 1992, 90(5), 1726–1735. doi: 10.1172/JCI116046.

      Huang T, Su J, Wang X, et al. Functional Analysis and Tissue-Specific Expression of Calcitonin and CGRP with RAMP-Modulated Receptors CTR and CLR in Chickens. Animals: an open access journal from MDPI, 2024, 14(7), 1058. doi: 10.3390/ani14071058.

      Johnson E C, Shafer O T, Trigg J S, et al. A novel diuretic hormone receptor in Drosophila: evidence for conservation of CGRP signaling. Journal of Experimental Biology, 2005, 208(7): 1239-1246. doi: 10.1242/jeb.01529.

      McLatchie L M, Fraser N J, Main M J, et al. RAMPs regulate the transport and ligand specificity of the calcitonin-receptor-like receptor. Nature, 1998, 393(6683): 333-339. doi: 10.1038/30666.

      Schwartz J, Réalis-Doyelle E, Dubos M P, et al. Characterization of an evolutionarily conserved calcitonin signaling system in a lophotrochozoan, the Pacific oyster (Crassostrea gigas). Journal of Experimental Biology, 2019, 222(13): jeb201319. doi: 10.1242/jeb.201319.

      Sekiguchi T, Kuwasako K, Ogasawara M, et al. Evidence for conservation of the calcitonin superfamily and activity-regulating mechanisms in the basal chordate Branchiostoma floridae: insights into the molecular and functional evolution in chordates. Journal of Biological Chemistry, 2016, 291(5): 2345-2356. doi: 10.1074/jbc.M115.664003.

      Shafer, O. T., & Taghert, P. H. (2009). RNA-interference knockdown of Drosophila pigment dispersing factor in neuronal subsets: the anatomical basis of a neuropeptide's circadian functions. PloS one, 4(12), e8298. doi: 10.1371/journal.pone.0008298.

      Van Sinay, E., Mirabeau, O., Depuydt, G., Van Hiel, M. B., Peymen, K., Watteyne, J., Zels, S., Schoofs, L., & Beets, I. (2017). Evolutionarily conserved TRH neuropeptide pathway regulates growth in Caenorhabditis elegans. Proceedings of the National Academy of Sciences of the United States of America, 114(20), E4065–E4074. doi: 10.1073/pnas.1617392114.

      Reviewer #2 (Public review):

      Summary:

      The authors show that A. japonicus calcitonins (AjCT1 and AjCT2) activate not only the calcitonin/calcitonin-like receptor, but they also activate the two "PDF receptors", ex vivo. They also explore secondary messenger pathways that are recruited following receptor activation. They determine the source of CT1 and CT2 using qPCR and in situ hybridization and finally test the effects of these peptides on tissue contractions, feeding and growth. This study provides solid evidence that CT1 and CT2 act as ligands for calcitonin receptors; however, evidence supporting cross-talk between CT peptides and "PDF receptors" is weak.

      Strengths:

      This is the first study to report pharmacological characterization of CT receptors in an echinoderm. Multiple lines of evidence in cell culture (receptor internalization and secondary messenger pathways) support this conclusion.

      Weaknesses:

      The authors claim that A. japonicus CTs activate "PDF" receptors and suggest that this cross-talk is evolutionary ancient since similar phenomenon also exists in the fly Drosophila melanogaster. These conclusions are not fully supported. The authors perform phylogenetic analysis to show that the two "PDF" receptors form an independent clade. The bootstrap support is quite low in a lot of instances, especially for the deuterostomian and protostomian PDFR clades which is below 30. With such low support, it is unclear if the clade comprising deuterostomian "PDFR" is in fact PDFRs and not another receptor type whose endogenous ligand (besides CT) remains to be discovered.

      Thank you for the reviewer’s comments. In response, we have produced a new phylogenetic analysis using the maximum likelihood method. This was done by Nayeli Escudero Castelán and Kite Jones in the Elphick group at QMUL and therefore they have been added as co-authors of this paper. The new phylogenetic tree (Figure 2, line 206) includes broad taxonomic sampling of CT-type receptors and PDF-type receptors. CRH-type receptors, which are also members of the secretin-type GPCR sub-family, have been included as an outgroup to root the tree. In the previous version the much more distantly related vasopressin/oxytocin-type receptors, which are rhodopsin-type GPCRs, were included as an outgroup. Furthermore, VIP-type receptors were also included in the previous tree but these have been omitted from the new tree because VIP receptor orthologs only occur in vertebrates and therefore they are not representative of a bilaterian GPCR family. The new tree shows high bootstrap support for key clades, notably achieving a bootstrap value of 100 for a clade comprising both deuterostomian and protostomian PDF receptors. This provides important evidence that the A. japonicus PDF-type receptors characterized in this study (AjPDFR1, AjPDFR2) are co-orthologs of the PDF-type receptor that has been characterized previously in Drosophila. Similarly, there is strong bootstrap support (100) for a clade comprising CT/DH31-type receptors and, importantly, the CT-type receptor characterized in this study (AjCTR) is positioned in a branch of this clade that comprises deuterostomian CT-type receptors (with bootstrap support of 100). Details of methods employed to produce the new receptor tree are included in lines 727-739. The new phylogenetic tree is shown below and has been incorporated into the revised manuscript (Figure 2, line 206). The description of new phylogenetic tree has also been modified accordingly in the revised manuscript (line 169-183).

      References:

      Bauknecht P, Jékely G. Large-Scale Combinatorial Deorphanization of Platynereis Neuropeptide GPCRs. Cell reports, 2015, 12(4), 684–693. doi:  10.1016/j.celrep.2015.06.052.

      Beets I, Zels S, Vandewyer E, Demeulemeester J, et al. System-wide mapping of peptide-GPCR interactions in C. elegans. Cell reports, 2023, 42(9), 113058. doi: 10.1016/j.celrep.2023.113058.

      Cardoso J C, Mc Shane J C, Li Z, et al. Revisiting the evolution of Family B1 GPCRs and ligands: Insights from mollusca. Molecular and cellular endocrinology, 2024, 586, 112192. doi: 10.1016/j.mce.2024.112192.

      Gorn A H, Lin H Y, Yamin M, et al. Cloning, characterization, and expression of a human calcitonin receptor from an ovarian carcinoma cell line. The Journal of clinical investigation, 1992, 90(5), 1726–1735. doi: 10.1172/JCI116046.

      Huang T, Su J, Wang X, et al. Functional Analysis and Tissue-Specific Expression of Calcitonin and CGRP with RAMP-Modulated Receptors CTR and CLR in Chickens. Animals: an open access journal from MDPI, 2024, 14(7), 1058. doi: 10.3390/ani14071058.

      Johnson E C, Shafer O T, Trigg J S, et al. A novel diuretic hormone receptor in Drosophila: evidence for conservation of CGRP signaling. Journal of Experimental Biology, 2005, 208(7): 1239-1246. doi: 10.1242/jeb.01529.

      McLatchie L M, Fraser N J, Main M J, et al. RAMPs regulate the transport and ligand specificity of the calcitonin-receptor-like receptor. Nature, 1998, 393(6683): 333-339. doi: 10.1038/30666.

      Schwartz J, Réalis-Doyelle E, Dubos M P, et al. Characterization of an evolutionarily conserved calcitonin signaling system in a lophotrochozoan, the Pacific oyster (Crassostrea gigas). Journal of Experimental Biology, 2019, 222(13): jeb201319. doi: 10.1242/jeb.201319.

      Sekiguchi T, Kuwasako K, Ogasawara M, et al. Evidence for conservation of the calcitonin superfamily and activity-regulating mechanisms in the basal chordate Branchiostoma floridae: insights into the molecular and functional evolution in chordates. Journal of Biological Chemistry, 2016, 291(5): 2345-2356. doi: 10.1074/jbc.M115.664003.

      Reviewer #2 (Recommendations for the authors):

      Figure 1C: The bootstrap support is quite low in a lot of instances, especially for the deuterostomian and protostomian PDFR clades which is below 30. With such support, I would be hesitant to label the blue clade as deuterostomian PDFR for two reasons: 1) no members of this clade have been shown to be activated by a PDF-like substance and 2) the current study shows that these receptors are activated by CT-type peptides. Therefore, the phylogenetic analyses do not support the conclusions of this paper. What is the basis for calling these receptors PDFR and not CTR in light of weak phylogenetic support?

      Thank you for the reviewer’s comments. In response, we have produced a new phylogenetic analysis using the maximum likelihood method. This was done by Nayeli Escudero Castelán and Kite Jones in the Elphick group at QMUL and therefore they have been added as co-authors of this paper. The new phylogenetic tree (Figure 2, line 206) includes broad taxonomic sampling of CT-type receptors and PDF-type receptors. CRH-type receptors, which are also members of the secretin-type GPCR sub-family, have been included as an outgroup to root the tree. In the previous version the much more distantly related vasopressin/oxytocin-type receptors, which are rhodopsin-type GPCRs, were included as an outgroup. Furthermore, VIP-type receptors were also included in the previous tree but these have been omitted from the new tree because VIP receptor orthologs only occur in vertebrates and therefore they are not representative of a bilaterian GPCR family. The new tree shows high bootstrap support for key clades, notably achieving a bootstrap value of 100 for a clade comprising both deuterostomian and protostomian PDF receptors. This provides important evidence that the A. japonicus PDF-type receptors characterized in this study (AjPDFR1, AjPDFR2) are co-orthologs of the PDF-type receptor that has been characterized previously in Drosophila. Similarly, there is strong bootstrap support (100) for a clade comprising CT/DH31-type receptors and, importantly, the CT-type receptor characterized in this study (AjCTR) is positioned in a branch of this clade that comprises deuterostomian CT-type receptors (with bootstrap support of 100). Details of methods employed to produce the new receptor tree are included in lines 727-739 The new phylogenetic tree is shown below and has been incorporated into the revised manuscript (Figure 2, line 206). The description of new phylogenetic tree has also been modified accordingly in the revised manuscript (line 169-183).

      We agree with the reviewer that no members of the PDF-type receptor clade in deuterostomes have yet been shown to be activated by a PDF-like substance. That is because the precursors of the PDF-type neuropeptides in echinoderms remain unidentified so far, which precludes clear pharmacological characterization of these receptors within the deuterostomian PDFR clade. However, the new phylogenetic tree now provides strong support (bootstrap value = 100) for the clade comprising deuterostomian and protostomian PDFRs, confirming the classification of AjPDFR1 and AjPDFR2 as PDF-type receptors. 

      References:

      Bauknecht P, Jékely G. Large-Scale Combinatorial Deorphanization of Platynereis Neuropeptide GPCRs. Cell reports, 2015, 12(4), 684–693. doi:  10.1016/j.celrep.2015.06.052.

      Beets I, Zels S, Vandewyer E, Demeulemeester J, et al. System-wide mapping of peptide-GPCR interactions in C. elegans. Cell reports, 2023, 42(9), 113058. doi: 10.1016/j.celrep.2023.113058.

      Cardoso J C, Mc Shane J C, Li Z, et al. Revisiting the evolution of Family B1 GPCRs and ligands: Insights from mollusca. Molecular and cellular endocrinology, 2024, 586, 112192. doi: 10.1016/j.mce.2024.112192.

      Gorn A H, Lin H Y, Yamin M, et al. Cloning, characterization, and expression of a human calcitonin receptor from an ovarian carcinoma cell line. The Journal of clinical investigation, 1992, 90(5), 1726–1735. doi: 10.1172/JCI116046.

      Huang T, Su J, Wang X, et al. Functional Analysis and Tissue-Specific Expression of Calcitonin and CGRP with RAMP-Modulated Receptors CTR and CLR in Chickens. Animals: an open access journal from MDPI, 2024, 14(7), 1058. doi: 10.3390/ani14071058.

      Johnson E C, Shafer O T, Trigg J S, et al. A novel diuretic hormone receptor in Drosophila: evidence for conservation of CGRP signaling. Journal of Experimental Biology, 2005, 208(7): 1239-1246. doi: 10.1242/jeb.01529.

      McLatchie L M, Fraser N J, Main M J, et al. RAMPs regulate the transport and ligand specificity of the calcitonin-receptor-like receptor. Nature, 1998, 393(6683): 333-339. doi: 10.1038/30666.

      Schwartz J, Réalis-Doyelle E, Dubos M P, et al. Characterization of an evolutionarily conserved calcitonin signaling system in a lophotrochozoan, the Pacific oyster (Crassostrea gigas). Journal of Experimental Biology, 2019, 222(13): jeb201319. doi: 10.1242/jeb.201319.

      Sekiguchi T, Kuwasako K, Ogasawara M, et al. Evidence for conservation of the calcitonin superfamily and activity-regulating mechanisms in the basal chordate Branchiostoma floridae: insights into the molecular and functional evolution in chordates. Journal of Biological Chemistry, 2016, 291(5): 2345-2356. doi: 10.1074/jbc.M115.664003.

      The new results following AjCT and AjPDFR2 knockdown are a welcome addition. While this additional evidence supports the claim that AjCT could mediate its effects via AjPDFR2, this evidence does not show that AjCT acts as an endogenous ligand for PDFR in vivo. In combination with the weak phylogenetic analyses, I would recommend the authors to key down their claims that they have functionally characterized a PDFR (in the title and text).

      Thank you for your insightful comments and we do understand the reviewer’s concern. 

      Regarding “the weak phylogenetic analyses”, as highlighted above, we have produced a new phylogenetic tree (Fig 2, line 206) that provides strong bootstrap support for the clade comprising deuterostome and protostome PDF-type receptors. For this reason, it is our opinion that inclusion of “pigment-dispersing factor-type receptors” in the title of the paper is appropriate. The details of phylogenetic analysis method were added in line 727-739, and the updated phylogenetic tree has been incorporated into the revised manuscript (Figure 2, line 206). The description of new phylogenetic tree has also been modified accordingly in the revised manuscript (line 169-183). Besides, long-term knockdown of the AjCT precursor and AjPDFR2 both resulted in identical and significant growth defect phenotypes. And the observation of phenotypic overlap is widely accepted in genetic research as strong evidence for pathway association (Shafer and Taghert, 2009; Van Sinay et al., 2017). This high degree of phenotypic consistency, coupled with our in vitro finding that AjCT2 specifically activates AjPDFR2, strongly supports the conclusion that AjCT2 and AjPDFR2 function within the same signaling pathway in vivo, with AjPDFR2 serving as the key receptor functionally activated by AjCT2.

      References:

      Shafer, O. T., & Taghert, P. H. (2009). RNA-interference knockdown of Drosophila pigment dispersing factor in neuronal subsets: the anatomical basis of a neuropeptide's circadian functions. PloS one, 4(12), e8298. doi: 10.1371/journal.pone.0008298.

      Van Sinay, E., Mirabeau, O., Depuydt, G., Van Hiel, M. B., Peymen, K., Watteyne, J., Zels, S., Schoofs, L., & Beets, I. (2017). Evolutionarily conserved TRH neuropeptide pathway regulates growth in Caenorhabditis elegans. Proceedings of the National Academy of Sciences of the United States of America, 114(20), E4065–E4074. doi: 10.1073/pnas.1617392114.

      Since there is no formal logic defining the use of "type" vs "like" vs "related", I would encourage the authors to use one term (of their choice) to avoid unnecessary confusion. Or another possibility is that these relationships are defined at some point in the manuscript so that it becomes clear to the reader.

      Thank you for the reviewer’s comments. The “CT-related peptides” has defined in the Introduction (line 54-58). As per your suggestion, we have now defined both “CT-type peptides” and “CT-like peptides” in the Introduction (line 76-79). “CT-type peptides” are characterized by an N-terminal disulphide bridge, whereas “CT-like peptides” (diuretic hormone 31 (DH31)-type peptides) lack this feature. Additionally, in accordance with the definitions, we have corrected these three descriptions in the revised manuscript (line 80, 83, 88 for “CT-type peptides”) to ensure consistent and accurate usage of these terms.

      "To provide in vivo evidence supporting CT-mediated activation of "PDF" receptors, we conducted the following experiments: Firstly, we confirmed that AjPDFR1 and AjPDFR2were the functional receptors of AjCT1and AjCT2 (Figure 2, 3 and 4). Secondly, injection of AjCT2 and siAjCTP1/2-1 in vivo induced corresponding changes in AjPDFR1and AjPDFR2expression levels in the intestine (Figure 8C, 9A, 9B and 9C)."

      None of these experiments provide direct evidence that CT activates PDFR in vivo. The functional studies are indeed a welcome addition but they cannot discriminate between correlation and causation.

      Thank you for the reviewer’s insightful comments. We agree that the functional studies do not constitute direct proof that CT’s activation of PDFR in vivo. However, we observed identical and significant growth defect phenotypes following long-term knockdown of the AjCT precursor and the AjPDFR2. This high degree of phenotypic congruence, combined with the established in vitro activation of AjPDFR2 by AjCT2, provides strong support for the conclusion that AjCT2 acts as the key endogenous ligand activating the AjPDFR2 signaling pathway in vivo. Importantly, such phenotypic overlap has been widely accepted in genetic research as strong evidence for functional pathway association (Shafer and Taghert, 2009; Van Sinay et al., 2017).

      References:

      Shafer, O. T., & Taghert, P. H. (2009). RNA-interference knockdown of Drosophila pigment dispersing factor in neuronal subsets: the anatomical basis of a neuropeptide's circadian functions. PloS one, 4(12), e8298. doi: 10.1371/journal.pone.0008298.

      Van Sinay, E., Mirabeau, O., Depuydt, G., Van Hiel, M. B., Peymen, K., Watteyne, J., Zels, S., Schoofs, L., & Beets, I. (2017). Evolutionarily conserved TRH neuropeptide pathway regulates growth in Caenorhabditis elegans. Proceedings of the National Academy of Sciences of the United States of America, 114(20), E4065–E4074. doi: 10.1073/pnas.1617392114.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      This preprint from Shaowei Zhao and colleagues presents results that suggest tumorous germline stem cells (GSCs) in the Drosophila ovary mimic the ovarian stem cell niche and inhibit the differentiation of neighboring non-mutant GSC-like cells. The authors use FRT-mediated clonal analysis driven by a germline-specific gene (nos-Gal4, UASp-flp) to induce GSC-like cells mutant for bam or bam's cofactor bgcn. Bam-mutant or bgcn-mutant germ cells produce tumors in the stem cell compartment (the germarium) of the ovary (Figure 1). These tumors contain non-mutant cells - termed SGC for single-germ cells. 75% of SGCs do not exhibit signs of differentiation (as assessed by bamP-GFP) (Figure 2). The authors demonstrate that block in differentiation in SGC is a result of suppression of bam expression (Figure 2). They present data suggesting that in 73% of SGCs, BMP signaling is low (assessed by dad-lacZ) (Figure 3) and proliferation is less in SGCs vs GSCs. They present genetic evidence that mutations in BMP pathway receptors and transcription factors suppress some of the non-autonomous effects exhibited by SGCs within bam-mutant tumors (Figure 4). They show data that bam-mutant cells secrete Dpp, but this data is not compelling (see below) (Figure 5). They provide genetic data that loss of BMP ligands (dpp and gbb) suppresses the appearance of SGCs in bam-mutant tumors (Figure 6). Taken together, their data support a model in which bam-mutant GSC-like cells produce BMPs that act on nonmutant cells (i.e., SGCs) to prevent their differentiation, similar to what is seen in the ovarian stem cell niche. 

      Strengths:

      (1) Use of an excellent and established model for tumorous cells in a stem cell microenvironment.

      (2) Powerful genetics allow them to test various factors in the tumorous vs nontumorous cells.

      (3) Appropriate use of quantification and statistics.

      We greatly appreciate these comments.

      Weaknesses:

      (1) What is the frequency of SGCs in nos>flp; bam-mutant tumors? For example, are they seen in every germarium, or in some germaria, etc, or in a few germaria?

      This is a great question. Because the SGC phenotype depends on the presence of germline tumor clones, our quantification was restricted to germaria that contained them.These quantification data ("SGCs and/or germline cysts per germarium with germline clones") will be presented in the revised Figure 1.

      (2) Does the breakdown in clonality vary when they induce hs-flp clones in adults as opposed to in larvae/pupae?

      Our initial attempts to induce ovarian hs-flp germline clones by heat-shocking adult flies were unsuccessful, with very few clones being observed. Therefore, we shifted our approach to an earlier developmental stage. Successful induction was achieved by subjecting late-L3/early-pupal animals to a twice-daily heatshock at 37°C for 6 consecutive days (2 hours per session with a 6-hour interval, see Lines 325-329) (Zhao et al., 2018).

      (3) Approximately 20-25% of SGCs are bam+, dad-LacZ+. Firstly, how do the authors explain this? Secondly, of the 70-75% of SGCs that have no/low BMP signaling, the authors should perform additional characterization using markers that are expressed in GSCs (i.e., Sex lethal and nanos).

      These 20-25% of SGCs are bamP-GFP<sup>+</sup> dad-lacZ-, not bam<sup>+</sup> dad-lacZ<sup>+</sup> (see Figure 2C and 3D). They would be cystoblast-like cells that may have initiated a differentiation program toward forming germline cysts (see Lines 109-117). The 70-75% of SGCs that have low BMP signaling exhibit GSC-like properties, including: 1) dot-like spectrosomes; 2) dad-lacZ positivity; 3) absence of bamP-GFP expression. While additional markers would be beneficial, we think that this combination of properties is sufficient to classify these cells as GSC-like. 

      (4) All experiments except Figure 1I (where a single germarium with no quantification) were performed with nos-Gal4, UASp-flp. Have the authors performed any of the phenotypic characterizations (i.e., figures other than Figure 1) with hs-flp?

      Yes, we initially identified the SGC phenotype through hs-flp-mediated mosaic analysis of bam or bgcn mutant in ovaries. However, as noted in our response to Weakness (2), this approach was very labor-intensive. Therefore, we switched to using the more convenient nos::flp system for subsequent experiments. To our observation, there was no difference in the SGC phenotype between these two approaches, confirming that the nos::flp system is a valid and more practical alternative for its study. 

      (5) Does the number of SGCs change with the age of the female? The experiments were all performed in 14-day-old adult females. What happens when they look at a young female (like 2-day-old). I assume that the nos>flp is working in larval and pupal stages, and so the phenotype should be present in young females. Why did the authors choose this later age? For example, is the phenotype more robust in older females? Or do you see more SGCs at later time points?

      These are very good questions. Such time-course analysis data will be provided in revised Figure 1. The SGC phenotype depends on the presence of bam or bgcn mutant germline clones. Germaria from 14-day-old flies contained bigger and more such clones than those from younger flies. This age-dependent increase in clone size and frequency significantly enhanced the efficiency of our quantification (see Lines 129-131). 

      (6) Can the authors distinguish one copy of GFP versus 2 copies of GFP in germ cells of the ovary? This is not possible in the Drosophila testis. I ask because this could impact the clonal analyses diagrammed in Figure 4A and 4G and in 6A and B. Additionally, in most of the figures, the GFP is saturated, so it is not possible to discern one vs two copies of GFP.

      We greatly appreciate this comment. It was also difficult for us to distinguish 1 and 2 copies of GFP in the Drosophila ovary. In Figure 4A-F, to resolve this problem, we used a triplecolor system, in which red germ cells (RFP<sup>+/+</sup> GFP<sup>-/-</sup>) are bam mutant, yellow germ cells (RFP<sup>+/-</sup> GFP<sup>+/-</sup>) are wild-type, and green germ cells (RFP<sup>-/-</sup> GFP<sup>+/+</sup>) are punt or med mutant. In Figure 4G-J, we quantified the SGC phenotype only in black germ cells (GFP<sup>-/-</sup>), which are wild-type (control) or mad mutant.  In Figure 6, we quantified the SGC phenotype only in green germ cells (both GFP<sup>+/+</sup> and GFP<sup>+/-</sup>), all of which are wild-type.

      (7) More evidence is needed to support the claim of elevated Dpp levels in bam or bgcn mutant tumors. The current results with the dpp-lacZ enhancer trap in Figure 5A, B are not convincing. First, why is the dpp-lacZ so much brighter in the mosaic analysis (A) than in the no-clone analysis (B)? It is expected that the level of dpplacZ in cap cells should be invariant between ovaries, and yet LacZ is very faint in Figure 5B. I think that if the settings in A matched those in B, the apparent expression of dpp-lacZ in the tumor would be much lower and likely not statistically significant. Second, they should use RNA in situ hybridization with a sensitive technique like hybridization chain reactions (HCR) - an approach that has worked well in numerous Drosophila tissues, including the ovary.

      We appreciate this critical comment. The settings of immunofluorescent staining and confocal parameters in Figure 5A were the same as those in 5B. To our observation, the level of dpp-lacZ in cap cells was variable across germaria, even within the same ovary, as quantified in Figure 5C. We will provide RNA in situ hybridization data to further strengthen the conclusion that bam or bgcn mutant germline tumors secret BMP ligands.  

      (8) In Figure 6, the authors report results obtained with the bamBG allele. Do they obtain similar data with another bam allele (i.e., bamdelta86)?

      No. Given that bam<sup>BG</sup> was functionally indistinguishable from bam<sup>Δ86</sup> in inducing the SGC phenotype (compare Figure 6F, I with Figure 6-figure supplement 3C), we believe that repeating these experiments with bam<sup>Δ86</sup> would be redundant and would not alter the key conclusion of our study. Thanks for the understanding!

      Reviewer #2 (Public review):

      While the study by Zhang et al. provides valuable insights into how germline tumors can non-autonomously suppress the differentiation of neighboring wild-type germline stem cells (GSCs), several conceptual and technical issues limit the strength of the conclusions.

      Major points:

      (1) Naming of SGCs is confusing. In line 68, the authors state that "many wild-type germ cells located outside the niche retained a GSC-like single-germ-cell (SGC) morphology." However, bam or bgcn mutant GSCs are also referred to as "SGCs," which creates confusion when reading the text and interpreting the figures. The authors should clarify the terminology used to distinguish between wild-type SGCs and tumor (bam/bgcn mutant) SGCs, and apply consistent naming throughout the manuscript and figure legends.

      We apologize for any confusion. In our manuscript, the term "SGC" is reserved specifically for wild-type germ cells that maintain a GSC-like morphology outside the niche. bam or bgcn mutant germ cells are referred to as GSC-like tumor cells (Lines 87-88), not SGCs.

      (a) The same confusion appears in Figure 2. It is unclear whether the analyzed SGCs are wild-type or bam mutant cells. If the SGCs analyzed are Bam mutants, then the lack of Bam expression and failure to differentiate would be expected and not informative. However, if the SGCs are wild-type GSCs located outside the niche, then the observation would suggest that Bam expression is silenced in these wildtype cells, which is a significant finding. The authors should clarify the genotype of the SGCs analyzed in Figure 2C, as this information is not currently provided.

      The SGCs analyzed in Figure 2A-C are wild-type, GSC-like cells located outside the niche. They were generated using the same genetic strategy depicted in Figures 1C and 1E (with the schematic in Figure 1B). The complete genotypes for all experiments are available in Source data 1. 

      (b) In Figures 4B and 4E, the analysis of SGC composition is confusing. In the control germaria (bam mutant mosaic), the authors label GFP⁺ SGCs as "wild-type," which makes interpretation unclear. Note, this is completely different from their earlier definition shown in line 68.

      The strategy to generate SGCs in Figure 4B-F (with the schematic in Figure 4A) is completely different from that in Figure 1C-F, H, and I (with the schematic in Figure 1B). In Figure 4B-F, we needed to distinguish punt<sup>-/-</sup> (or med<sup>-/-</sup>) with punt<sup>+/-</sup> (or med<sup>+/-</sup>) germ cells. As noted in our response to Reviewer #1’s Weakness (6), it was difficult for us to distinguish 1 and 2 copies of GFP in the Drosophila ovary. Therefore, we chose to use the triple-color system to distinguish these germ cells in Figure 4B-F (see genotypes in Source data 1). 

      (c) Additionally, bam⁺/⁻ GSCs (the first bar in Figure 4E) should appear GFP⁺ and Red⁺ (i.e., yellow). It would be helpful if the authors could indicate these bam⁺/⁻ germ cells directly in the image and clarify the corresponding color representation in the main text. In Figure 2A, although a color code is shown, the legend does not explain it clearly, nor does it specify the identity of bam⁺/⁻ cells alone. Figure 4F has the same issue, and in this graph, the color does not match Figure 4A.

      The color-to-genotype relationships for the schematics in Figures 2A and 4E are provided in Figures 1B and 4A, respectively. Due to the high density of germ cells, it is impractical to label each genotype directly in the images. In contrast to Figure 4E, the colors in Figure 4F do not represent genotypes; instead, blue denotes the percentage of SGCs, and red denotes the percentage of germline cysts, as indicated below the bar chart. 

      (2) The frequencies of bam or bgcn mutant mosaic germaria carrying [wild-type] SGCs or wild-type germ cell cysts with branched fusomes, as well as the average number of wild-type SGCs per germarium and the number of days after heat shock for the representative images, are not provided when Figure 1 is first introduced. Since this is the first time the authors describe these phenotypes, including these details is essential. Without this information, it is difficult for readers to follow and evaluate the presented observations.

      Thanks for this constructive suggestion. We will include such quantification data in the revised manuscript.

      (3) Without the information mentioned in point 2, it causes problems when reading through the section regarding [wild-type] SGCs induced by impairment of differentiation or dedifferentiation. In lines 90-97, the authors use the presence of midbodies between cystocytes as a criterion to determine whether the wild-type GSCs surrounded by tumor GSCs arise through dedifferentiation. However, the cited study (Mathieu et al., 2022) reports that midbodies can be detected between two germ cells within a cyst carrying a branched fusome upon USP8 loss.

      Unlike wild-type cystocytes, which undergo incomplete cytokinesis and lack midbodies, those with USP8 loss undergo complete cell division, with the presence of midbodies (white arrow, Figure 1F’ from Mathieu et al., 2022) as a marker of the late cytokinesis stage (Mathieu et al., 2022). 

      (a) Are wild-type germ cell cysts with branched fusomes present in the bam mutant mosaic germaria? What is the proportion of germaria containing wild-type SGCs versus those containing wild-type germ cell cysts with branched fusomes?

      (b) If all bam mutant mosaic germaria carry only wild-type GSCs outside the niche and no germaria contain wild-type germ cell cysts with branched fusomes, then examining midbodies as an indicator of dedifferentiation may not be appropriate.

      We greatly appreciate this critical comment. bam mutant mosaic germaria indeed contained wild-type germline cysts, as evidenced by an SGC frequency of ~70%, rather than 100% (see Figures 2H, 4F, 4J, 6F, 6I, and Figure 6-figure supplement 3C). Since the SGC phenotype depends on the presence of bam or bgcn mutant germline tumors, we quantified it as “the percentage of SGCs relative to the total number of SGCs and germline cysts that are surrounded by germline tumors” (see Lines 124-129). Quantifying the SGC phenotype as "the percentage of germaria with SGCs" would be imprecise. This is because the presence and number of SGCs were highly variable among germaria with bam mutant germline clones, and a small number of germaria entirely lacked these clones. We will provide the data of "SGCs and/or germline cysts per germarium with germline clones" in revised Figure 1.

      (c) If, however, some germaria do contain wild-type germ cell cysts with branched fusomes, the authors should provide representative images and quantify their proportion.

      Such representative germaria are shown in Figure 2G, 3B, 3C, 6D, 6E, and 6H. The percentage of germline cysts can be calculated by “100% - SGC%”.

      (d) In line 95, although the authors state that 50 germ cell cysts were analyzed for the presence of midbodies, it would be more informative to specify how many germaria these cysts were derived from and how many biological replicates were examined.

      As noted in our response to points a) and b) above, the germ cells surrounded by germline tumors, rather than germarial numbers, are more precise for analyzing the phenotype. For this experiment, we examined >50 such germline cysts via confocal microscopy. As the analysis was performed on a defined cellular population, this sample size should be sufficient to support our conclusion. 

      (4) Note that both bam mutant GSCs and wild-type SGCs can undergo division to generate midbodies (double cells), as shown in Figure 4H. Therefore, the current description of the midbody analysis is confusing. The authors should clarify which cell types were examined and explain how midbodies were interpreted in distinguishing between cell division and differentiation.

      We assayed for the presence of midbodies or not specifically within the germline cysts surrounded by bam mutant tumors, not within the tumors themselves (Lines 94-95). As detailed in Lines 88-97, the absence of midbodies was used as a key criterion to exclude the possibility of dedifferentiation.  

      (5) The data in Figure 5 showing Dpp expression in bam mutant tumorous GSCs are not convincing. The Dpp-lacZ signal appears broadly distributed throughout the germarium, including in escort cells. To support the claim more clearly, the authors should present corresponding images for Figures 5D and 5E, in which dpp expression was knocked down in the germ cells of bam or bgcn mutant mosaic germaria. Showing these images would help clarify the localization and specificity of Dpp-lacZ expression relative to the tumorous GSCs.

      We greatly appreciate this comment. RNA in situ hybridization data will be provided to further strengthen the conclusion that bam or bgcn mutant germline tumors secret BMP ligands.

      (6) While Figure 6 provides genetic evidence that bam mutant tumorous GSCs produce Dpp to inhibit the differentiation of wild-type SGCs, it should be noted that these analyses were performed in a dpp⁺/⁻ background. To strengthen the conclusion, the authors should include appropriate controls showing [dpp⁺/⁻; bam⁺/⁻] SGCs and [dpp⁺/⁻; bam⁺/⁻] germ cell cysts without heat shock (as referenced in Figures 6F and 6I).

      Schematic cartoons in Figure 6A and 6B demonstrate that these analyses were performed in a dpp<sup>+/-</sup> background. Figure 6-figure supplement 1 indicates that dpp<sup>+/-</sup> or gbb<sup>+/-</sup> does not affect GSC maintenance, germ cell differentiation, and female fly fertility. Figure 6C is the control for 6D and 6E, and 6G is the control for 6H, with quantification in 6F and 6I.  We used nos::flp, not the heat shock method, to induce germline clones in these experiments (see genotypes in Source data 1).

      (7) Previous studies have reported that bam mutant germ cells cause blunted escort cell protrusions (e.g., Kirilly et al., Development, 2011), which are known to contribute to germ cell differentiation (e.g., Chen et al., Frontiers in Cell and Developmental Biology, 2022). The authors should include these findings in the Discussion to provide a broader context and to acknowledge how alterations in escort cell morphology may further influence differentiation defects in their model.

      Thanks for teaching us! Such discussion will be included in the revised manuscript.

      (8) Since fusome morphology is an important readout of SGCs vs differentiation. All the clonal analysis should have fusome staining.

      SGC is readily distinguishable from multi-cellular germline cyst based on morphology. In some clonal analysis experiments, fusome staining was not feasible due to technical limitations such as channel saturation or antibody incompatibility. Thanks for the understanding! 

      (9) Figure arrangement. It is somewhat difficult to identify the figure panels cited in the text due to the current panel arrangement.

      The figure panels were arranged to optimize space while ensuring that related panels are grouped in close proximity for logical comparison. We would be happy to consider any specific suggestions for an alternative layout that could improve clarity. Thanks!

      (10) The number of biological replicates and germaria analyzed should be clearly stated somewhere in the manuscript-ideally in the Methods section or figure legends. Providing this information is essential for assessing data reliability and reproducibility.

      Thanks for this constructive suggestion. Such information will be included in figure legends in the revised manuscript.

      Reviewer #3 (Public review):

      Summary:

      Zhang et al. investigated how germline tumors influence the development of neighboring wild-type (WT) germline stem cells (GSC) in the Drosophila ovary. They report that germline tumors inhibit the differentiation of neighboring WT GSCs by arresting them in an undifferentiated state, resulting from reduced expression of the differentiation-promoting factor Bam. They find that these tumor cells produce low levels of the niche-associated signaling molecules Dpp and Gbb, which suppress bam expression and consequently inhibit the differentiation of neighboring WT GSCs non-cell-autonomously. Based on these findings, the authors propose that germline tumors mimic the niche to suppress the differentiation of the neighboring stem cells.

      Strengths:

      This study addresses an important biological question concerning the interaction between germline tumor cells and WT germline stem cells in the Drosophila ovary. If the findings are substantiated, they could provide valuable insights applicable to other stem cell systems.

      We greatly appreciate these comments.

      Weaknesses:

      Previous work from Xie's lab demonstrated that bam and bgcn mutant GSCs can outcompete WT GSCs for niche occupancy. Furthermore, a large body of literature has established that the interactions between escort cells (ECs) and GSC daughters are essential for proper and timely germline differentiation (the differentiation niche). Disruption of these interactions leads to arrest of germline cell differentiation in a status with weak BMP signaling activation and low bam expression, a phenotype virtually identical to what is reported here. Thus, it remains unclear whether the observed phenotype reflects "direct inhibition by tumor cells" or "arrested differentiation due to the loss of the differentiation niche". Because most data were collected at a very late stage (more than 10 days after clonal induction), when tumor cells already dominate the germarium, this question cannot be solved. To distinguish between these two possibilities, the authors could conduct a time-course analysis to examine the onset of the WT GSC-like singlegerm-cell (SGC) phenotype and determine whether early-stage tumor clones with a few tumor cells can suppress the differentiation of neighboring WT GSCs with only a few tumor cells present. If tumor cells indeed produce Dpp and Gbb (as proposed here) to inhibit the differentiation of neighboring germline cells, a small cluster or probably even a single tumor cell generated at an early stage might prevent the differentiation of their neighboring germ cells.

      Thanks for this critical comment. Such time-course analysis data will be provided in revised Figure 1.

      The key evidence supporting the claim that tumor cells produce Gpp and Gbb comes from Figures 5 and 6, which suggest that tumor-derived dpp and gbb are required for this inhibition. However, interpretation of these data requires caution. In Figure 5, the authors use dpp-lacZ to support the claim that dpp is upregulated in tumor cells (Figure 5A and 5B). However, the background expression in somatic cells (ECs and pre-follicular cells) differs noticeably between these panels. In Figure 5A, dpp-lacZ expression in somatic cells in 5A is clearly higher than in 5B, and the expression level in tumor cells appears comparable to that in somatic cells (dpplacZ single channel). Similarly, in Figure 5B, dpp-lacZ expression in germline cells is also comparable to that in somatic cells. Providing clear evidence of upregulated dpp and gbb expression in tumor cells (for example, through single-molecular RNA in situ) would be essential.

      We greatly appreciate this critical comment. In our data, the expression of dpp-lacZ in cap cells was variable across germaria, even within the same ovary, as quantified in Figure 5C. The images in Figures 5A and 5B were selected as representative examples of positive signaling. To directly address the reviewer's point and strengthen our conclusion, we will perform RNA in situ hybridization data in the revised manuscript to visualize the expression of BMP ligands within the bam or bgcn mutant germline tumor cells.

      Most tumor data present in this study were collected from the bam[86] null allele, whereas the data in Figure 6 were derived from a weaker bam[BG] allele. This bam[BG] allele is not molecularly defined and shows some genetic interaction with dpp mutants. As shown in Figure 6E, removal of dpp from homozygous bam[BG] mutant leads to germline differentiation (evidenced by a branched fusome connecting several cystocytes, located at the right side of the white arrowhead). In Figure 6D, fusome is likely present in some GFP-negative bam[BG]/bam[BG] cells. To strengthen their claim that the tumor produces Dpp and Gbb to inhibit WT germline cell differentiation, the authors should repeat these experiments using the bam[86] null allele.

      Although a structure resembling a "branched fusome" is visible in Figure 6E (right of the white arrowhead), it is an artifact resulting from the cytoplasm of GFP-positive follicle cells, which also stain for α-Spectrin, projecting between germ cells of different clones (see the merged image). In both our previous (Zhang et al., 2023) and current studies, bam<sup>BG</sup> was functionally indistinguishable from bam<sup>Δ86</sup> in its ability to block GSC differentiation and induce the SGC phenotype (compare Figure 6F, I with Figure 6-figure supplement 3C). Given this, we believe that repeating the extensive experiments in Figure 6 with the bam<sup>Δ86</sup> allele would be scientifically redundant and would not change the key conclusion of our study. We thank the reviewer for their consideration.

      It is well established that the stem niche provides multiple functional supports for maintaining resident stem cells, including physical anchorage and signaling regulation. In Drosophila, several signaling molecules produced by the niche have been identified, each with a distinct function - some promoting stemness, while others regulate differentiation. Expression of Dpp and Gbb alone does not substantiate the claim that these tumor cells have acquired the niche-like property. To support their assertion that these tumors mimic the niche, the authors should provide additional evidence showing that these tumor cells also express other niche-associated markers. Alternatively, they could revise the manuscript title to more accurately reflect their findings.

      Dpp and Gbb are the key niche signals from cap cells for maintaining GSC stemness. Our work demonstrates that germline tumors can specifically mimic this signaling function, not the full suite of cap cell properties, to create a non-cell-autonomous differentiation block. The current title “Tumors mimic the niche to inhibit neighboring stem cell differentiation” reflects this precise concept: a partial, functional mimicry of the niche's most relevant activity in this context. We feel it is an appropriate and compelling summary of our main conclusion.

      In the Method section, the authors need to provide details on how dpp-lacZ expression levels were quantified and normalized.

      Thanks for this suggestion. Such information will be included in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The authors report the structure of the human CTF18-RFC complex bound to PCNA. Similar structures (and more) have been reported by the O'Donnell and Li labs. This study should add to our understanding of CTF18-RFC in DNA replication and clamp loaders in general. However, there are numerous major issues that I recommend the authors fix. 

      Strengths: 

      The structures reported are strong and useful for comparison with other clamp loader structures that have been reported lately. 

      Weaknesses: 

      The structures don't show how CTF18-RFC opens or loads PCNA. There are recent structures from other groups that do examine these steps in more detail, although this does not really dampen this reviewer's enthusiasm. It does mean that the authors should spend their time investigating aspects of CTF18-RFC function that were overlooked or not explored in detail in the competing papers. The paper poorly describes the interactions of CTF18-RFC with PCNA and the ATPase active sites, which are the main interest points. The nomenclature choices made by the authors make the manuscript very difficult to read. 

      Reviewer #2 (Public review): 

      Summary 

      Briola and co-authors have performed a structural analysis of the human CTF18 clamp loader bound to PCNA. The authors purified the complexes and formed a complex in solution. They used cryo-EM to determine the structure to high resolution. The complex assumed an auto-inhibited conformation, where DNA binding is blocked, which is of regulatory importance and suggests that additional factors could be required to support PCNA loading on DNA. The authors carefully analysed the structure and compared it to RFC and related structures. 

      Strength & Weakness 

      Their overall analysis is of high quality, and they identified, among other things, a human-specific beta-hairpin in Ctf18 that flexibly tethers Ctf18 to Rfc2-5. Indeed, deletion of the beta-hairpin resulted in reduced complex stability and a reduction in a primer extension assay with Pol ε. This is potentially very interesting, although some more work is needed on the quantification. Moreover, the authors argue that the Ctf18 ATP-binding domain assumes a more flexible organisation, but their visual representation could be improved. 

      The data are discussed accurately and relevantly, which provides an important framework for rationalising the results. 

      All in all, this is a high-quality manuscript that identifies a key intermediate in CTF18dependent clamp loading. 

      Reviewer #3 (Public review): 

      Summary: 

      CTF18-RFC is an alternative eukaryotic PCNA sliding clamp loader that is thought to specialize in loading PCNA on the leading strand. Eukaryotic clamp loaders (RFC complexes) have an interchangeable large subunit that is responsible for their specialized functions. The authors show that the CTF18 large subunit has several features responsible for its weaker PCNA loading activity and that the resulting weakened stability of the complex is compensated by a novel beta hairpin backside hook. The authors show this hook is required for the optimal stability and activity of the complex. 

      Relevance: 

      The structural findings are important for understanding RFC enzymology and novel ways that the widespread class of AAA ATPases can be adapted to specialized functions. A better understanding of CTF18-RFC function will also provide clarity into aspects of DNA replication, cohesion establishment, and the DNA damage response. 

      Strengths: 

      The cryo-EM structures are of high quality enabling accurate modelling of the complex and providing a strong basis for analyzing differences and similarities with other RFC complexes. 

      Weaknesses: 

      The manuscript would have benefitted from more detailed biochemical analysis to tease apart the differences with the canonical RFC complex. 

      I'm not aware of using Mg depletion to trap active states of AAA ATPases. Perhaps the authors could provide a reference to successful examples of this and explain why they chose not to use the more standard practice in the field of using ATP analogues to increase the lifespan of reaction intermediates. 

      Overall appraisal: 

      Overall the work presented here is solid and important. The data is sufficient to support the stated conclusions and so I do not suggest any additional experiments. 

      Reviewer #1 (Recommendations for the authors): 

      We thank the reviewer for their positive comments and for their thorough review. All raised points have been addressed below.

      Major points 

      (1) The nomenclature used in the paper is very confusing and sometimes incorrect. The authors refer to CTF18 protein as "Ctf18", and the entire CTF18-RFC complex as "CTF18". This results in massive confusion because it is hard to ascertain whether the authors are discussing the individual subunits or the entire complex. Because these are human proteins, each protein name should be fully capitalized (i.e. CTF18, RFC4 etc). The full complex should be referred to more clearly with the designation CTF18-RFC or CTF18-RLC (RFC-like complex). Also, because the yeast and human clamp loader complexes use the same nomenclature for different subunits, it would be best for the authors to use the "A, B, C, D, E subunit" nomenclature that has been standard in the field for the past 20 years. Finally, the authors try to distinguish PCNA subunits by labeling them "PCNA2" or "PCNA1" (see Page 8 lines 180,181 for an example). This is confusing because the names of the RFC subunits have similar formats (RFC2, RFC3, RFC4, etc). In the case of RFC this denotes unique genes, whereas PCNA is a homotrimer. Could the authors think of another way to denote the different subunits, such as super/subscript? PCNA-I, PCNA-II, PCNA-III? 

      We thank the reviewer for pointing out the confusing nomenclature. Following the referee suggestion, we now refer to the CTF18 full complex as “CTF18-RFC”. We prefer keeping the nomenclature used for CTFC18 subunits as RFC2, RFC3 etc., as recently used in Yuan et al, Science, 2024. However, we followed the referee’s suggestion for PCNA subunits, now referred to as PCNA-I, PCNA-II and PCNA-III.

      (2) I believe that the authors are over-interpreting their data in Figure 1. The claim that "less sharp definition" of the map corresponding to the AAA+ domain of Ctf18 supports a relatively high mobility of this subunit is largely unsubstantiated. There are several reasons why one could get varying resolution in a cryo-EM reconstruction, such as compositional heterogeneity, preferred orientation artifacts, or how the complex interacts with the air-water interface. If other data were presented that showed this subunit is flexible, this evidence would support that data but cannot alone as justification for subunit mobility. Along these lines, how was the buried surface area (2300 vs 1400 A2) calculated? Is this the total surface area or only the buried surface area involving the AAA+ domains? It is surprising that these numbers are so different considering that the subunits and complexes look so similar (Figures 1c and 2b). 

      We respectfully disagree with the suggestion that our interpretation of local flexibility in the AAA+ domain of Ctf18 is overreaching. Several lines of evidence support this interpretation. First, compositional heterogeneity is unlikely, as the A′ domain of Ctf18 is well-resolved and forms stable interactions with RFC3, indicating that Ctf18 is consistently incorporated into the complex. Second, preferred orientation artifacts are excluded, as the particle distribution shows excellent angular coverage (Fig. S9a). Third, we now include a 3D variability analysis (3DVA; Supplementary Video 1), which reveals local conformational heterogeneity centered around the AAA+ domain of Ctf18, consistent with intrinsic flexibility.

      Regarding the buried surface area values, the reported numbers refer specifically to the interfaces between the AAA+ domain of Ctf18 and RFC2, and are derived from buried surface area calculations performed with PISA. The smaller interface (~1400 Ų) compared to RFC1–RFC2 (~2300 Ų) reflects low sequence identity (~26%) and divergent structural features, including the absence of conserved elements such as the canonical PIP-box in Ctf18. We have clarified and expanded this explanation in the revised manuscript (Page 7).

      (3) The authors very briefly discuss interactions with PCNA and how the CTF18-RFC complex differs from the RFC complex. This is amongst the most interesting results from their work, but also not well-developed. Moreover, Figure 3D describing these interactions is extremely unclear. I feel like this observation had potential to be interesting, but is largely ignored by the authors. 

      We thank the referee for pointing this out. We have expanded the section describing the interactions of CTF18-RFC and PCNA (Page 9 in the new manuscript), and made a new panel figure with further details (Fig. 3D).  

      (4) The authors make the observation that key ATP-binding residues in RFC4 are displaced and incompatible with nucleotide binding in their CTF18-RFC structure compared to the hRFC structure. This should be a main-text figure showing these displacements and how it is incompatible with ATP binding. Again, this is likely an interesting finding that is largely glossed over by the authors. 

      We now discuss this feature in detail (Pag 11 in the new manuscript), and added two figure insets (Fig. 4c) describing the incompatibility of RFC4 with nucleotide binding.

      (5) The authors claim that the work of another group (citation 50) "validate(s) our predictions regarding the significant similarities between CTF18-RFC and canonical RFC in loading PCNA onto a ss/dsDNA junction." However, as far as this reviewer can tell the work in citation 50 was posted online before the first draft of this manuscript appeared on biorxiv, so it is dubious to claim that these were "predictions." 

      We agree with the referee about this claim. We have now revised the text as follows:

      “While our work was being finalized, several cryo-EM structures of human CTF18-RFC bound to PCNA and primer/template DNA were reported by another group (He et al, PNAS, 2024). These findings are consistent with the distinct features of CTF18-RFC observed in our structures and independently support the notion of significant mechanistic similarity between CTF18-RFC and canonical RFC in loading PCNA onto a ss/dsDNA junction”.

      (6) The authors use a primer extension assay to test the effects of truncating the Nterminal beta hairpin of CTF18. However, this assay is only a proxy for loading efficiency and the observed effects of the mutation are rather subtle. The authors could test their hypothesis more clearly if they performed an ATPase assay or even better a clamp loading assay. 

      We thank the referee for this valuable suggestion. In response, we have performed clamp loading assays comparing the activities of human RFC, wild-type CTF18-RFC, and the β-hairpin–truncated CTF18-RFC mutant. The results, now presented in Fig. 6 and Table 1 of the revised manuscript, clearly show that truncation of the N-terminal βhairpin results in a slower rate of PCNA loading. We propose that this reduced loading rate likely contributes to the diminished Pol ε–mediated DNA synthesis observed in the primer extension assays.

      Minor points 

      (1) Page 3 line 53 the introduction suggests that ATP hydrolysis prompts clamp closure. While this may be the case, to my knowledge all recent structural work shows that closure can occur without ATP hydrolysis. It may be better to rephrase it to highlight that under normal loading conditions, ATP hydrolysis occurs before clamp closure. 

      The text now reads (Page 3): 

      “DNA binding prompts the closure of the clamp and hydrolysis of ATP induces the concurrent disassembly of the closed clamp loader from the sliding clamp-DNA complex, completing the cycle necessary for the engagement of the replicative polymerases to start DNA synthesis.”

      (2) Page 3 line 60, I do not see how the employment of alternative loaders highlights the specificity of the loading mechanism - would it not be possible for multiple loaders to have promiscuous clamp loading? 

      We thank the referee for this comment. The text now reads (Page 3):

      “However, eukaryotes also employ alternative loaders (20), including CTF18-RFC (6, 21-24), which likely use a conserved loading mechanism but are functionally specialized through specific protein interactions and context-dependent roles in DNA replication.”

      (3) Page 4 line 75 could you please cite a study that shows Ctf8 and Dcc1 bind to the Ctf18 C-terminus and that a long linker is predicted to be flexible? 

      Two references have been added (Stokes et al, NAR, 2020 and Grabarczyk et al, Structure, 2018)

      (4) Figure 2A has the N-terminal region of Ctf18 as bound to RFC3 but should likely be labeled as bound to RFC5. This caused significant confusion while trying to parse this figure. Further, the inclusion of "X" as a sequence - does this refer to a sequence that was not buildable in the cryo-EM map? I would be surprised that density immediately after the conserved DEXX box motif is unbuildable. If this is the case, it should be clearly stated in the figure legend that "X" denotes an unbuildable sequence. For the conserved beta-hairpin in the sequence, could the authors superimpose the AlphaFold prediction onto their structure? It would be more informative than just looking at the sequence. 

      We apologize for this confusion. The error in Figure 2A has been corrected. The figure caption now explicitely says that “X” refers to amino acid residues in the sequence which were not modelled. A superposition of the cryo-EM model of the N-terminal Beta hairpin in human Ctf18 and AlphaFold predictions for this feature in drosophila and yeast Ctf18 is now presented in Figure 2A.

      (5) Page 8 line 168, the use of the term "RFC5" here feels improper, since the "C" subunit is not RFC5 in all lower eukaryotes (see comment above about nomenclature). For instance, in S cerevisiae, the C subunit is RFC3. I would expect this interaction to be maintained in all C subunits, not all RFC5 subunits. 

      The text now reads (Page 8):

      “Therefore, lower eukaryotes may use a similar b-hairpin motif to bind the corresponding subunit of the RFC-module complex (RFC5 in human, Rfc3 in S. cerevisiae), emphasizing its importance.”  

      (6) Page 10 line 228, the authors claim that hydrolysis is dispensable at the Ctf18/RFC2 interface based on evidence from RFC1/RFC2 interface, by analogy that this is the "A/B" interface in both loaders. However, the wording makes it sound as if the cited data were collected while studying Ctf18 loaders. The authors should clarify this point. 

      The text has been modified as follows (Pag 11): 

      “Prior research has indicated that hydrolysis at the large subunit/RFC2 interface is not essential for clamp loading by various loaders (48-51), while the others are critical for the clamp-loading activity of eukaryotic RFCs. “

      (7) Page 11 line 243/244 the authors introduce the separation pin. Could they clarify whether Ctf18 contains any aromatic residues in this structural motif that would suggest it serves the same functional purpose? Also, the authors highlight this is similar to yeast RFC, which makes it sound like this is not conserved in human RFC, but the structural motif is also conserved in human RFC. 

      We thank the reviewer for this helpful comment. We have clarified in the revised text (Page 12) that the separation pin is conserved not only in yeast RFC but also in human RFC, and now note that human Ctf18 also harbors aromatic residues at the corresponding positions. This observation is supported by the new panel in Figure 4e.

      Minutia 

      (1) Page 2 line 37 please remove the word "and" before PCNA. 

      This has been corrected.

      (2) Please define AAA+ and update the language to clarify that not all pentameric AAA+ ATPases are clamp loaders. 

      AAA+ has been now defined (Page 3).

      (3) Page 4 line 86 Given the relatively weak interaction of Pol ε. 

      This has been corrected.

      (4) Page 8 line 204 the authors likely mean "leucine" and not "lysine". 

      We thank the reviewer for catching this. The error has been corrected.

      (5) Page 14 line 300, the authors claim that CTF18 utilizes three subunits but then list four. 

      We have corrected this.

      Reviewer #2 (Recommendations for the authors): 

      We thank the reviewer for their positive comments and valuable suggestions. The points raised by the referee have been addressed below.

      Major point: 

      (1) Please quantify Figure 6 and S9 from 3 independent repeats and determine the standard deviation to show the variability of the Ctf18 beta hairpin deletion.  The authors suggest that a suboptimal Ctf18 complex interaction with PCNA impacts the stability of the complex, but do not test this hypothesis. Could the suboptimal PIP motif in Ctf18 be changed to an improved motif and the impact tested in the primer extension assay? Although not essential, it would be a nice way to explore the mechanism. 

      We thank the reviewer for the suggestion. However, we note that Figure 6b (now 7b) already presents the quantification of the primer extension assay from three independent replicates, with error bars showing standard deviations, and includes the calculated rate of product accumulation. These data clearly indicate a 42% reduction in primer synthesis rate upon deletion of the Ctf18 β-hairpin.

      We agree that we do not provide direct evidence of impaired complex stability upon deletion of the Ctf18 β-hairpin. However, the 2D classification of the cryo-EM dataset (Figure S9) shows a marked reduction in the number of particles corresponding to intact CTF18-RFC–PCNA complexes in the β-hairpin deletion sample, with the majority of particles corresponding to free PCNA. This contrasts with the wild-type dataset, where complex particles are predominant. These findings indirectly suggest that deletion of the β-hairpin compromises the stability or assembly of the clamp-loader–clamp complex.

      We thank the reviewer for the valuable suggestion to mutate the weak PIP-box of Ctf18. While an interesting direction, we instead sought to directly test the mechanism by performing quantitative clamp loading assays. These assays revealed a significant reduction in the rate of PCNA loading by the CTF18<sup>Δ165–194</sup>-RFCmutant (Figure 6), supporting the conclusion that the β-hairpin contributes to productive PCNA loading. This loading delay likely underlies the reduced rate of primer extension observed in the Pol ε assay (Figure 7), consistent with impaired formation of processive polymerase– clamp complexes.

      (2) I did not see the method describing how the 2D classes were quantified to evaluate the impact of the Ctf18 beta hairpin deletion on complex formation. Please add the relevant information. 

      The relevant information has been added to the Method section:

      “For quantification of complex stability, the number of particles contributing to each 2D class was extracted from the classification metadata (Datasets 1 and 3). All classes showing isolated PCNA rings were summed and compared to the total number of particles in classes representing intact CTF18-RFC–PCNA complexes. This analysis was performed for both wild-type and β-hairpin deletion mutant datasets. Notably, no 2D classes corresponding to free PCNA were observed in the wild-type dataset, whereas in the mutant dataset, a substantial fraction of particles corresponded to isolated PCNA, suggesting reduced stability of the mutant complex.”

      Minor point: 

      (1) Page 2, line 25. Detail what type of mobility is referred to. Do you mean flexibility in the EM-map? 

      We have clarified this. The text now reads:

      “The unique RFC1 (Ctf18) large subunit of CTF18-RFC, which based on the cryo-EM map shows high relative flexibility, is anchored to PCNA through an atypical low-affinity PIP box”

      (2) Page 4, line 82. Please introduce CMGE, or at least state what the abbreviation stands for. 

      This has been addressed.

      (3) Page 4, line 89. Specify that the architecture of the HUMAN CTF18-RFC module is not known, as the yeast one has been published. 

      At the time our study was initiated, the architecture of the human CTF18-RFC module was unknown. A structure of the human complex was published by another group during the final stages of our work and is now properly acknowledged in the Discussion.

      (4) Page 6. Is it possible to illustrate why the autoinhibited state cannot bind to DNA? A visual representation would be nice. 

      We thank the reviewer for this suggestion. Figure 4b in the original manuscript already illustrates why the autoinhibited, overtwisted conformation of the CTF18-RFC pentamer cannot accommodate DNA. In this state, the inner chamber of the loader is sterically occluded, precluding the binding of duplex DNA.

      Reviewer #3 (Recommendations for the authors): 

      We thank Reviewer #3 for their constructive feedback and positive overall assessment of our work.

      We also thank the reviewer for their remarks on the use of Mg depletion to halt hydrolysis. Magnesium is an essential cofactor for ATP hydrolysis, and its depletion is expected to effectively prevent catalysis by destabilizing the transition state, possibly more completely than the use of slowly hydrolysable analogues such as ATPγS. We have recently employed Mg<sup>²+</sup> depletion to successfully trap a pre-hydrolytic intermediate in a replicative AAA+ helicase engaged in DNA unwinding (Shahid et al., Nature, 2025). This precedent supports the rationale for our choice, and the reference has now been included in the revised manuscript.

      I think the authors deposited the FSC curve for the +Mg structure in the -Mg structure PDB/EMDB entry according to the validation report. 

      We thank the reviewer for their careful inspection of the deposition materials. The discrepancy in the deposited FSC curve has now been corrected, and the appropriate FSC curves have been assigned to the correct PDB/EMDB entries.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      This paper measures the positioning and diffusivity of RNaseE-mEos3.2 proteins in E. coli as a function of rifampicin treatment, compares RNaseE to other E. coli proteins, and measures the effect of changes in domain composition on this localization and motion. The straightforward study is thoroughly presented, including very good descriptions of the imaging parameters and the image analysis/modeling involved, which is good because the key impact of the work lies in presenting this clear methodology for determining the position and mobility of a series of proteins in living bacteria cells. 

      Thank you for the nice summary and positive feedback on the descriptions and methodology. 

      My key notes and concerns are listed below; the most important concerns are indicated with asterisks. 

      (1) The very start of the abstract mentions that the domain composition of RNase E varies among species, which leads the reader to believe that the modifications made to E. coli RNase E would be to swap in the domains from other species, but the experiment is actually to swap in domains from other E. coli proteins. The impact of this work would be increased by examining, for instance, RNase E domains from B. subtilis and C. crescentus as mentioned in the introduction. 

      Thank you for the suggestions. We agree that the sentence may convey an unintended expectation. Our original intention was to note the presence and absence of certain domains of RNase E (e.g. membrane-binding motif and CTD) vary across species, rather than the actual sequence variations. To avoid any misinterpretation, we decided to remove the sentence from the abstract. Using the domains of B. subtilis and C. crescentus RNase E in E. coli is a very interesting suggestion, but we will leave that for a future study. 

      (2) Furthermore, the introduction ends by suggesting that this work will modulate the localization, diffusion, and activity of RNase E for "various applications", but no applications are discussed in the discussion or conclusion. The impact of this work would be increased by actually indicating potential reasons why one would want to modulate the activity of RNase E. 

      Thank you for this suggestion. For example, an E. coli strain expressing membranebound RNase E without CTD can help stabilize mRNAs and enhance protein expression. In fact, this idea was used in a commercial BL21 cell line (Invitrogen’s One Shot BL21 Star), to increase the yield of protein expression. We also think that environmentally modulated MB% of RNase E can be useful for controlling the mRNA half-lives and protein expression levels in different conditions. We discussed these ideas at the end of the Discussion.

      (3) Lines 114 - 115: "The xNorm histogram of RNase E shows two peaks corresponding to each side edge of the membrane": "side edge" is not a helpful term. I suggest instead: "...corresponding to the membrane at each side of the cell" 

      Thank you. We made the suggested change.

      (4) A key concern of this reviewer is that, since membrane-bound proteins diffuse more slowly than cytoplasmic proteins, some significant undercounting of the % of cytoplasmic proteins is expected due to decreased detectability of the faster-moving proteins. This would not be a problem for the LacZ imaging where essentially all proteins are cytoplasmic, but would significantly affect the reported MB% for the intermediate protein constructs. How is this undercounting considered and taken into account? One could, for instance, compare LacZ vs. LacY (or RNase E) copy numbers detected in fixed cells to those detected in living cells to estimate it.  

      Thank you for raising this point and suggesting a possible way to address this. We compared the number of tracks for mEos3.2-fused proteins in live vs fixed cells and tested the undercounting effect of cytoplasmic molecules. We compared WT RNase E molecules in live and fixed cells and found that there are about 50% lower molecules detected in the fixed cells, which agrees with the expectation that fluorescent proteins lose their signal upon fixation. Similarly, cytoplasmic RNase E (RNase E ΔMTS) copy number was also ~50% less in the fixed cells compared to live cells. If cytoplasmic molecules were undercounted compared membrane-bound molecules in live cells, fixation would reduce the copy number less than 50%. The comparable ratio of 50% indicates that the undercounting issue is not significant. This control analysis is provided in Figure S1B-C, and we made corresponding textual change in the result section as below:

      For this analysis, we first confirmed that proteins localized on the membrane and in the cytoplasm are detected with equal probability, despite differences in their mobilities (Fig. S1B-C). 

      (5) The rifampicin treatment study is not presented well. Firstly, it is found that LacY diffuses more rapidly upon rifampicin treatment. This change is attributed to changes in crowding at the membrane due to mRNA. Several other things change in cells after adding rif, including ATP levels, and these factors should be considered. More importantly, since the change in the diffusivity of RNaseE is similar to the change in diffusivity of LacY, then it seems that most of the change in RNaseE diffusion is NOT due to RNaseE-mRNAribosome binding, but rather due to whatever crowding/viscosity effects are experienced by LacY (along these lines: the error reported for D is SEM, but really should be a confidence interval, as in Figure 1, to give the reader a better sense of how different (or similar) 1.47 and 1.25 are). 

      We agree with the reviewer that upon rifampicin treatment, RNase E’s D increases to a similar extent as that of LacY. Hence, the increase likely arises from a factor common to both proteins. We have added the reviewer’s suggested interpretation as a possible explanation in the manuscript as below. 

      The similar fold change in D<sub>RNE</sub> and D<sub>LacY</sub> upon rif treatment suggests that the change in RNE diffusion may largely be attributed to physical changes in the intracellular environment (such as reduced viscosity or macromolecular crowding[41,42]), rather than a loss of RNA-RNE interactions.

      As requested by the reviewer, we have provided confidence intervals for our D values in Table S8. Because these intervals are very narrow, we chose to present the SEM as the error metric for D and have also reported the corresponding errors for the fold-change values whenever we describe the fold differences between D values. 

      (6) Lines 185-189: it is surprising to me that the CTD mutants both have the same change in D (5.5x and 5.3x) relative to their full-length counterparts since D for the membranebound WT protein should be much less sensitive to protein size than D for the cytoplasmic MTS mutant. Can the authors comment? 

      Perhaps the reviewer understood that these differences are the ratios between +/-CTD (e.g. WT RNE vs ΔCTD). However, the differences we mentioned were from membrane-bound vs cytoplasmic versions of RNase E with comparable sizes (e.g. WT RNase E vs RNase E ΔMTS). We modified text and added a summary sentence at the end of the paragraph to clarify the point.

      We found that D<sub>ΔMTS</sub> is ~5.5 times that of D<sub>RNE</sub> (Fig. 3B). [...] Together, these results suggest that the membrane binding reduces RNE mobility by a factor of 5.

      That being said, we also realized a similar fold difference between +/-CTD. Specifically, WT RNE vs RNE ΔCTD (both membrane-bound) show a ~4.1-fold difference and RNE ΔMTS vs RNE ΔMTS ΔCTD (both cytoplasmic) show ~3.9-fold difference. We do not currently do not have a clear explanation for this pattern. Given that these two pairs have a similar change in mass, we speculate that the relationship between D and molecular mass may be comparable for membrane-bound and free-floating RNE variants. 

      (7) Lines 190-194. Again, the confidence intervals and experimental uncertainties should be considered before drawing biological conclusions. It would seem that there is "no significant change" in the rhlB and pnp mutants, and I would avoid saying "especially for ∆pnp" when the same conclusion is true for both (one shouldn't say 1.04 is "very minute" and 1.08 is just kind of small - they are pretty much the same within experiments like this). 

      Thank you for raising this point, which we fully agree with. That being said, we decided to remove results related to the degradosome proteins to improve the flow of the paper. We are preparing another paper related to the RNA degradosome complex formation. 

      (8) Lines 221-223 " This is remarkable because their molecular masses (and thus size) are expected to be larger than that of MTS" should be reconsidered: diffusion in a membrane does not follow the Einstein law (indeed lines 223-225 agree with me and disagree with lines 221-223). (Also the discussion paragraph starting at line 375). Rather, it is generally limited by the interactions with the transmembrane segments with the membrane. So Figure 3D does not contain the right data for a comparison, and what is surprising to me is that MTS doesn't diffuse considerably faster than LacY2. 

      We agree with the reviewer’s point that diffusion in a membrane does not follow the Stokes-Einstein law. That is why we introduced Saffman’s model. However, even in this model, proteins of larger size (or mass) should be slower than smaller size (a reason why we presented Figure 3D, now 4D). In other words, both Einstein and Saffman models predict that larger particles diffuse slower, although the exact scaling relationship differs between two models. Here, we assume that mass is related to the size. Contrary to Saffman’s model for membrane proteins, LacY2 diffuses faster than MTS despite of large size. Using MD simulations, we showed that this discrepancy can be explained by different interaction energies as the reviewer mentioned. This analysis further demonstrates that the size is not the only factor to consider protein diffusion in the membrane. We edited the paragraph to clarify the expectations and our interpretations.

      According to the Stokes-Einstein relation for diffusion in simple fluids[49] and the Saffman-Delbruck diffusion model for membrane proteins, D decreases as particle size increases, albeit with different scaling behaviors. […] Thus, if size (or mass) were the primary determinant of diffusion, LacY2 and LacY6 would diffuse more slowly than the smaller MTS. The observed discrepancy instead implies that D may be governed by how each motif interacts with the membrane. For example, the way that TM domains are anchored to the membrane may facilitate faster lateral diffusion with surrounding lipids. 

      (9) The logical connection between the membrane-association discussion (which seems to ignore associations with other proteins in the cell) and the preceding +/- rifampicin discussion (which seeks to attribute very small changes to mRNA association) is confusing.

      Thank you for raising this point. We re-arranged the second result section to present diffusion due to membrane binding first before rifampicin. Furthermore, we stated our hypothesis and expectations in the beginning of the results section. This addition will legitimate our logic flow.

      (10) Separately, the manuscript should be read through again for grammar and usage. For instance, the title should be: "Single-molecule imaging reveals the *roles* of *the* membrane-binding motif and *the* C-terminal domain of RNase E in its localization and diffusion in Escherichia coli". Also, some writing is unwieldy, for instance, "RNase E's D" would be easier to read if written as D_{RNaseE}. (underscore = subscript), and there is a lot of repetition in the sentence structures. 

      Thank you for catching grammar mistakes. We went through extensive proofreading to avoid these mistakes and also used simple notation suggested by the reviewer, such as D<sub>RNE</sub>, to make it easier to read. Thank you again for your suggestions.

      Reviewer #2 (Public review): 

      Summary: 

      Troyer and colleagues have studied the in vivo localisation and mobility of the E.coli RNaseE (a protein key for mRNA degradation in all bacteria) as well as the impact of two key protein segments (MTS and CTD) on RNase E cellular localisation and mobility. Such sequences are important to study since there is significant sequence diversity within bacteria, as well as a lack of clarity about their functional effects. Using single-molecule tracking in living bacteria, the authors confirmed that >90% of RNaseE localised on the membrane, and measured its diffusion coefficient. Via a series of mutants, they also showed that MTS leads to stronger membrane association and slower diffusion compared to a transmembrane motif (despite the latter being more embedded in the membrane), and that the CTD weakens membrane binding. The study also rationalised how the interplay of MTS and CTD modulate mRNA metabolism (and hence gene expression) in different cellular contexts. 

      Strengths: 

      The study uses powerful single-molecule tracking in living cells along with solid quantitative analysis, and provides direct measurements for the mobility and localisation of E.coli RNaseE, adding to information from complementary studies and other bacteria. The exploration of different membrane-binding motifs (both MTS and CTD) has novelty and provides insight on how sequence and membrane interactions can control function of protein-associated membranes and complexes. The methods and membrane-protein standards used contribute to the toolbox for molecular analysis in live bacteria. 

      Thank you for the nice summary of our work and positive comments about the paper’s strengths.

      Weaknesses: 

      The Results sections can be structured better to present the main hypotheses to be tested. For example, since it is well known that RNase E is membrane-localised (via its MTS), one expects its mobility to be mainly controlled by the interaction with the membrane (rather than with other molecules, such as polysomes and the degradosome). The results indeed support this expectation - however, the manuscript in its current form does not lay down the dominant hypothesis early on (see second Results chapter), and instead considers the rifampicin-addition results as "surprising"; it will be best to outline the most likely hypotheses, and then discuss the results in that light. 

      Thank you for this comment. We addressed this point by stating our main hypothesis from the beginning of the results section. We also agree with the reviewer that the membrane binding effect should be discussed first; hence, we re-arranged the result section. In the revised manuscript, we discuss the effect of membrane binding on diffusion first, followed by rif effects.

      Similarly, the authors should first discuss the different modes of interaction for a peripheral anchor vs a transmembrane anchor, outline the state of knowledge and possibilities, and then discuss their result; in its current version, the ms considers the LacY2 and LacY6 faster diffusion compared to MTS "remarkable", but considering the very different mode of interaction, there is no clear expectation prior to the experiment. In the same section, it would be good to see how the MD simulations capture the motion of LacY6 and LacY12, since this will provide a set of results consistent with the experimental set. 

      Thank you for pointing this out. In fact, there is little discussion in the literature about the different modes of interaction for a peripheral anchor vs a transmembrane anchor. To our knowledge, our work (experiments and MD simulations) is the first that directly compared the two to reveal that the peripheral anchor has higher interaction energy than the transmembrane anchor. We added a sentence “Despite the prevalence of peripheral membrane proteins, how they interact with the membrane and how this differs from TM proteins remain poorly understood”. Furthermore, we added the MD simulation result of LacY6 and LacY12 in Figure 4E-F.

      The work will benefit from further exploration of the membrane-RNase E interactions; e.g., the effect of membrane composition is explored by just using two different growth media (which on its own is not a well-controlled setting), and no attempts to change the MTS itself were made. The manuscript will benefit from considering experiments that explore the diversity of RNaseE interactions in different species; for example, the authors may want to consider the possibility of using the membrane-localisation signals of functional homologs of RNaseE in different bacteria (e.g., B. subtilis). It would be good to look at the effect of CTD deletions in a similar context (i.e., in addition to the MTS substitution by LacY2 and LacY6). 

      Thank you very much for this suggestion. During revision, we engineered point mutations in MTS and analyzed critical hydrophobic residues for membrane binding. We characterized MB% in both +/-CTD variants (Fig. 2 and Fig. S6) and their effect on lacZ mRNA degradation (Fig. 6). We will leave the use of membrane motif of B. subtilis RNase E for future study. 

      The manuscript will benefit from further discussion of the unstructured nature of the CTD, especially since the RNase CTD is well known to form condensates in Caulobacter crescentus; it is unclear how the authors excluded any roles for RNaseE phase separation in the mobility of RNaseE in E.coli cells. 

      Yes, we agree with the reviewer that the intrinsically disordered nature of the CTD might contribute to condensate formation. We explored this possibility using both epifluorescence microscopy (with a YFP fusion) and single-molecule imaging with cluster analysis (using an mEos3.2 fusion). Please see Figure S8. We did observe some weak de-clustering of RNase E upon CTD deletion. In the current study, we are unable to quantify the extent to which clustering contributes to the slow diffusion of RNase E. However, we speculate that the clustering may be linked to the low MB% of certain RNE mutants containing CTD, and we discussed this possibility in the Discussion.

      […] further supporting that the CTD decreases membrane association across RNE variants. We speculate that this effect may be related to the CTD’s role in promoting phase-separated ribonucleoprotein condensates, as observed in Caulobacter crescentus[19]. In E. coli, we also observed a modest increase in the clustering tendency of RNE compared to ΔCTD (Fig. S8). 

      Some statements in the Discussion require support with example calculations or toning down substantially. Specifically, it is not clear how the authors conclude that RNaseE interacts with its substrate for a short time (and what this time may actually be); further, the speculation about the MTS "not being an efficient membrane-binding motif for diffusion" lacks adequate support as it stands. 

      Thank you for these points. To elaborate our point on transient interaction between RNase E and RNA, we added a sentence “Specifically, if RNE interacts with mRNAs for ~20 ms or less, the slow-diffusing state would last shorter than the frame interval and remain undetected in our experiment.” Also, we added this sentence in the discussion.

      One possible explanation is that RNA-bound RNE (and RNase Y) is short-lived compared to our frame interval (~20 ms), unlike other RNA-binding proteins related to transcription and translation, interacting with RNA for ~1 min for elongation [48].

      Plus, we clarified the wording used in the second sentence that the reviewer pointed out as follows,

      Lastly, the slow diffusion of the MTS in comparison to LacY2 and LacY6 suggests that MTS is less favorable for rapid lateral motion in the membrane. 

      Reviewer #3 (Public review): 

      Summary: 

      The manuscript by Troyer et al quantitatively measured the membrane localization and diffusion of RNase E, an essential ribonuclease for mRNA turnover as well as tRNA and rRNA processing in bacteria cells. Using single-molecule tracking in live E. coli cells, the authors investigated the impact of membrane targeting sequence (MTS) and the Cterminal domain (CTD) on the membrane localization and diffusion of RNase E under various perturbations. Finally, the authors tried to correlate the membrane localization of RNase E to its function on co- and post-transcriptional mRNA decay using lacZ mRNA as a model. 

      The major findings of the manuscripts include: 

      (1) WT RNase E is mostly membrane localized via MTS, confirming previous results. The diffusion of RNase E is increased upon removal of MTS or CTD, and more significantly increased upon removal of both regions. 

      (2) By tagging RNase E MTS and different lengths of LacY transmembrane domain (LacY2, LacY6, or LacY12) to mEos3.2, the results demonstrate that short LacY transmembrane sequence (LacY2 and LacY6) can increase the diffusion of mEos3.2 on the membrane compared to MTS, further supported by the molecular dynamics simulation. A similar trend was roughly observed in RNase E mutants with MTS switched to LacY transmembrane domains. 

      (3) The removal of RNase E MTS significantly increases the co-transcriptional degradation of lacZ mRNA, but has minimal effect on the post-transcriptional degradation of lacZ mRNA. Removal of CTD of RNase E overall decreases the mRNA decay rates, suggesting the synergistic effect of CTD on RNase E activity. 

      Strengths: 

      (1) The manuscript is clearly written with very detailed method descriptions and analysis parameters. 

      (2) The conclusions are mostly supported by the data and analysis. 

      (3) Some of the main conclusions are interesting and important for understanding the cellular behavior and function of RNase E. 

      Thank you for your thorough summary of our work and positive comments.

      Weaknesses: 

      (1) Some of the observations show inconsistent or context-dependent trends that make it hard to generalize certain conclusions. Those points are worth discussion at least. Examples include: 

      (a) The authors conclude that MTS segment exhibits reduced MB% when succinate is used as a carbon source compared to glycerol, whereas LacY2 segment maintains 100% membrane localization, suggesting that MTS can lose membrane affinity in the former growth condition (Ln 341-342). However, the opposite case was observed for the WT RNase E and RNase E-LacY2-CTD, in which RNase E-LacY2-CTD showed reduced MB% in the succinate-containing M9 media compared to the WT RNase E (Ln 264-267). This opposite trend was not discussed. In the absence of CTD, would the media-dependent membrane localization be similar to the membrane localization sequence or to the fulllength RNase E? 

      This is a great point. Thank you for pointing out the discrepancy in data. We think the weak membrane interaction of RNaseE-lacY2-CTD likely stems from the structure instability in the presence of the CTD. Our data shows that an RNase E variant with a cytoplasmic population under a normal growth condition exhibits a greater cytoplasmic fraction in a poor growth media. In contrast, RNaseE-MTS and RNaseE-LacY2 lacking the CTD both showed 100% MB% under both normal and poor growth conditions. These results are presented in Figure S6 and further discussed in the Discussion section.

      The loss of MB% in LacY2-based RNE was observed only in the presence of the CTD (Fig. S6D), suggesting that the CTD negatively affects membrane binding of RNE, possibly by altering protein conformation. In fact, all ΔCTD RNE mutants we tested exhibited higher MB% than their CTD-containing counterparts (Fig. S6A-B). 

      (b) When using mEos3.2 reporter only, LacY2 and LacY6 both increase the diffusion of mEos3.2 compared to MTS. However, when inserting the LacY transmembrane sequence into RNase E or RNase E without CTD, only the LacY2 increases the diffusion of RNase E. This should also be discussed. 

      Thank you for raising this point. As the reviewer pointed out, as the membrane motifs, both LacY2 and LacY6 diffuse faster than the MTS, but when they are fused to RNE, only LacY2-based RNE diffuses faster than MTS-based RNE. We speculate that it is possibly due to a structural reason—having four (large) LacY6 in a tetrameric arrangement may cancel out the original fast-diffusing property of LacY6. We added this idea in the result section:

      This result may be due to the high TM load (24 helices) created by four LacY6 anchors in the RNE tetramer. Although all constructs are tetrameric, the 24-helix load (LacY6), compared with 8 (LacY2) and 4 (MTS), likely enlarges the membrane-embedded footprint and increases drag, thereby changing the mobility advantages assessed as standalone membrane anchors.

      (2) The authors interpret that in some cases the increase in the diffusion coefficient is related to the increase in the cytoplasm localization portion, such as for the LacY2 inserted RNase E with CTD, which is rational. However, the authors can directly measure the diffusion coefficient of the membrane and cytoplasm portion of RNase E by classifying the trajectories based on their localizations first, rather than just the ensemble calculation. 

      Thank you for this suggestion. Currently, because of the 2D projection effect from imaging, we cannot clearly distinguish which individual tracks are from the cytoplasm or from the inner membrane based on the localization. Therefore, we are unable to assign individual tracks as membrane-bound or cytoplasmic. However, we can demonstrate that the xNorm data can be separated into two different spatial populations based on the diffusion coefficient. D. That is we can plot xNorm of slow tracks vs xNorm of fast tracks. This analysis showed that the slow tracks have LacY-like xNorm profiles while the fast tracks have LacZ-like xNorm profiles, also quantitatively supporting our MB% fitting results. We have added this analysis to Figure S2.

      (3) The error bars of the diffusion coefficient and MB% are all SEM from bootstrapping, which are very small. I am wondering how much of the difference is simply due to a batch effect. Were the data mixed from multiple biological replicates? The number of biological replicates should also be reported. 

      Thank you for raising this point. In the original manuscript, we reported the number of tracks analyzed and noted that all data was from at least three separate biological replicates (measurements were repeated at least three different days). Furthermore, in the revised manuscript, we have provided the number of cells imaged in Table S6. 

      (4) Some figures lack p-values, such as Figures 4 and 5C-D. Also, adding p-values directly to the bar graphs will make it easier to read. 

      Thank you for checking these details. We added p values in the graphs showing k<sub>d1</sub> and k<sub>d2</sub> (Table S7).

      Reviewer #2 (Recommendations for the authors): 

      Minor and technical points: 

      (1) Clarity and flow will be improved if each section first highlights the objective for the experiments that are described (e.g., line 240). 

      Thank you for the suggestion. We addressed this point by editing the beginning of each subsection in the Results. 

      (2) Line 272 (and elsewhere)."1.33-times faster is wrong". The authors mean 33% faster (from 0.075 to 1, see Figure 4G), and not 133% faster. Needs fixing. 

      Thanks for pointing this out. We changed this as well as other incidences where we talk about the fold difference. For example, this particular incidence was changed to:

      Indeed, in the absence of the CTD, we found that the D of LacY2-based RNE was 1.33 ± 0.01 times as fast as the MTS-based RNE. 

      (3) The authors need to consider the fitting of two species on their D population. e.g., how will a 93% - 7% split between diffusive species would have looked for the distribution in S4B? Note also the L1 profile in Fig S4C - while it is not hugely different from Figure S4B, the analysis gives a 41% amplitude for the fast-diffusing species. The 2-species analysis can also be used on some of the samples with much higher cytoplasmic components. Further, tracks that are in the more central region can be analysed to see whether the fast-diffusing species increase in amplitude. 

      Thank you for this comment. The D histograms of L1 and RNase E show a dominant peak at around 0.015, but L1 has a residual population in the shoulder (note the difference between L1’s experimental data and D1 fit, a yellow line in now Figure S3B). This residual shoulder population is absent in the D histogram of RNase E. We also performed two-species analysis as suggested by the reviewer and provided the result in Figure S3C. The analysis shows that the two-population fit (black line) is very close to one one-population fit (yellow line). While we agree with the reviewer that subpopulation analysis is helpful for other proteins that show <90% MB% (>10% significant cytoplasmic population). we found it useful to divide xNorm histogram into two populations based on the diffusivity (rather than doing two-population fit to the D histogram, which does not have spatial information). This analysis, shown in Figure S2, supports our MB% fit results.

      (4) The authors suggest that the sequestration of RNaseE to the membrane limits its interaction with cytoplasmic mRNAs, and may increase mRNA lifetime. While this is true and supported by the authors' preprint (Ref15), it will also be good to consider (and discuss) that highly-transcribed regions are in the nucleoid periphery (and thus close to the membrane) and that ribosomes/polysomes are likewise predominantly peripheral (coregulation of transcription/translation) and membrane proximal. 

      This is an interesting point, which we appreciate very much. The lacZ gene, when induced, is shown to move to the nucleoid periphery (Yang et al. 2019, Nat Comm). Also, in our preprint (Ref 15), we engineered to have lacZ closer to the membrane, by translationally fusing it to lacY. However, the degradation rate of lacZ mRNA was not enhanced by the proximity to the membrane (for both k<Sub>d1</sub> and k<sub>d2</sub>). For lacZ mRNA, we mainly see the change in k<sub>d1</sub> when RNE localization changes. We think it is due to the slow diffusion of the nascent mRNA (attached to the chromosome) and the slow diffusion of membrane-bound RNE, such that regardless of the location of the nascent mRNA, the degradation by the membrane-bound RNE is inefficient. Only when RNE is free diffusing in the cytoplasm, it seems to increase k<sub>d1</sub> (the decay of nascent mRNAs).

      Reviewer #3 (Recommendations for the authors):

      (1) It will increase the clarity of the manuscript if the authors can provide better nomenclatures for different constructs, such as for different membrane targeting sequences fused to mEos3.2, full-length RNase E, or CDT truncated RNaseE. 

      Thank you for this suggestion. We agree that many constructions were discussed, and their naming can be confusing. To help with clarity, we have abbreviated RNase E as RNE throughout the text where appropriate. 

      (2) Line 342, Figure S7D should be cited instead of S6D. 

      Thank you for finding this error. We made a proper change in the revised manuscript.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      The authors describe the results of a single study designed to investigate the extent to which horizontal orientation energy plays a key role in supporting view-invariant face recognition. The authors collected behavioral data from adult observers who were asked to complete an old/new face matching task by learning broad-spectrum faces (not orientation filtered) during a familiarization phase and subsequently trying to label filtered faces as previously seen or novel at test. This data revealed a clear bias favoring the use of horizontal orientation energy across viewpoint changes in the target images. The authors then compared different ideal observer models (cross-correlations between target and probe stimuli) to examine how this profile might be reflected in the image-level appearance of their filtered images. This revealed that a model looking for the best matching face within a viewpoint differed substantially from human data, exhibiting a vertical orientation bias for extreme profiles. However, a model forced to match targets to probes at different viewing angles exhibited a consistent horizontal bias in much the same manner as human observers.

      Strengths:

      I think the question is an important one: The horizontal orientation bias is a great example of a low-level image property being linked to high-level recognition outcomes, and understanding the nature of that connection is important. I found the old/new task to be a straightforward task that was implemented ably and that has the benefit of being simple for participants to carry out and simple to analyze. I particularly appreciated that the authors chose to describe human data via a lower-dimensional model (their Gaussian fits to individual data) for further analysis. This was a nice way to express the nature of the tuning function, favoring horizontal orientation bias in a way that makes key parameters explicit. Broadly speaking, I also thought that the model comparison they include between the view-selective and view-tolerant models was a great next step. This analysis has the potential to reveal some good insights into how this bias emerges and ask finegrained questions about the parameters in their model fits to the behavioral data.

      We thank the reviewer for their positive appraisal of the importance of our research question as well as of the soundness of our approach to it.

      Weaknesses:

      I will start with what I think is the biggest difficulty I had with the paper. Much as I liked the model comparison analysis, I also don't quite know what to make of the view-tolerant model. As I understand the authors' description, the key feature of this model is that it does not get to compare the target and probe at the same yaw angle, but must instead pick a best match from candidates that are at different yaws. While it is interesting to see that this leads to a very different orientation profile, it also isn't obvious to me why such a comparison would be reflective of what the visual system is probably doing. I can see that the view-specific model is more or less assuming something like an exemplar representation of each face: You have the opportunity to compare a new image to a whole library of viewpoints, and presumably it isn't hard to start with some kind of first pass that identifies the best matching view first before trying to identify/match the individual in question. What I don't get about the view-tolerant model is that it seems almost like an anti-exemplar model: You specifically lack the best viewpoint in the library but have to make do with the other options. Again, this is sort of interesting and the very different behavior of the model is neat to discuss, but it doesn't seem easy to align with any theoretical perspective on face recognition. My thinking here is that it might be useful to consider an additional alternate model that doesn't specifically exclude the best-matching viewpoint, but perhaps condenses appearance across views into something like a prototype. I could even see an argument for something like the yaw-averages presented earlier in the manuscript as the basis for such a model, but this might be too much of a stretch. Overall, what I'd like to see is some kind of alternate model that incorporates the existence of the best-match viewpoint somehow, but without the explicit exemplar structure of the view-specific model.

      The view-tolerant model was designed so that identity needed to be abstracted away from variations in yaw to support face recognition. We believe this model aligns with the notion of tolerant recognition.

      The tolerance of identity recognition is presumably empowered by the internal representation of the natural statistics of identity, i.e. the stable traits and (idiosyncratic) variability of a face, which builds up through the varied encounters with a given face (Burton, Jenkins et al. 2005, Burton, Jenkins and Schweinberger 2011, Jenkins and Burton 2011, Jenkins, White et al. 2011, Burton, Kramer et al. 2016, Menon, Kemp and White 2018).

      The average of various images of a face provides its appearance distribution (i.e., variability) and central tendency (i.e., stable properties; Figure 1) and could be used as a reasonable proxy of its natural statistical properties (Burton, Jenkins et al. 2005). We thus believe that the alternate model proposed by the reviewer is relevant to existing theories of face identity recognition and agree that our current model observers do not fully capture this aspect. It is thus an excellent idea to examine the orientation tuning profile of a model observer that compares a specific view of a face to the average encompassing all views of a face identity. Since the horizontal range is proposed to carry the view-stable cues to identity, we expect that such a ‘viewpoint-average’ model observer will perform best with horizontally filtered faces and that its orientation tuning profile will significantly predict human performance across views. We expect the viewpointtolerant and viewpoint-average observers will behave similarly as they manifest the stability of the horizontal identity cues across variations in viewpoint.

      Besides this larger issue, I would also like to see some more details about the nature of the crosscorrelation that is the basis for this model comparison. I mostly think I get what is happening, but I think the authors could expand more on the nature of their noise model to make more explicit what is happening before these cross-correlations are taken. I infer that there is a noise-addition step to get them off the ceiling, but I felt that I had to read between the lines a bit to determine this.

      The view-selective model responded correctly whenever successfully matching a given face identity at a specific viewpoint to itself. Since there was an exact match in each trial, resulting in uninformative ceiling performance, we decreased the signal-to-noise ratio (SNR) of the target and probe images to .125 (face RMS contrast: .01; noise RMS contrast: .08). In every trial, target and probe faces were each combined with 10 different random noise patterns. SNR was adjusted so that the overall performance of the view-selective model was in the range of human performance. We will describe these important aspects in the methods and add a supplemental with the graphic illustration of the d’ distributions of each model and human observers.

      Another thing that I think is worth considering and commenting on is the stimuli themselves and the extent to which this may limit the outcomes of their behavioral task. The use of the 3D laserscanned faces has some obvious advantages, but also (I think) removes the possibility for pigmentation to contribute to recognition, removes the contribution of varying illumination and expression to appearance variability, and perhaps presents observers with more homogeneous faces than one typically has to worry about. I don't think these negate the current results, but I'd like the authors to expand on their discussion of these factors, particularly pigmentation. Naively, surface color and texture seem like they could offer diagnostic cues to identity that don't rely so critically on horizontal orientations, so removing these may mean that horizontal bias is particularly evident when face shape is the critical cue for recognition.

      We indeed got rid of surface color by converting images to gray scales. While we acknowledge that the conversion to grayscales may have removed one potential source of surface information, it is unlikely that our stimuli fully eliminated the contribution of surface pigmentation in our study. Pigmentation refers to all surface reflectance property (Russell, Sinha et al. 2006) and hue (color) is only one surface cue among others. The grayscaled 3D laser scanned faces used here still contained natural variations in crucial surface cues such as skin albedo (i.e., how light or dark the surface appears) and texture (i.e., spatial variation in how light is reflected). Both color and grayscale stimuli (2D face pictures or 3D laser scanned faces like ours) have actually been used to disentangle the role of shape and surface cues to identity recognition (e.g., Troje and Bulthoff 1996, Vuong, Peissig et al. 2005, Russell, Sinha et al. 2006, Russell, Biederman et al. 2007, Jiang, Dricot et al. 2009).

      More fundamentally, we demonstrated that the diagnosticity of the horizontal range of face information is not restricted to the transmission of shape cues. Our recent work has indeed shown that the processing of both face shape and surface most critically relies on horizontal information (Dumont, Roux-Sibilon and Goffaux 2024).

      Reviewer #2 (Public review):

      This study investigates the visual information that is used for the recognition of faces. This is an important question in vision research and is critical for social interactions more generally. The authors ask whether our ability to recognise faces, across different viewpoints, varies as a function of the orientation information available in the image. Consistent with previous findings from this group and others, they find that horizontally filtered faces were recognised better than vertically filtered faces. Next, they probe the mechanism underlying this pattern of data by designing two model observers. The first was optimised for faces at a specific viewpoint (viewselective). The second was generalised across viewpoints (view-tolerant). In contrast to the human data, the view-specific model shows that the information that is useful for identity judgements varies according to viewpoint. For example, frontal face identities are again optimally discriminated with horizontal orientation information, but profiles are optimally discriminated with more vertical orientation information. These findings show human face recognition is biased toward horizontal orientation information, even though this may be suboptimal for the recognition of profile views of the face.

      One issue in the design of this study was the lowering of the signal-to-noise ratio in the viewselective observer. This decision was taken to avoid ceiling effects. However, it is not clear how this affects the similarity with the human observers.

      The view-selective model responded correctly whenever successfully matching a given face identity at a specific viewpoint to itself. Since there was an exact match in each trial, resulting in uninformative ceiling performance, we decreased the signal-to-noise ratio (SNR) of the target and probe images to .125 (face RMS contrast: .01; noise RMS contrast: .08). In every trial, target and probe faces were each combined with 10 different random noise patterns. SNR was adjusted so that the overall performance of the view-selective model was in the range of human performance. We will describe these important aspects in the methods and add a supplemental with the graphic illustration of the d’ distributions of each model and human observers.

      Another issue is the decision to normalise image energy across orientations and viewpoints. I can see the logic in wanting to control for these effects, but this does reflect natural variation in image properties. So, again, I wonder what the results would look like without this step.

      Energy of natural images is disproportionately distributed across orientations (e.g., Hansen, Essock et al. 2003). Images of faces cropped from their background as used here contain most of their energy in the horizontal range (Keil 2009, Goffaux and Greenwood 2016, Goffaux 2019). If not normalized after orientation filtering, such uneven distribution of energy would boost recognition performance in the horizontal range across views. Normalization was performed across our experimental conditions merely to avoid energy from explaining the influence of viewpoint on the orientation tuning profile.

      We are not aware of any systematic natural variations of energy across face views. To address this, we measured face average energy (i.e., RMS contrast) in the original stimulus set, i.e., before the application of any image processing or manipulation. Background pixels were excluded from these image analyses. Across yaws, we found energy to range between .11 and .14 on a 0 to 1 grayscale. This is moderate compared to the range of energy variations we measured across identities (from .08 to .18). This suggests that variations in energy across viewpoints are moderate compared to variations related to identity. It is unclear whether these observations are specific to our stimulus set or whether they are generalizable to faces we encounter in everyday life. They, however, indicate that RMS contrast did not substantially vary across views in the present study and suggest that RMS normalization is unlikely to have affected the influence of viewpoint on recognition performance.

      Nonetheless, we acknowledge the importance of this issue regarding the trade-off between experimental control and stimulus naturalness, and we will refer to it explicitly in the methods section.

      Despite the bias toward horizontal orientations in human observers, there were some differences in the orientation preference at each viewpoint. For example, frontal faces were biased to horizontal (90 degrees), but other viewpoints had biases that were slightly off horizontal (e.g., right profile: 80 degrees, left profile: 100 degrees). This does seem to show that differences in statistical information at different viewpoints (more horizontal information for frontal and more vertical information for profile) do influence human perception. It would be good to reflect on this nuance in the data.

      Indeed, human performance data indicates that while identity recognition remains tuned to horizontal information, horizontal tuning shows some variation across viewpoints. We primarily focused on the first aspect because of its direct relevance to our research objective, but also discussed the second aspect: with yaw rotation, certain non-horizontal morphological features such as the jaw line or nose bridge, etc. may increasingly contribute to identity recognition, whereas at frontal or near frontal views, features are mostly horizontally-oriented (e.g., Keil 2008, Keil 2009). We will relate this part of the discussion more explicitly to the observation of the fluctuation of the peak location as a function of yaw.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer 1:

      The authors frequently refer to their predictions and theory as being causal, both in the manuscript and in their response to reviewers. However, causal inference requires careful experimental design, not just statistical prediction. For example, the claim that "algorithmic differences between those with BPD and matched healthy controls" are "causal" in my opinion is not warranted by the data, as the study does not employ experimental manipulations or interventions which might predictably affect parameter values. Even if model parameters can be seen as valid proxies to latent mechanisms, this does not automatically mean that such mechanisms cause the clinical distinction between BPD and CON, they could plausibly also refer to the effects of therapy or medication. I recommend that such causal language, also implicit to expressions like "parameter influences on explicit intentional attributions", is toned down throughout the manuscript.

      Thankyou for this chance to be clearer in the language. Our models and paradigm introduce a from of temporal causality, given that latent parameter distributions are directly influenced by latent parameter estimates at a previous point in time (self-uncertainty and other uncertainty directly governs social contagion). Nevertheless, we appreciate the reviewers perspective and have now toned down the language to reflect this.

      Abstract:

      ‘Our model makes clear predictions about the mechanisms of social information generalisation concerning both joint and individual reward.’

      Discussion:

      ‘We can simulate this by modelling a framework that incorporates priors based on both self and a strong memory impression of a notional other (Figure S3).’

      ‘We note a strength of this work is the use of model comparison to understand algorithmic differences between those with BPD and matched healthy controls.’

      Although the authors have now much clearer outlined the stuy's aims, there still is a lack of clarity with respect to the authors' specific hypotheses. I understand that their primary predictions about disruptions to self-other generalisation processes underlying BPD are embedded in the four main models that are tested, but it is still unclear what specific hypotheses the authors had about group differences with respect to the tested models. I recommend the authors specify this in the introduction rather than refering to prior work where the same hypotheses may have been mentioned.

      Thankyou for this further critique which has enabled us to more cleary refine our introduction. We have now edited our introduction to be more direct about our hypotheses, that these hypotheses are instantiated into formal models, and what our predictions were. We have also included a small section on how previous predictions from other computational assessments of BPD link to our exploratory work, and highlighted this throughout the manuscript.

      ‘This paper seeks to address this gap by testing explicitly how disruptions in self-other generalization processes may underpin interpersonal disruptions observed in BPD. Specifically, our hypotheses were: (i) healthy controls will demonstrate evidence for both self-insertion and social contagion, integrating self and other information during interpersonal learning; and (ii) individuals with BPD will exhibit diminished self-other integration, reflected in stronger evidence for observations that assume distinct self-other representations.

      We tested these hypotheses by designing a dynamic, sequential, three-phase Social Value Orientation (Murphy & Ackerman, 2014) paradigm—the Intentions Game—that would provide behavioural signatures assessing whether BPD differed from healthy controls in these generalization processes (Figure 1A). We coupled this paradigm with a lattice of models (M1-M4) that distinguish between self-insertion and social contagion (Figure 1B), and performed model comparison:

      M1. Both self-to-other (self-insertion) and other-to-self (social contagion) occur before and after learning M2. Self-to-other transfer only occurs M3. Other-to-self transfer only occurs M4. Neither transfer process, suggesting distinct self-other representations

      We additionally ran exploratory analysis of parameter differences and model predictions between groups following from prior work demonstrating changes in prosociality (Hula et al., 2018), social concern (Henco et al., 2020), belief stability (Story et al., 2024a), and belief updating (Story, 2024b) in BPD to understand whether discrepancies in self-other generalisation influences observational learning. By clearly articulating our hypotheses, we aim to clarify the theoretical contribution of our findings to existing literature on social learning, BPD, and computational psychiatry.’

      Caveats should also be added about the exploratory nature of the many parameter group comparisons. If there are any predictions about group differences that can be made based on prior literature, the authors should make such links clear.

      Thank you for this. We have now included caveats in the text to highlight the exploratory nature of these group comparisons, and added direct links to relevant literature where able:

      Introduction

      ‘We additionally ran exploratory analysis of parameter differences and model predictions between groups following from prior work demonstrating changes in prosociality (Hula et al., 2018), social concern (Henco et al., 2020), belief stability (Story et al., 2024a), and belief updating (Story, 2024b) in BPD to understand whether discrepancies in self-other generalisation influences observational learning. By clearly articulating our hypotheses, we aim to clarify the theoretical contribution of our findings to existing literature on social learning, BPD, and computational psychiatry.’

      Model Comparison

      ‘We found that CON participants were best fit at the group level by M1 (Frequency = 0.59, Exceedance Probability = 0.98), whereas BPD participants were best fit by M4 (Frequency = 0.54, Exceedance Probability = 0.86; Figure 2A). This suggests CON participants are best fit by a model that fully integrates self and other when learning, whereas those with BPD are best explained as holding disintegrated and separate representations of self and other that do not transfer information back and forth.

      We first explore parameters between separate fits (see Methods). Later, in order to assuage concerns about drawing inferences from different models, we examined the relationships between the relevant parameters when we forced all participants to be fit to each of the models (in a hierarchical manner, separated by group). In sum, our model comparison is supported by convergence in parameter values when comparisons are meaningful (see Supplementary Materials). We refer to both types of analysis below.’

      Phase 2 analysis

      ‘Prior work predicts those with BPD should focus more intently on public social information, rather than private information that only concerns one party (Henco et al., 2020). In BPD participants, only new beliefs about the relative reward preferences – mutual outcomes for both player - of partners differed (see Fig 2E): new median priors were larger than median preferences in phase 1 (mean = -0.47; = -6.10, 95%HDI: -7.60, -4.60).’

      ‘Models of moral preference learning (Story et al., 2024) predicts that BPD vs non-BPD participants have more rigid beliefs about their partners. We found that BPD participants were equally flexible around their prior beliefs about a partner’s relative reward preferences (= -1.60, 95%HDI: -3.42, 0.23), and were less flexible around their beliefs about a partner’s absolute reward preferences (=-4.09, 95%HDI: -5.37, -2.80), versus CON (Figure 2B).’

      Phase 3 analysis

      ‘Prior work predicts that human economic preferences are shaped by observation (Panizza, et al., 2021; Suzuki et al. 2016; Yu et al, 2021), although little-to-no work has examined whether contagion differs for relative vs. absolute preferences. Associative models predict that social contagion may be exaggerated in BPD (Ereira et al., 2018).… As a whole, humans are more susceptible to changing relative preferences more than selfish, absolute reward preferences, and this is disrupted in BPD.’

      Psychometric and Intentional Attribution analysis

      ‘Childhood trauma, persecution, and poor mentalising in BPD are all predicted to disrupt one’s ability to change (Fonagy & Luyten, 2009).’

      ‘Prior work has also predicted that partner-participant preference disparity influences mental state attributions (Barnby et al., 2022; Panizza et al., 2021).’

      I'm not sure I understand why the authors, after adding multiple comparison correction, now list two kinds of p-values. To me, this is misleading and precludes the point of multiple comparison corrections, I therefore recommend they report the FDR-adjusted p-values only. Likewise, if a corrected p-value is greater than 0.05 this should not be interpreted as a result.

      We have now adjusted the exploratory results to include only the FDR corrected values in the text.

      ‘We assessed conditional psychometric associations with social contagion under the assumption of M3 for all participants. We conducted partial correlation analyses to estimate relationships conditional on all other associations and retained all that survived bootstrapping (5000 reps), permutation testing (5000 reps), and subsequent FDR correction. When not controlled for group status, RGPTSB and CTQ scores were both moderately associated with MZQ scores (RGPTSB r = 0.41, 95%CI: 0.23, 0.60, p[fdr]=0.043; CTQ r = 0.354 95%CI: 0.13, 0.56, p[fdr]=0.02). This was not affected by group correction. CTQ scores were moderately and negatively associated with shifts in individualistic reward preferences (; r = -0.25, 95%CI: -0.46, -0.04, p[fdr]=0.03). This was not affected by group correction. MZQ scores were in turn moderately and negatively associated with shifts in prosocial-competitive preferences () between phase 1 and 3 (r = -0.26, 95%CI: -0.46, -0.06, p[fdr]=0.03). This was diminished when controlled for group status (r = 0.13, 95%CI: -0.34, 0.08, p[fdr]=0.20). Together this provides some evidence that self-reported trauma and self-reported mentalising influence social contagion (Fig S11). Social contagion under M3 was highly correlated with contagion under M1 demonstrating parsimony of outcomes across models (Fig S12).

      Prior work has predicted that partner-participant preference disparity influences mental state attributions (Barnby et al., 2022; Panizza et al., 2021). We tested parameter influences on explicit intentional attributions in Phase 2 while controlling for group status. Attributions included the degree to which they believed their partner was motived by harmful intent (HI) and self-interest (SI). According with prior work (Barnby et al., 2022), greater disparity of absolute preferences before learning was associated on a trend level with reduced attributions of SI (<= -0.23, p[fdr]=0.08), and greater disparity of relative preferences before learning exaggerated attributions of HI = 0.21, p[fdr]=0.08), but did not survive correction (Figure S4B). This is likely due to partners being significantly less individualistic and prosocial on average compared to participants (= -5.50, 95%HDI: -7.60, -3.60; = 12, 95%HDI: 9.70, 14.00); partners are recognised as less selfish and more competitive.’

      Can the authors please elaborate why the algorithm proposed to be employed by BPD is more 'entropic', especially given both their self-priors and posteriors about partners' preferences tended to be more precise than the ones used by CON? As far as I understand, there's nothing in the data to suggest BPD predictions should be more uncertain. In fact, this leads me to wonder, similarly to what another reviewer has already suggested, whether BPD participants generate self-referential priors over others in the same way CON participants do, they are just less favourable (i.e., in relation to oneself, but always less prosocial) - I think there is currently no model that would incorporate this possibility? It should at least be possible to explore this by checking if there is any statistical relationship between the estimated θ_ppt^m and 〖p(θ〗_par |D^0).

      Thank you for this opportunity to be clearer in our wording. We belief the reviewer is referring to this line in the discussion: ‘In either case, the algorithm underlying the computational goal for BPD participants is far higher in entropy and emphasises a less stable or reliable process of inference.’

      We note in the revised Figure 2 panel E and in the results that those with BPD under M4 show insertion along absolute reward (they still expect diminished selfishness in others), but neutral priors over relative reward (around 0, suggesting expectations of neither prosocial or competitive tendencies of others). Thus, θ_ppt^m (self preference) and θ_par^m (other preference) are tightly associated for absolute, but not relative reward.

      In our wording, we meant that whether under model M4 or M1, those with BPD either show a neutral prior over relative reward (M4) or a prior with large variance over relative reward (M1), showing expectations of difference between themselves and their partner. In both cases, expectation about a partner’s absolute reward preferences is diminished vs. CON participants. We have strengthened our language in the discussion to clarify this:

      ‘In either case, the algorithm underlying the computational goal for BPD participants is far higher in uncertainty, whether through a neutral central tendency (M4) or large variance (M1) prior over relative reward in phase 2, and emphasises a less certain and reliable expectation about others.’

      To note, social contagion under M3 was highly correlated with contagion under M1 (see Fig S11). This provides some preliminary evidence that trauma impacts beliefs about individualism directly, whereas trauma and persecutory beliefs impact beliefs about prosociality through impaired trait mentalising" - I don't understand what the authors mean by this, can they please elaborate and add some explanation to the main text?

      We have now clarified this in the text:

      ‘Together this provides some evidence that self-reported trauma and self-reported mentalising influence social contagion (Fig S11). Social contagion under M3 was highly correlated with contagion under M1 demonstrating parsimony of outcomes across models (Fig S12).’

      I noted that at least some of the newly added references have not been added to the bibliography (e.g., Hitchcock et al. 2022).

      Thankyou for noticing this omission. We have now ensured all cited works are in the reference list.

      Reviewer 2:

      The paper is not based on specific empirical hypotheses formulated at the outset, but, rather, it uses an exploratory approach. Indeed, the task is not chosen in order to tackle specific empirical hypotheses. This, in my view, is a limitation since the introduction reads a bit vague and it is not always clear which gaps in the literature the paper aims to fill. As a further consequence, it is not always clear how the findings speak to previous theories on the topic.’

      As I wrote in the public review, however, I believe that an important limitation of this work is that it was not based on testing specific empirical hypotheses formulated at the outset, and on selecting the experimental paradigm accordingly. This is a limitation because it is not always clear which gaps in the literature the paper aims to fill. As a consequence, although it has improved substantially compared to the previous version, the introduction remains a bit vague. As a further consequence, it is not always clear how the findings speak to previous theories on the topic. Still, despite this limitation, the paper has many strengths, and I believe it is now ready for publication

      Thank you for this further critique. We appreciate your appraisal that the work has improved substantially and is ready for publication. We nevertheless have opted to clarify our introduction and aprior predictions throughout the manuscript (please see response to Reviewer 1).

      Reviewer 3:

      Although the authors note that their approach makes "clear and transparent a priori predictions," the paper could be improved by providing a clear and consolidated statement of these predictions so that the results could be interpreted vis-a-vis any a priori hypotheses.

      In line with comments from both Reviewer 1 and 2, we have clarified our introduction to make it clear what our aprior predictions and hypotheses are about our core aims and exploratory analyses (see response to Reviewer 1).

      The approach of using a partial correlation network with bootstrapping (and permutation) was interesting, but the logic of the analysis was not clearly stated. In particular, there are large group (Table 1: CON vs. BPD) differences in the measures introduced into this network. As a result, it is hard to understand whether any partial correlations are driven primarily by mean differences in severity (correlations tend to be inflated in extreme groups designs due to the absence of observation in middle of scales forming each bivariate distribution). I would have found these exploratory analyses more revealing if group membership was controlled for.

      Thank you for this chance to be clearer in our methods. We have now written a more direct exposition of this exploratory method:

      ‘Exploratory Network Analysis

      To understand the individual differences of trait attributes (MZQ, RGPTSB, CTQ) with other-to-self information transfer () across the entire sample we performed a network analysis (Borsboom, 2021). Network analysis allows for conditional associations between variables to be estimated; each association is controlled for by all other associations in the network. It also allows for visual inspection of the conditional relationships to get an intuition for how variables are interrelated as a whole (see Fig S11). We implemented network analysis with the bootNet package in r using the ‘estimateNetwork’ function with partial correlations (Epskamp, Borsboom & Fried, 2018). To assess the stability of the partial correlations we further implemented bootstrap resampling with 5000 repetitions using the ‘bootnet’ function. We then additionally shuffled the data and refitted the network 5000 times to determine a p<sub>permuted</sub> value; this indicates the probability that a conditional relationship in the original network was within the null distribution of each conditional relationship. We then performed False Discovery Rate correction on the resulting p-values. We additionally controlled for group status for all variables in a supplementary analysis (Table S4).’

      We have also further corrected for group status and reported these results as a supplementary table, and also within the main text alongside the main results. We have opted to relegate Figure 4 into a supplementary figure to make the text clearer.

      ‘We explored conditional psychometric associations with social contagion under the assumption of M3 for all participants (where everyone is able to be influenced by their partner). We conducted partial correlation analyses to estimate relationships conditional on all other associations and retained all that survived bootstrapping (5000 reps), permutation testing (5000 reps), and subsequent FDR correction. When not controlled for group status, RGPTSB and CTQ scores were both moderately associated with MZQ scores (RGPTSB r = 0.41, 95%CI: 0.23, 0.60, p[fdr]=0.043; CTQ r = 0.354 95%CI: 0.13, 0.56, p[fdr]=0.02). This was not affected by group correction. CTQ scores were moderately and negatively associated with shifts in individualistic reward preferences (; r = -0.25, 95%CI: -0.46, -0.04, p[fdr]=0.03). This was not affected by group correction. MZQ scores were in turn moderately and negatively associated with shifts in prosocial-competitive preferences () between phase 1 and 3 (r = -0.26, 95%CI: -0.46, -0.06, p[fdr]=0.03). This was diminished when controlled for group status (r = 0.13, 95%CI: -0.34, 0.08, p[fdr]=0.20). Together this provides some evidence that self-reported trauma and self-reported mentalising influence social contagion (Fig S11). Social contagion under M3 was highly correlated with contagion under M1 demonstrating parsimony of outcomes across models (Fig S12).’

      Discussion first para: "effected -> affected"

      Thanks for spotting this. We have now changed it.

      Add "s" to "participant: "Notably, despite differing strategies, those with BPD achieved similar accuracy to CON participant."

      We have now changed this.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Measurement of BOLD MR imaging has regularly found regions of the brain that show reliable suppression of BOLD responses during specific experimental testing conditions. These observations are to some degree unexplained, in comparison with more usual association between activation of the BOLD response and excitatory activation of the neurons (most tightly linked to synaptic activity) in the same brain location. This paper finds two patients whose brains were tested with both non-invasive functional MRI and with invasive insertion of electrodes, which allowed the direct recording of neuronal activity. The electrode insertions were made within the fusiform gyrus, which is known to process information about faces, in a clinical search for the sites of intractable epilepsy in each patient. The simple observation is that the electrode location in one patient showed activation of the BOLD response and activation of neuronal firing in response to face stimuli. This is the classical association. The other patient showed an informative and different pattern of responses. In this person, the electrode location showed a suppression of the BOLD response to face stimuli and, most interestingly, an associated suppression of neuronal activity at the electrode site.

      Strengths:

      Whilst these results are not by themselves definitive, they add an important piece of evidence to a long-standing discussion about the origins of the BOLD response. The observation of decreased neuronal activation associated with negative BOLD is interesting because, at various times, exactly the opposite association has been predicted. It has been previously argued that if synaptic mechanisms of neuronal inhibition are responsible for the suppression of neuronal firing, then it would be reasonable

      Weaknesses:

      The chief weakness of the paper is that the results may be unique in a slightly awkward way. The observation of positive BOLD and neuronal activation is made at one brain site in one patient, while the complementary observation of negative BOLD and neuronal suppression actually derives from the other patient. Showing both effects in both patients would make a much stronger paper.

      We thank reviewer #1 for their positive evaluation of our paper. Obviously, we agree with the reviewer that the paper would be much stronger if BOTH effects – spike increase and decrease – would be found in BOTH patients in their corresponding fMRI regions (lateral and medial fusiform gyrus) (also in the same hemisphere). Nevertheless, we clearly acknowledge this limitation in the (revised) version of the manuscript (p.8: Material and Methods section).

      Note that with respect to the fMRI data, our results are not surprising, as we indicate in the manuscript: BOLD increases to faces (relative to nonface objects) are typically found in the LatFG and BOLD decreases in the medialFG (in the revised version, we have added the reference to an early neuroimaging paper that describes this dissociation clearly:

      Pelphrey, K. A., Mack, P. B., Song, A., Güzeldere, G., & McCarthy, G. Faces evoke spatially differentiated patterns of BOLD activation and deactivation. Neuroreport 14, 955–959 (2003).

      This pattern of increase/decrease in fMRI can be appreciated in both patients on Figure 2, although one has to consider both the transverse and coronal slices to appreciate it.

      Regarding electrophysiological data, in the current paper, one could think that P1 shows only increases to faces, and P2 would show only decreases (irrespective of the region). However, that is not the case since 11% of P1’s face-selective units are decreases (89% are increases) and 4% of P2’s face-selective units are increases. This has now been made clearer in the revised manuscript (p.5).

      As the reviewer is certainly aware, the number and positions of the electrodes are based on strict clinical criteria, and we will probably never encounter a situation with two neighboring (macro-micro hybrid electrodes), one with microelectrodes ending up in the lateral MidFG, the other in the medial MidFG, in the same patient. If there is no clinical value for the patient, this cannot be done.

      The only thing we can do is to strengthen these results in the future by collecting data on additional patients with an electrode either in the lateral or the medial FG, together with fMRI. But these are the only two patients we have been able to record so far with electrodes falling unambiguously in such contrasted regions and with large (and comparable) measures.

      While we acknowledge that the results may be unique because of the use of 2 contrasted patients only (and this is why the paper is a short report), the data is compelling in these 2 cases, and we are confident that it will be replicated in larger cohorts in the future.

      Finally, information regarding ethics approval has been provided in the paper.

      Reviewer #2 (Public review):

      Summary:

      This is a short and straightforward paper describing BOLD fMRI and depth electrode measurements from two regions of the fusiform gyrus that show either higher or lower BOLD responses to faces vs. objects (which I will call face-positive and facenegative regions). In these regions, which were studied separately in two patients undergoing epilepsy surgery, spiking activity increased for faces relative to objects in the face-positive region and decreased for faces relative to objects in the face-negative region. Interestingly, about 30% of neurons in the face-negative region did not respond to objects and decreased their responses below baseline in response to faces (absolute suppression).

      Strengths:

      These patient data are valuable, with many recording sessions and neurons from human face-selective regions, and the methods used for comparing face and object responses in both fMRI and electrode recordings were robust and well-established. The finding of absolute suppression could clarify the nature of face selectivity in human fusiform gyrus since previous fMRI studies of the face-negative region could not distinguish whether face < object responses came from absolute suppression, or just relatively lower but still positive responses to faces vs. objects.

      Weaknesses:

      The authors claim that the results tell us about both 1) face-selectivity in the fusiform gyrus, and 2) the physiological basis of the BOLD signal. However, I would like to see more of the data that supports the first claim, and I am not sure the second claim is supported.

      (1) The authors report that ~30% of neurons showed absolute suppression, but those data are not shown separately from the neurons that only show relative reductions. It is difficult to evaluate the absolute suppression claim from the short assertion in the text alone (lines 105-106), although this is a critical claim in the paper.

      We thank reviewer #2 for their positive evaluation of our paper. We understand the reviewer’s point, and we partly agree. Where we respectfully disagree is that the finding of absolute suppression is critical for the claim of the paper: finding an identical contrast between the two regions in terms of RELATIVE increase/decrease of face-selective activity in fMRI and spiking activity is already novel and informative. Where we agree with the reviewer is that the absolute suppression could be more documented: it wasn’t, due to space constraints (brief report). We provide below an example of a neuron showing absolute suppression to faces (P2), as also requested in the recommendations to authors. In the frequency domain, there is only a face-selective response (1.2 Hz and harmonics) but no significant response at 6 Hz (common general visual response). In the time-domain, relative to face onset, the response drops below baseline level. It means that this neuron has baseline (non-periodic) spontaneous spiking activity that is actively suppressed when a face appears.

      Author response image 1.

      (2) I am not sure how much light the results shed on the physiological basis of the BOLD signal. The authors write that the results reveal "that BOLD decreases can be due to relative, but also absolute, spike suppression in the human brain" (line 120). But I think to make this claim, you would need a region that exclusively had neurons showing absolute suppression, not a region with a mix of neurons, some showing absolute suppression and some showing relative suppression, as here. The responses of both groups of neurons contribute to the measured BOLD signal, so it seems impossible to tell from these data how absolute suppression per se drives the BOLD response.

      It is a fact that we find both kinds of responses in the same region. We cannot tell with this technique if neurons showing relative vs. absolute suppression of responses are spatially segregated for instance (e.g., forming two separate sub-regions) or are intermingled. And we cannot tell from our data how absolute suppression per se drives the BOLD response. In our view, this does not diminish the interest and originality of the study, but the statement "that BOLD decreases can be due to relative, but also absolute, spike suppression in the human brain” has been rephrased in the revised manuscript: "that BOLD decreases can be due to relative, or absolute (or a combination of both), spike suppression in the human brain”.

      Reviewer #3 (Public review):

      In this paper the authors conduct two experiments an fMRI experiment and intracranial recordings of neurons in two patients P1 and P2. In both experiments, they employ a SSVEP paradigm in which they show images at a fast rate (e.g. 6Hz) and then they show face images at a slower rate (e.g. 1.2Hz), where the rest of the images are a variety of object images. In the first patient, they record from neurons over a region in the mid fusiform gyrus that is face-selective and in the second patient, they record neurons from a region more medially that is not face selective (it responds more strongly to objects than faces). Results find similar selectivity between the electrophysiology data and the fMRI data in that the location which shows higher fMRI to faces also finds face-selective neurons and the location which finds preference to non faces also shows non face preferring neurons.

      Strengths:

      The data is important in that it shows that there is a relationship between category selectivity measured from electrophysiology data and category-selective from fMRI. The data is unique as it contains a lot of single and multiunit recordings (245 units) from the human fusiform gyrus - which the authors point out - is a humanoid specific gyrus.

      Weaknesses:

      My major concerns are two-fold:

      (i) There is a paucity of data; Thus, more information (results and methods) is warranted; and in particular there is no comparison between the fMRI data and the SEEG data.

      We thank reviewer #3 for their positive evaluation of our paper. If the reviewer means paucity of data presentation, we agree and we provide more presentation below, although the methods and results information appear as complete to us. The comparison between fMRI and SEEG is there, but can only be indirect (i.e., collected at different times and not related on a trial-by-trial basis for instance). In addition, our manuscript aims at providing a short empirical contribution to further our understanding of the relationship between neural responses and BOLD signal, not to provide a model of neurovascular coupling.

      (ii) One main claim of the paper is that there is evidence for suppressed responses to faces in the non-face selective region. That is, the reduction in activation to faces in the non-face selective region is interpreted as a suppression in the neural response and consequently the reduction in fMRI signal is interpreted as suppression. However, the SSVEP paradigm has no baseline (it alternates between faces and objects) and therefore it cannot distinguish between lower firing rate to faces vs suppression of response to faces.

      We understand the concern of the reviewer, but we respectfully disagree that our paradigm cannot distinguish between lower firing rate to faces vs. suppression of response to faces. Indeed, since the stimuli are presented periodically (6 Hz), we can objectively distinguish stimulus-related activity from spontaneous neuronal firing. The baseline corresponds to spikes that are non-periodic, i.e., unrelated to the (common face and object) stimulation. For a subset of neurons, even this non-periodic baseline activity is suppressed, above and beyond the suppression of the 6 Hz response illustrated on Figure 2. We mention it in the manuscript, but we agree that we do not present illustrations of such decrease in the time-domain for SU, which we did not consider as being necessary initially (please see below for such presentation).

      (1) Additional data: the paper has 2 figures: figure 1 which shows the experimental design and figure 2 which presents data, the latter shows one example neuron raster plot from each patient and group average neural data from each patient. In this reader's opinion this is insufficient data to support the conclusions of the paper. The paper will be more impactful if the researchers would report the data more comprehensively.

      We answer to more specific requests for additional evidence below, but the reviewer should be aware that this is a short report, which reaches the word limit. In our view, the group average neural data should be sufficient to support the conclusions, and the example neurons are there for illustration. And while we cannot provide the raster plots for a large number of neurons, the anonymized data is made available at:

      (a) There is no direct comparison between the fMRI data and the SEEG data, except for a comparison of the location of the electrodes relative to the statistical parametric map generated from a contrast (Fig 2a,d). It will be helpful to build a model linking between the neural responses to the voxel response in the same location - i.e., estimate from the electrophysiology data the fMRI data (e.g., Logothetis & Wandell, 2004).

      As mentioned above the comparison between fMRI and SEEG is indirect (i.e., collected at different times and not related on a trial-by-trial basis for instance) and would not allow to make such a model.

      (b) More comprehensive analyses of the SSVEP neural data: It will be helpful to show the results of the frequency analyses of the SSVEP data for all neurons to show that there are significant visual responses and significant face responses. It will be also useful to compare and quantify the magnitude of the face responses compared to the visual responses.

      The data has been analyzed comprehensively, but we would not be able to show all neurons with such significant visual responses and face-selective responses.

      (c) The neuron shown in E shows cyclical responses tied to the onset of the stimuli, is this the visual response?

      Correct, it’s the visual response at 6 Hz.

      If so, why is there an increase in the firing rate of the neuron before the face stimulus is shown in time 0?

      Because the stimulation is continuous. What is displayed at 0 is the onset of the face stimulus, with each face stimulus being preceded by 4 images of nonface objects.

      The neuron's data seems different than the average response across neurons; This raises a concern about interpreting the average response across neurons in panel F which seems different than the single neuron responses

      The reviewer is correct, and we apologize for the confusion. This is because the average data on panel F has been notch-filtered for the 6 Hz (and harmonic responses), as indicated in the methods (p.11): ‘a FFT notch filter (filter width = 0.05 Hz) was then applied on the 70 s single or multi-units time-series to remove the general visual response at 6 Hz and two additional harmonics (i.e., 12 and 18 Hz)’.

      Here is the same data without the notch-filter (the 6Hz periodic response is clearly visible):

      Author response image 2.

      For sake of clarity, we prefer presenting the notch-filtered data in the paper, but the revised version makes it clear in the figure caption that the average data has been notch-filtered.

      (d) Related to (c) it would be useful to show raster plots of all neurons and quantify if the neural responses within a region are homogeneous or heterogeneous. This would add data relating the single neuron response to the population responses measured from fMRI. See also Nir 2009.

      We agree with the reviewer that this is interesting, but again we do not think that it is necessary for the point made in the present paper. Responses in these regions appear rather heterogenous, and we are currently working on a longer paper with additional SEEG data (other patients tested for shorter sessions) to define and quantify the face-selective neurons in the MidFusiform gyrus with this approach (without relating it to the fMRI contrast as reported here).

      (e) When reporting group average data (e.g., Fig 2C,F) it is necessary to show standard deviation of the response across neurons.

      We agree with the reviewer and have modified Figure 2 accordingly in the revised manuscript.

      (f) Is it possible to estimate the latency of the neural responses to face and object images from the phase data? If so, this will add important information on the timing of neural responses in the human fusiform gyrus to face and object images.

      The fast periodic paradigm to measure neural face-selectivity has been used in tens of studies since its original reports:

      In this paradigm, the face-selective response spreads to several harmonics (1.2 Hz, 2.4 Hz, 3.6 Hz, etc.) (which are summed for quantifying the total face-selective amplitude). This is illustrated below by the averaged single units’ SNR spectra across all recording sessions for both participants.

      Author response image 3.

      There is no unique phase-value, each harmonic being associated with a phase-value, so that the timing cannot be unambiguously extracted from phase values. Instead, the onset latency is computed directly from the time-domain responses, which is more straightforward and reliable than using the phase. Note that the present paper is not about the specific time-courses of the different types of neurons, which would require a more comprehensive report, but which is not necessary to support the point made in the present paper about the SEEG-fMRI sign relationship.

      (g) Related to (e) In total the authors recorded data from 245 units (some single units and some multiunits) and they found that both in the face and nonface selective most of the recoded neurons exhibited face -selectivity, which this reader found confusing: They write “ Among all visually responsive neurons, we found a very high proportion of face-selective neurons (p < 0.05) in both activated and deactivated MidFG regions (P1: 98.1%; N = 51/52; P2: 86.6%; N = 110/127)’. Is the face selectivity in P1 an increase in response to faces and P2 a reduction in response to faces or in both it’s an increase in response to faces

      Face-selectivity is defined as a DIFFERENTIAL response to faces compared to objects, not necessarily a larger response to faces. So yes, face-selectivity in P1 is an increase in response to faces and P2 a reduction in response to faces.

      Additional methods

      (a) it is unclear if the SSVEP analyses of neural responses were done on the spikes or the raw electrical signal. If the former, how is the SSVEP frequency analysis done on discrete data like action potentials?

      The FFT is applied directly on spike trains using Matlab’s discrete Fourier Transform function. This function is suitable to be applied to spike trains in the same way as to any sampled digital signal (here, the microwires signal was sampled at 30 kHz, see Methods).

      In complementary analyses, we also attempted to apply the FFT on spike trains that had been temporally smoothed by convolving them with a 20ms square window (Le Cam et al., 2023, cited in the paper ). This did not change the outcome of the frequency analyses in the frequency range we are interested in. We have also added one sentence with information in the methods section about spike detection (p.10).

      (b) it is unclear why the onset time was shifted by 33ms; one can measure the phase of the response relative to the cycle onset and use that to estimate the delay between the onset of a stimulus and the onset of the response. Adding phase information will be useful.

      The onset time was shifted by 33ms because the stimuli are presented with a sinewave contrast modulation (i.e., at 0ms, the stimulus has 0% contrast). 100% contrast is reached at half a stimulation cycle, which is 83.33ms here, but a response is likely triggered before reaching 100% contrast. To estimate the delay between the start of the sinewave (0% contrast) and the triggering of a neural response, we tested 7 SEEG participants with the same images presented in FPVS sequences either as a sinewave contrast (black line) modulation or as a squarewave (i.e. abrupt) contrast modulation (red line). The 33ms value is based on these LFP data obtained in response to such sinewave stimulation and squarewave stimulation of the same paradigm. This delay corresponds to 4 screen refresh frames (120 Hz refresh rate = 8.33ms by frame) and 35% of the full contrast, as illustrated below (please see also Retter, T. L., & Rossion, B. (2016). Uncovering the neural magnitude and spatio-temporal dynamics of natural image categorization in a fast visual stream. Neuropsychologia, 91, 9–28).

      Author response image 4.

      (2) Interpretation of suppression:

      The SSVEP paradigm alternates between 2 conditions: faces and objects and has no baseline; In other words, responses to faces are measured relative to the baseline response to objects so that any region that contains neurons that have a lower firing rate to faces than objects is bound to show a lower response in the SSVEP signal. Therefore, because the experiment does not have a true baseline (e.g. blank screen, with no visual stimulation) this experimental design cannot distinguish between lower firing rate to faces vs suppression of response to faces.

      The strongest evidence put forward for suppression is the response of non-visual neurons that was also reduced when patients looked at faces, but since these are non-visual neurons, it is unclear how to interpret the responses to faces.

      We understand this point, but how does the reviewer know that these are non-visual neurons? Because these neurons are located in the visual cortex, they are likely to be visual neurons that are not responsive to non-face objects. In any case, as the reviewer writes, we think it’s strong evidence for suppression.

      We thank all three reviewers for their positive evaluation of our paper and their constructive comments.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      Zhang et al. addressed the question of whether advantageous and disadvantageous inequality aversion can be vicariously learned and generalized. Using an adapted version of the ultimatum game (UG), in three phases, participants first gave their own preference (baseline phase), then interacted with a "teacher" to learn their preference (learning phase), and finally were tested again on their own (transfer phase). The key measure is whether participants exhibited similar choice preferences (i.e., rejection rate and fairness rating) influenced by the learning phase, by contrasting their transfer phase and baseline phase. Through a series of statistical modeling and computational modeling, the authors reported that both advantageous and disadvantageous inequality aversion can indeed be learned (Study 1), and even be generalised (Study 2).

      Strengths:

      This study is very interesting, it directly adapted the lab's previous work on the observational learning effect on disadvantageous inequality aversion, to test both advantageous and disadvantageous inequality aversion in the current study. Social transmission of action, emotion, and attitude have started to be looked at recently, hence this research is timely. The use of computational modeling is mostly appropriate and motivated. Study 2, which examined the vicarious inequality aversion in conditions where feedback was never provided, is interesting and important to strengthen the reported effects. Both studies have proper justifications to determine the sample size.

      Weaknesses:

      Despite the strengths, a few conceptual aspects and analytical decisions have to be explained, justified, or clarified.

      INTRODUCTION/CONCEPTUALIZATION

      (1) Two terms seem to be interchangeable, which should not, in this work: vicarious/observational learning vs preference learning. For vicarious learning, individuals observe others' actions (and optionally also the corresponding consequence resulting directly from their own actions), whereas, for preference learning, individuals predict, or act on behalf of, the others' actions, and then receive feedback if that prediction is correct or not. For the current work, it seems that the experiment is more about preference learning and prediction, and less so about vicarious learning. The intro and set are heavily around vicarious learning, and later the use of vicarious learning and preference learning is rather mixed in the text. I think either tone down the focus on vicarious learning, or discuss how they are different. Some of the references here may be helpful: (Charpentier et al., Neuron, 2020; Olsson et al., Nature Reviews Neuroscience, 2020; Zhang & Glascher, Science Advances, 2020)

      We are appreciative of the Reviewer for raising this question and providing the reference. In response to this comment we have elected to avoid, in most cases, use of the term ‘vicarious’ and instead focus the paper on learning of others’ preferences (without specific commitment to various/observational learning per se). These changes are reflected throughout all sections of the revised manuscript, and in the revised title. We believe this simplified terminology has improved the clarity of our contribution.

      EXPERIMENTAL DESIGN

      (2) For each offer type, the experiment "added a uniformly distributed noise in the range of (-10 ,10)". I wonder what this looks like? With only integers such as 25:75, or even with decimal points? More importantly, is it possible to have either 70:30 or 90:10 option, after adding the noise, to have generated an 80:20 split shown to the participants? If so, for the analyses later, when participants saw the 80:20 split, which condition did this trial belong to? 70:30 or 90:10? And is such noise added only to the learning phase, or also to the baseline/transfer phases? This requires some clarification.

      We thank the Reviewer for pointing this out. The uniformly distributed noise was added to all three phases to make the proposers’ behavior more realistic. This added noise was rounded to integer numbers, constrained from -9 to 9, which means in both 70:30 and 90:10 offer types, an 80:20 split could not occur. We have made this feature of our design clear in the Method section Line 524 ~ 528:

      “In all task phases, we added uniformly distributed noise to each trial’s offer (ranging from -9 to 9, inclusive, rounding to the nearest integer) such that the random amount added (or subtracted) from the Proposer’s share was subtracted (or added) to the Receiver’s share. We adopted this manipulation to make the proposers’ behavior appear more realistic. The orders of offers participants experienced were fully randomized within each experiment phase. ”

      (3) For the offer conditions (90:10, 70:30, 50:50, 30:70, 10:90) - are they randomized? If so, how is it done? Is it randomized within each participant, and/or also across participants (such that each participant experienced different trial sequences)? This is important, as the order especially for the learning phase can largely impact the preference learning of the participants.

      We agree with the Reviewer the order in which offers are experienced could be very important. The order of the conditions was randomized independently for each participant (i.e. each participant experienced different trial sequences). We made this point clear in the Methods part. Line 527 ~ 528:

      “The orders of offers participants experienced were fully randomized within each experiment phase.”

      STATISTICAL ANALYSIS & COMPUTATIONAL MODELING

      (4) In Study 1 DI offer types (90:10, 70:30), the rejection rate for DI-AI averse looks consistently higher than that for DI averse (ie, the blue line is above the yellow line). Is this significant? If so, how come? Since this is a between-subject design, I would not anticipate such a result (especially for the baseline). Also, for the LME results (eg, Table S3), only interactions were reported but not the main results.

      We thank the Reviewer for pointing out this feature of the results. Prompted by this comment, we compared the baseline rejection rates between two conditions for these two offer types, finding in Experiment 1 that rejection rates in the DI-AI-averse condition were significantly higher than in the DI-averse condition (DI-AI-averse vs. DI-averse; Offer 90:10, β = 0.13, p < 0.001, Offer 70:30, β = 0.09, p < 0.034). We agree with the Reviewer that there should, in principle, be no difference between the experiences of participants in these two conditions is identical in the Baseline phase. However, we did not observe these difference in baseline preferences in Experiment 2 (DI-AI-averse vs. DI-averse; Offer 90:10, β = 0.07, p < 0.100, Offer 70:30, β = 0.05, p < 0.193). On the basis of the inconsistency of this effect across studies we believe this is a spurious difference in preferences stemming from chance.

      Regarding the LME results, the reason why only interaction terms are reported is due to the specification of the model and the rationale for testing.

      Taking the model reported in Table S3 as an example—a logistic model which examines Baseline phase rejection rates as a function of offer level and condition—the between-subject conditions (DI-averse and DI-AI-averse) are represented by dummy-coded variables. Similarly, offer types were also dummy-coded, such that each of the five columns (90:10, 70:30, 50:50, 30:70, and 10:90) correspond corresponded to a particular offer type. This model specification yields ten interaction terms (i.e., fixed effects) of interest—for example, the “DI-averse × Offer 90:10” indicates baseline rejection rates for 90:10 offers in DI-averse condition. Thus, to compare rejection rates across specific offer types, we estimate and report linear contrasts between these resultant terms. We have clarified the nature of these reported tests in our revised Results—for example, line189-190: “linear contrasts; e.g. 90:10 vs 10:90, all Ps<0.001, see Table S3 for logistic regression coefficients for rejection rates).

      Also in response to this comment that and a recommendation from Reviewer 2 (see below), we have revised our supplementary materials to make each model specification clearer as SI line 25:

      RejectionRate ~ 0 + (Disl + Advl):(Offer10 + Offer30 + Offer50 + Offer70 + Offer90) + (1|Subject)”

      (5) I do not particularly find this analysis appealing: "we examined whether participants' changes in rejection rates between Transfer and Baseline, could be explained by the degree to which they vicariously learned, defined as the change in punishment rates between the first and last 5 trials of the Learning phase." Naturally, the participants' behavior in the first 5 trials in the learning phase will be similar to those in the baseline; and their behavior in the last 5 trials in the learning phase would echo those at the transfer phase. I think it would be stronger to link the preference learning results to the change between the baseline and transfer phase, eg, by looking at the difference between alpha (beta) at the end of the learning phase and the initial alpha (beta).

      Thanks for pointing this out. Also, considering the comments from Reviewer 2 concerning the interpretation of this analysis, we have elected to remove this result from our revision.

      (6) I wonder if data from the baseline and transfer phases can also be modeled, using a simple Fehr-Schimdt model. This way, the change in alpha/beta can also be examined between the baseline and transfer phase.

      We agree with the Reviewer that a simplified F-S model could be used, in principle, to characterize Baseline and Transfer phase behavior, but it is our view that the rejection rates provide readers with the clearest (and simplest) picture of how participants are responding to inequity. Put another way, we believe that the added complexity of using (and explaining) a new model to characterize simple, steady-state choice behavior (within these phases) would not be justified or add appreciable insights about participants’ behavior.

      (7) I quite liked Study 2 which tests the generalization effect, and I expected to see an adapted computational modeling to directly reflect this idea. Indeed, the authors wrote, "[...] given that this model [...] assumes the sort of generalization of preferences between offer types [...]". But where exactly did the preference learning model assume the generalization? In the methods, the modeling seems to be only about Study 1; did the authors advise their model to accommodate Study 2? The authors also ran simulation for the learning phase in Study 2 (Figure 6), and how did the preference update (if at all) for offers (90:10 and 10:90) where feedback was not given? Extending/Unpacking the computational modeling results for Study 2 will be very helpful for the paper.

      We are appreciative of the Reviewer’s positive impression of Experiment 2. Upon reflection, we realize that our original submission was not clear about the modeling done in Experiment 2, and we should clarify here that we did also fit the Preference Inference model to this dataset. As in Experiment 1, this model assumes that the participants have a representation of the teacher’s preference as a Fehr-Schmidt form utility function and infer the Teacher’s Envy and Guilt parameters through learning. The model indicates that, on the basis of experience with the Teacher’s preferences on moderately unfair offers (i.e., offer 70:30 and offer 30:70), participants can successfully infer these guess of these two parameters, and in turn, compute Fehr-Schmidt utility to guide their decisions in the extreme unfair offers (i.e., offer 90:10 and offer 10:90).

      In response to this comment, we have made this clearer in our Results (Line 377-382):

      “Finally, following Experiment 1, we fit a series of computational models of Learning phase choice behavior, comparing the goodness-of-fit of the four best-fitting models from Experiment 1 (see Methods). As before, we found that the Preference Inference model provided the best fit of participants’ Learning Phase behavior (Figure S1a, Table S12). Given that this model is able to infer the Teacher’s underlying inequity-averse preferences (rather than learns offer-specific rejection preferences), it is unsurprising that this model best describes the generalization behavior observed in Experiment 2.”

      and in our revised Methods (Line 551-553)

      “We considered 6 computational models of Learning Phase choice behavior, which we fit to individual participants’ observed sequences of choices, in both Experiments 1 and 2, via Maximum Likelihood Estimation”

      Reviewer #2 (Public review):

      Summary:

      This study investigates whether individuals can learn to adopt egalitarian norms that incur a personal monetary cost, such as rejecting offers that benefit them more than the giver (advantageous inequitable offers). While these behaviors are uncommon, two experiments demonstrate that individuals can learn to reject such offers through vicarious learning - by observing and acting in line with a "teacher" who follows these norms. The authors use computational modelling to argue that learners adopt these norms through a sophisticated process, inferring the latent structure of the teacher's preferences, akin to theory of mind.

      Strengths:

      This paper is well-written and tackles a critical topic relevant to social norms, morality, and justice. The findings, which show that individuals can adopt just and fair norms even at a personal cost, are promising. The study is well-situated in the literature, with clever experimental design and a computational approach that may offer insights into latent cognitive processes. Findings have potential implications for policymakers.

      Weaknesses:

      Note: in the text below, the "teacher" will refer to the agent from which a participant presumably receives feedback during the learning phase.

      (1) Focus on Disadvantageous Inequity (DI): A significant portion of the paper focuses on responses to Disadvantageous Inequitable (DI) offers, which is confusing given the study's primary aim is to examine learning in response to Advantageous Inequitable (AI) offers. The inclusion of DI offers is not well-justified and distracts from the main focus. Furthermore, the experimental design seems, in principle, inadequate to test for the learning effects of DI offers. Because both teaching regimes considered were identical for DI offers the paradigm lacks a control condition to test for learning effects related to these offers. I can't see how an increase in rejection of DI offers (e.g., between baseline and generalization) can be interpreted as speaking to learning. There are various other potential reasons for an increase in rejection of DI offers even if individuals learn nothing from learning (e.g. if envy builds up during the experiment as one encounters more instances of disadvantageous fairness).

      We are appreciative of the Reviewer’s insight here and for the opportunity to clarify our experimental logic. We included DI offers in order to 1) expose participants to the full spectrum of offer types, and avoid focusing participants exclusively upon AI offers, which might result in a demand characteristic and 2) to afford exploration of how learning dynamics might differ in DI context s—which was, to some extent, examined in our previous study (FeldmanHall, Otto, & Phelps, 2018)—versus AI contexts. Furthermore, as this work builds critically on our previous study, we reasoned that replicating these original findings (in the DI context) would be important for demonstrating the generality of the learning effects in the DI context across experimental settings. We now remark on this point in our revised Introduction Line 129 ~132:

      “In addition, to mechanistically probe how punitive preferences are acquired in Adv-I and Dis-I contexts—in turn, assessing the replicability of our earlier study investigating punitive preference acquisition in the Dis context—we also characterize trial-by-trial acquisition of punitive behavior with computational models of choice.”

      (2) Statistical Analysis: The analysis of the learning effects of AI offers is not fully convincing. The authors analyse changes in rejection rates within each learning condition rather than directly comparing the two. Finding a significant effect in one condition but not the other does not demonstrate that the learning regime is driving the effect. A direct comparison between conditions is necessary for establishing that there is a causal role for the learning regime.

      We agree with the Reviewer and upon reflection, believe that direct comparisons between conditions would be helpful to support the claim that the different learning conditions are responsible for the observed learning effects. In brief, these specific tests buttress the idea that exposure to AI-averse preferences result in increases in AI punishment rates in the Transfer phase (over and above the rates observed for participants who were only exposed to DI-averse preferences).

      Accordingly, our revision now reports statistics concerning the differences between conditions for AI offers in Experiment 1 (Line 198~ 207):

      “Importantly, when comparing these changes between the two learning conditions, we observed significant differences in rejection rates for Adv-I offers: compared to exposure to a Teacher who rejected only Dis-I offers, participants exposed to a Teacher who rejected both Dis-I and Adv-I offers were more likely to reject Adv-I offers and rated these offers more unfair. This difference between conditions was evident in both 30:70 offers (Rejection rates: β(SE) = 0.10(0.04), p = 0.013; Fairness ratings: β(SE) = -0.86(0.17), p < 0.001) and 10:90 offers (Rejection rates: β(SE) = 0.15(0.04), p < 0.001, Fairness ratings: β(SE) = -1.04(0.17), p < 0.001). As a control, we also compared rejection rates and fairness rating changes between conditions in Dis-I offers (90:10 and 30:70) and Fair offers (i.e., 50:50) but observed no significant difference (all ps > 0.217), suggesting that observing an Adv-I-averse Teacher’s preferences did not influence participants’ behavior in response to Dis-I offers.”

      Line 222 ~ 230:

      “A mixed-effects logistic regression revealed a significant larger (positive) effect of trial number on rejection rates of Adv-I offers for the Adv-Dis-I-Averse condition compared to the Dis-I-Averse condition. This relative rejection rate increase was evident both in 30:70 offers (Table S7; β(SE) = -0.77(0.24), p < 0.001) and in 10:90 offers (β(SE) = -1.10(0.33), p < 0.001). In contrast, comparing Dis-I and Fairness offers when the Teacher showed the same tendency to reject, we found no significant difference between the two conditions (90:10 splits: β(SE)=-0.48(0.21),p=0.593;70:30 splits: β(SE)=-0.01(0.14),p=0.150; 50:50 splits: β(SE)=-0.00(0.21),p=0.086). In other words, participants by and large appeared to adjust their rejection choices in accordance with the Teacher’s feedback in an incremental fashion.”

      And in Experiment 2 Line 333 ~ 345:

      “Similar to what we observed in Experiment 1 (Figure 4a), Compared to the participants in the Dis-I-Averse Condition, participants in the Adv-I-Averse Condition increased their rates of rejection of extreme Adv-I offerers (i.e., 10:90) in the Transfer Phase, relative to the Baseline phase (β(SE) = -0.12(0.04), p < 0.004; Table S9), suggesting that participants’ learned (and adopted) Adv-I-averse preferences, generalized from one specific offer type (30:70) to an offer types for which they received no Teacher feedback (10:90). Examining extreme Dis-I offers where the Teacher exhibited identical preferences across the two learning conditions, we found no difference in the Changes of Rejection Rates from Baseline to Transfer phase between conditions (β(SE) = -0.05(0.04), p < 0.259). Mirroring the observed rejection rates (Figure 4b), relative to the Dis-I-Averse Condition, participants’ fairness ratings for extreme Adv-I offers increased more from the Baseline to Transfer phase in the Adv-Dis-I-Averse Condition than in the Dis-I-Averse condition (β(SE) = -0.97(0.18), p < 0.001), but, importantly, changes in fairness ratings for extreme Dis-I offers did not differ significantly between learning conditions (β(SE) = -0.06(0.18), p < 0.723)”

      Line 361 ~ 368:

      “Examining the time course of rejection rates in Adv-I-contexts during the Learning phase (Figure 5) revealed that participants learned over time to punish mildly unfair 30:70 offers, and these punishment preferences generalized to more extreme offers (10:90). Specifically, compared to the Dis-I-Averse Condition, in the Adv-Dis-I-Averse condition we observed a significant larger trend of increase in rejections rates for 10:90 (Adv-I) offers (Figure 5, β(SE) = -0.81(0.26), p < 0.002 mixed-effects logistic regression, see Table S10). Again, when comparing the rejection rate increase in the extremely Dis-I offers (90:10), we didn’t find significant difference between conditions (β(SE) = -0.25(0.19), p < 0.707).”

      (3) Correlation Between Learning and Contagion Effects:

      The authors argue that correlations between learning effects (changes in rejection rates during the learning phase) and contagion effects (changes between the generalization and baseline phases) support the idea that individuals who are better aligning their preferences with the teacher also give more consideration to the teacher's preferences later during generalization phase. This interpretation is not convincing. Such correlations could emerge even in the absence of learning, driven by temporal trends like increasing guilt or envy (or even by slow temporal fluctuations in these processes) on behalf of self or others. The reason is that the baseline phase is temporally closer to the beginning of the learning phase whereas the generalization phase is temporally closer to the end of the learning phase. Additionally, the interpretation of these effects seems flawed, as changes in rejection rates do not necessarily indicate closer alignment with the teacher's preferences. For example, if the teacher rejects an offer 75% of the time then a positive 5% learning effect may imply better matching the teacher if it reflects an increase in rejection rate from 65% to 70%, but it implies divergence from the teacher if it reflects an increase from 85% to 90%. For similar reasons, it is not clear that the contagion effects reflect how much a teacher's preferences are taken into account during generalization.

      This comment is very similar to a previous comment made by Reviewer 1, who also called into question the interpretability of these correlations. In response to both of these comments we have elected to remove these analyses from our revision.

      (4) Modeling Efforts: The modelling approach is underdeveloped. The identification of the "best model" lacks transparency, as no model-recovery results are provided, and fits for the losing models are not shown, leaving readers in the dark about where these models fail. Moreover, the reinforcement learning (RL) models used are overly simplistic, treating actions as independent when they are likely inversely related (for example, the feedback that the teacher would have rejected an offer provides feedback that rejection is "correct" but also that acceptance is "an error", and the later is not incorporated into the modelling). It is unclear if and to what extent this limits current RL formulations. There are also potentially important missing details about the models. Can the authors justify/explain the reasoning behind including these variants they consider? What are the initial Q-values? If these are not free parameters what are their values?

      We are appreciative of the Reviewer for identifying these potentially unaddressed questions.

      The RL models we consider in the present study are naïve models which, in our previous study (FeldmanHall, Otto, & Phelps, 2018), we found to capture important aspects of learning. While simplistic, we believed these models serve as a reasonable baseline for evaluating more complex models, such as the Preference Inference model. We have made this point more explicit in our revised Introduction, Line 129 ~ 132:

      “In addition, to mechanistically probe how punitive preferences may be acquired in Adv-I and Dis-I contexts—in turn, assessing the replicability of our earlier study investigating punitive preference acquisition in the Dis-I context—we also characterize trial-by-trial acquisition of punitive behavior with computational models of choice.”

      Again, following from our previous modeling of observational learning (FeldmanHall et al., 2018), we believe that the feedback the Teacher provides here is ideally suited to the RL formalism. In particular, when the teacher indicates that the participant’s choice is what they would have preferred, the model receives a reward of ‘1’ (e.g., the participant rejects and the Teacher indicates they would preferred rejection, resulting in a positive prediction error) otherwise, the model receives a reward of ‘0’ (e.g., the participant accepts and the Teacher indicates they would preferred rejection, resulting in a negative prediction error), indicating that the participant did not choose in accordance with the Teacher’s preferences. Through an error driven learning process, these models provide a naïve way of learning to act in accordance with the Teacher’s preferences.

      Regarding the requested model details: When treating the initial values as free parameters (model 5), we set Q(reject, offertype) as free values in [0,1] and Q(accept,offertype) as 0.5. This setting can capture participants' initial tendency to reject or accept offers from this offer type. When the initial values are fixed, for all offer types we set Q(reject, offertype) = Q(accept,offertype) = 0.5. In practice, when the initial values are fixed, setting them to 0.5 or 0 doesn’t make much difference. We have clarified these points in our revised Methods, Line 275 ~ 576:

      “We kept the initial values fixed in this model, that is Q<sub>0</sub>(reject,offertype) =0.5, (offertype ∈ 90:10, 70:30, 50:50, 30:70, 10:90)”

      And Line 582 ~ 584:

      “Formally, this model treats Q<sub>0</sub>(reject,offertype) =0.5, (offertype ∈ 90:10, 70:30, 50:50, 30:70, 10:90) as free parameters with values between 0 and 1.”

      (5) Conceptual Leap in Modeling Interpretation: The distinction between simple RL models and preference-inference models seems to hinge on the ability to generalize learning from one offer to another. Whereas in the RL models learning occurs independently for each offer (hence to cross-offer generalization), preference inference allows for generalization between different offers. However, the paper does not explore RL models that allow generalization based on the similarity of features of the offers (e.g., payment for the receiver, payment for the offer-giver, who benefits more). Such models are more parsimonious and could explain the results without invoking a theory of mind or any modelling of the teacher. In such model versions, a learner learns a functional form that allows to predict the teacher's feedback based on said offer features (e.g., linear or quadratic form). Because feedback for an offer modulates the parameters of this function (feature weights) generalization occurs without necessarily evoking any sophisticated model of the other person. This leaves open the possibility that RL models could perform just as well or even show superiority over the preference learning model, casting doubt on the authors' conclusions. Of note: even the behaviourists knew that as Little Albert was taught to fear rats, this fear generalized to rabbits. This could occur simply because rabbits are somewhat similar to rats. But this doesn't mean little Alfred had a sophisticated model of animals he used to infer how they behave.

      We are appreciative of the Reviewer for their suggestion of an alternative explanation for the observed generalization effects. Our understanding of the suggestion, put simply, put simply, is that an RL model could capture the observed generalization effects if the model were to learn and update a functional form of the Teacher’s rejection preferences using an RL-like algorithm. This idea is similar, conceptually to our account of preference learning whereby the learner has a representation of the teacher’s preferences. In our experiment the offer is in the range of [0-100], the crux of this idea is why the participants should take the functional form (either v-shaped or quadratic) with the minimum at 50. This is important because, at the beginning of the learning phase, the rejection rates are already v-shaped with 50 as its minimum. The participants do not need to adjust the minimum of this functional form. Thus, if we assume that the participants represent the teacher’s rejection rate as a v-shape function with a minimum at [50,50], then this very likely implies that the participants have a representation that the teacher has a preference for fairness. Above all, we agree that with suitable setup of the functional form, one could implement an RL model to capture the generalization effects, without presupposing an internal “model” of the teacher’s preferences.

      However, there is another way of modeling the generalization effect by truly “model-free” similarity-based Reinforcement learning. In this approach, we do not assume any particular functional form of the teacher’s preferences, but rather, assumes that experience acquired in one offer type can be generalized to offers that are close (i.e., similar) to the original offer. Accordingly, we implement this idea using a simple RL model in which the action values for each offer type is updated by a learning rate that is scaled by the distance between that offer and the experienced offer (i.e., the offer that generated the prediction error). This learning rate is governed by a Gaussian distribution, similar to the case in the Gaussian process regression (cf. Chulz, Speekenbrink, & Krause, 2018). The initial value of the ‘Reject’ action, for each offer , is set to a free parameter between 0 and 1, and the initial value for the 'Accept’ action was set to 0.5. The results show that even though this model exhibits the trend of increasing rejection rates observed in the AI-DI punish condition, the initial preferences (i.e., starting point of learning) diverges markedly from the Learning phase behavior we observed in Experiment 1:

      Author response image 1.

      This demonstrated that the participant at least maintains a representation of the teacher’s preference at the beginning. That is, they have prior knowledge about the shape of this preference. We incorporated this property into the model, that is, we considered a new model that assumes v-shaped starting values for rejection with two parameters, alpha and beta, governing the slope of this v-shaped function (this starting value actually mimics the shape of the preference functions of the Fehr-Schmidt model). We found that this new model (which we term the “Model RL Sim Vstart”) provided a satisfactory qualitative fit of the Transfer phase learning curves in Experiment 1 (see below).

      Author response image 2.

      However, we didn’t adopt this model as the best model for the following reasons. First, this model yielded a larger AIC value (indicating worse quantitative fit) compared to our preference Inference model in both Experiments 1 and 2, likely owing to its increased complexity (5 free parameters versus 4 in the Preference Inference model). Accordingly, we believe that inclusion of this model in our revised submission would be more distracting than helpful on account of the added complexity of explaining and justifying these assumptions, and of course its comparatively poor goodness of fit (relative to the preference inference model).

      (6) Limitations of the Preference-Inference Model: The preference-inference model struggles to capture key aspects of the data, such as the increase in rejection rates for 70:30 DI offers during the learning phase (e.g. Figure 3A, AI+DI blue group). This is puzzling.

      Thinking about this I realized the model makes quite strong unintuitive predictions that are not examined. For example, if a subject begins the learning phase rejecting the 70:30 offer more than 50% of the time (meaning the starting guilt parameter is higher than 1.5), then overleaning the tendency to reject will decrease to below 50% (the guilt parameter will be pulled down below 1.5). This is despite the fact the teacher rejects 75% of the offers. In other words, as learning continues learners will diverge from the teacher. On the other hand, if a participant begins learning to tend to accept this offer (guilt < 1.5) then during learning they can increase their rejection rate but never above 50%. Thus one can never fully converge on the teacher. I think this relates to the model's failure in accounting for the pattern mentioned above. I wonder if individuals actually abide by these strict predictions. In any case, these issues raise questions about the validity of the model as a representation of how individuals learn to align with a teacher's preferences (given that the model doesn't really allow for such an alignment).

      In response to this comment we explain our efforts to build a new model that might be able conceptually resolves the issue identified by the Reviewer.

      The key intuition guiding the Preference inference model is a Bayesian account of learning which we aimed to further simplify. In this setting, a Bayesian learner maintains a representation of the teacher’s inequity aversion parameters and updates it according to the teacher’s (observed) behavior. Intuitively, the posterior distribution shifts to the likelihood of the teacher’s action. On this view, when the teacher rejects, for instance, an AI offer, the learner should assign a higher probability to larger values of the Guilt parameter, and in turn the learner should change their posterior estimate to better capture the teacher’s preferences.

      In the current study, we simplified this idea, implementing this sort of learning using incremental “delta rule” updating (e.g. Equation 8 of the main text). Then the key question is to define the “teaching signal”. Assuming that the teacher rejects an offer 70:30, based on Bayesian reasoning, the teacher’s envy parameter (α) is more likely to exceed 1.5 (computed as 30/(50-30), per equation 7) than to be smaller than 1.5. Thus, 1.5, which is then used in equation 8 to update α, can be thought of as a teaching signal. We simply assumed that if the initial estimate is already greater than 1.5, which means the prior is consistent with the likelihood, no updating would occur. This assumption raises the question of how to set the learning rate range. In principle, an envy parameter that is larger than 1.5 should be the target of learning (i.e., the teaching signal), and thus our model definition allows the learning rate to be greater than 1, incorporating this possibility.

      Our simplified preference inference model has already successfully captured some key aspects of the participants’ learning behavior. However, it may fail in the following case: assume that the participant has an initial estimate of 1.51 for the envy parameter (β). Let’s say this corresponds to a rejection rate of 60%. Thus, no matter how many times the teacher rejects the offer 70:30, the participant’s estimate of the envy parameter remains the same, but observing only one offer acceptance would decrease this estimate, and in turn, would decrease the model’s predicted rejection rate. We believe this is the anomalous behavior—in 70:30 offers—identified by the Reviewer which the model does not appear able to recreate participants’ in these offers.

      This issue actually touches the core of our model specification, that is, the choosing of the teaching signal. As we chose 1.5 as the teaching signal—i.e. lower bound on whenever the teacher rejects or accepts an offer of 70:30, a very small deviation of 1.5 would fail one part of updating. One way to mitigate this problem would be to choose a lower bound for α greater than 1.5, such that when the Teacher rejects a 70:30 offer, we assign a number greater than 1.5 (by ‘hard-coding’ this into the model via modification of equation 7). One sensible candidate value could be the middle point between 1.5 and 10 (the maximum value of α per our model definition). Intuitively, the model of this setting could still pull up the value of α to 1.51 when the teacher rejects 70:30, thus alleviating (but not completely eliminating) the anomaly.

      We fitted this modified Preference Inference model to the data from Experiment 1 (see Author response image 3 below) and found that even though this model has a smaller AIC (and thus better quantitative fit than the original Preference Inference model), it still doesn’t fully capture the participants’ behavior for 70:30 offers.

      Author response image 3.

      Accordingly, rather than revising our model to include an unprincipled ‘kludge’ to account for this minor anomaly in the model behavior, we have opted to report our original model in our revision as we still believe it parsimoniously captures our intuitions about preference learning and provides a better fit to the observed behavior than the other RL models considered in the present study.

      Reviewer #1 (Recommendations for the authors):

      (1) I do not particularly prefer the acronyms AI and DI for disadvantageous inequity and advantageous inequity. Although they have been used in the literature, not every single paper uses them. More importantly, AI these days has such a strong meaning of artificial intelligence, so when I was reading this, I'd need to very actively inhibit this interpretation. I believe for the readability for a wider readership of eLife, I would advise not to use AI/DI here, but rather use the full terms.

      We thank the Reviewer for this suggestion. As the full spelling of the two terms are somewhat lengthy, and appear frequently in the figures, we have elected to change the abbreviations for disadvantageous inequity and advantageous inequity to Dis-I and Adv-I, respectively in the main text and the supplementary information. We still use AI/DI in the response letter to make the terminology consistent.

      (2) Do "punishment rate" and "rejection rate" mean the same? If so, it would be helpful to stick with one single term, eg, rejection rate.

      We thank the Reviewer for this suggestion. As these terms have the same meaning, we have opted to use the term “rejection rate” throughout the main text.

      (3) For the linear mixed effect models, were other random effect structures also considered (eg, random slops of experimental conditions)? It might be worth considering a few model specifications and selecting the best one to explain the data.

      Thanks for this comment. Following established best practices (Barr, Levy, Scheepers, & Tily, 2013) we have elected to use a maximal random effects structure, whereby all possible predictor variables in the fixed effects structure also appear in the random effects structure.

      (4) For equation (4), the softmax temperature is denoted as tau, but later in the text, it is called gamma. Please make it consistent.

      We are appreciative of the Reviewer’s attention to detail. We have corrected this error.

      Reviewer #2 (Recommendations for the authors):

      (1) Several Tables in SI are unclear. I wasn't clear if these report raw probabilities of coefficients of mixed models. For any mixed models, it would help to give the model specification (e.g., Walkins form) and explain how variables were coded.

      We are appreciative of the Reviewer’s attention to detail. We have clarified, in the captions accompanying our supplemental regression tables, that these coefficients represent log-odds. Regretfully we are unaware of the “Walkins form” the Reviewer references (even after extensive searching of the scientific literature). However, in our new revision we do include lme4 model syntax in our supplemental information which we believe will be helpful for readers seeking replicate our model specification.

      (2) In one of the models it was said that the guilt and envy parameters were bounded between 0-1 but this doesn't make sense and I think values outside this range were later reported.

      We are again appreciative of the Reviewer’s attention to detail. This was an error we have corrected— the actual range is [0,10].

      (3) It is unclear if the model parameters are recoverable.

      In response to this comment our revision now reports a basic parameter recovery analysis for the winning Preference Inference model. This is reported in our revised Methods:

      “Finally, to verify if the free parameters of the winning model (Preference Inference) are recoverable, we simulated 200 artificial subjects, based on the Learning Phase of Experiment 1, with free parameters randomly chosen (uniformly) from their defined ranges. We then employed the same model-fitting procedure as described above to estimate these parameter value, observing that parameters. We found that all parameters of the model can be recovered (see Figure S2).”

      And scatter plots depicting these simulated (versus recovered) parameters are given in Figure S2 of our revised Supplementary Information:

      (4) I was confused about what Figure S2 shows. The text says this is about correlating contagious effects for different offers but the captions speak about learning effects. This is an important aspect which is unclear.

      We have removed this figure in response to both Reviewers’ comments about the limited insights that can be drawn on the basis of these correlations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The aim of this paper is to develop a simple method to quantify fluctuations in the partitioning of cellular elements. In particular, they propose a flow-cytometry-based method coupled with a simple mathematical theory as an alternative to conventional imaging-based approaches.

      Strengths:

      The approach they develop is simple to understand and its use with flow-cytometry measurements is clearly explained. Understanding how the fluctuations in the cytoplasm partition vary for different kinds of cells is particularly interesting.

      Weaknesses:

      The theory only considers fluctuations due to cellular division events. This seems a large weakness because it is well known that fluctuations in cellular components are largely affected by various intrinsic and extrinsic sources of noise and only under particular conditions does partitioning noise become the dominant source of noise.

      We thank the Reviewer for her/his evaluation of our manuscript. The point raised is indeed a crucial one. In a cell division cycle, there are at least three distinct sources of noise that affect component numbers [1] :

      (1) Gene expression and degradation, which determine component numbers fluctuations during cell growth.

      (2) Variability in cell division time, which depending on the underlying model may or may not be a function of protein level and gene expression.

      (3) Noise in the partitioning/inheritance of components between mother and daughter cells.

      Our approach specifically addresses the latter, with the goal of providing a quantitative measure of this noise source. For this reason, in the present work, we consider homogeneous cancer cell populations that could be considered to be stationary from a population point-of-view. By tracking the time evolution of the distribution of tagged components via live fluorescent markers, we aim at isolating partitioning noise effects. However, as noted by the Reviewer, other sources of noise are present, and depending on the considered system the relative contributions of the different sources may change. Thus, we agree that a quantification of the effect of the various noise sources on the accuracy of our measurements will improve the reliability of our method.

      In this respect, assuming independence between noise sources, we reasoned that variability in cell cycle length would affect the timing of population emergence but not the intrinsic properties of those populations (e.g., Gaussian variance). To test this hypothesis, we conducted a preliminary set of simulations in which cell division times were drawn from an Erlang distribution (mean = 18 h, k=4k = 4k=4). The results, showing the behavior of the mean and variance of the component distributions across generations, are presented in Supplementary Information - Figure 1. Under the assumption of independence between different noise sources, no significant effects were observed even for high asymmetries of the partitioning distribution.

      Next, we quantified the accuracy of our measurements in the presence of cross-talks between the various noise sources.Indeed, cells may adopt different growth and division strategies, which can be grouped into three categories based on what triggers division:

      ● Sizer-like cells divide upon reaching a certain size;

      ● Timer-like cells divide after a fixed time (corresponding to the previously treated case with independent noise);

      ● Adder-like cells divide once their volume has increased by a finite amount.

      A detailed discussion of these strategies, including their mathematical formulation, can be found in [2]. Here we have assumed that cells follow a sizer-like model. In this way, we study a system in which cells with a higher number of components have shorter division times. Hence, older (newer) generations are emptied (populated) starting from higher values.

      As can be observed, higher levels of division asymmetry increase the fluctuations of the system relative to the analytically expected behavior, particularly in later generations.

      The result in Supplementary Information - Figure 3 demonstrates the robustness of our method, as the estimates remain within the pre-established experimental error margin. We have now discussed this aspect both in the main and in the Supplementary Information and thank the Reviewer for pointing it out.

      (1) Soltani, Mohammad, et al. "Intercellular variability in protein levels from stochastic expression and noisy cell cycle processes." PLoS computational biology 12.8 (2016): e1004972.

      (2) Mattia Miotto, Simone Scalise, Marco Leonetti, Giancarlo Ruocco, Giovanna Peruzzi, and Giorgio Gosti. A size-dependent division strategy accounts for leukemia cell size heterogeneity. Communications Physics, 7(1):248, 2024.

      Reviewer #2 (Public review):

      Summary:

      The authors present a combined experimental and theoretical workflow to study partitioning noise arising during cell division. Such quantifications usually require time-lapse experiments, which are limited in throughput. To bypass these limitations, the authors propose to use flow-cytometry measurements instead and analyse them using a theoretical model of partitioning noise. The problem considered by the authors is relevant and the idea to use statistical models in combination with flow cytometry to boost statistical power is elegant. The authors demonstrate their approach using experimental flow cytometry measurements and validate their results using time-lapse microscopy. However, while I appreciate the overall goal and motivation of this work, I was not entirely convinced by the strength of this contribution. The approach focuses on a quite specific case, where the dynamics of the labelled component depend purely on partitioning. As such it seems incompatible with studying the partitioning noise of endogenous components that exhibit production/turnover. The description of the methods was partly hard to follow and should be improved. In addition, I have several technical comments, which I hope will be helpful to the authors.

      We are grateful to the Reviewer for the comments. Indeed, both partitioning and production turnover noise are in general fundamental processes. At present the only way to consider them together are time-consuming and costly transfection/microscopy/tracking experiments. In this work, we aimed at developing a method to effectively pinpoint the first component, i.e. partitioning noise thus we opted to separate the two different noise sources.

      Below, we provided a point-by-point response that we hope will clarify all raised concerns.

      Comments:

      (1) In the theoretical model, copy numbers are considered to be conserved across generations. As a consequence, concentrations will decrease over generations due to dilution. While this consideration seems plausible for the considered experimental system, it seems incompatible with components that exhibit production and turnover dynamics. I am therefore wondering about the applicability/scope of the presented approach and to what extent it can be used to study partitioning noise for endogenous components. As presented, the approach seems to be limited to a fairly small class of experiments/situations.

      We see the Reviewer's point. Indeed, we are proposing a high-throughput and robust procedure to measure the partitioning/inheritance noise of cell components through flow cytometry time courses. By using live-cell staining of cellular compounds, we can track the effect of partitioning noise on fluorescence intensity distribution across successive generations. This specific procedure is purposely optimized to isolate partitioning noise from other sources and, as it is, can not track endogenous components or dyes that require fixation. While this certainly poses limits to the proposed approach, there are numerous contexts in which our methodology could be used to explore the role of asymmetric inheritance. Among others, (i) investigating how specific organelles are differentially partitioned and how this influences cellular behavior could provide deeper insights into fundamental biological processes: asymmetric segregation of organelles is a key factor in cell differentiation, aging, and stress response. During cell division, organelles such as mitochondria, the endoplasmic reticulum, lysosomes, peroxisomes, and centrosomes can be unequally distributed between daughter cells, leading to functional differences that influence their fate. For instance, Kajaitso et al. [1] proposed that asymmetric division of mitochondria in stem cells is associated with the retention of stemness traits in one daughter cell and differentiation in the other. As organisms age, stem cells accumulate damage, and to prevent exhaustion and compromised tissue function, cells may use asymmetric inheritance to segregate older or damaged subcellular components into one daughter cell. (ii) Asymmetric division has also been linked to therapeutic resistance in Cancer Stem Cells [2]. Although the functional consequences are not yet fully determined, the asymmetric inheritance of mitochondria is recognized as playing a pivotal role [3]. Another potential application of our methodology may be (iii) the inheritance of lysosomes, which, together with mitochondria, appears to play a crucial role in determining the fate of human blood stem cells [4]. Furthermore, similar to studies conducted on liquid tumors [5][6], our approach could be extended to investigate cell growth dynamics and the origins of cell size homeostasis in adherent cells [7][8][9]. The aforementioned cases of study can be readily addressed using our approach that in general is applicable whenever live-cell dyes can be used. We have added a discussion of the strengths and limitations of the method in the Discussion section of the revised version of the manuscript

      (1) Katajisto, Pekka, et al. "Asymmetric apportioning of aged mitochondria between daughter cells is required for stemness." Science 348.6232 (2015): 340-343.

      (2) Hitomi, Masahiro, et al. "Asymmetric cell division promotes therapeutic resistance in glioblastoma stem cells." JCI insight 6.3 (2021): e130510.

      (3) García-Heredia, José Manuel, and Amancio Carnero. "Role of mitochondria in cancer stem cell resistance." Cells 9.7 (2020): 1693.

      (4) Loeffler, Dirk, et al. "Asymmetric organelle inheritance predicts human blood stem cell fate." Blood, The Journal of the American Society of Hematology 139.13 (2022): 2011-2023.

      (5) Miotto, Mattia, et al. "Determining cancer cells division strategy." arXiv preprint arXiv:2306.10905 (2023).

      (6) Miotto, Mattia, et al. "A size-dependent division strategy accounts for leukemia cell size heterogeneity." Communications Physics 7.1 (2024): 248.

      (7) Kussell, Edo, and Stanislas Leibler. "Phenotypic diversity, population growth, and information in fluctuating environments." Science 309.5743 (2005): 2075-2078.

      (8) McGranahan, Nicholas, and Charles Swanton. "Clonal heterogeneity and tumor evolution: past, present, and the future." Cell 168.4 (2017): 613-628.

      (9) De Martino, Andrea, Thomas Gueudré, and Mattia Miotto. "Exploration-exploitation tradeoffs dictate the optimal distributions of phenotypes for populations subject to fitness fluctuations." Physical Review E 99.1 (2019): 012417.

      (2) Similar to the previous comment, I am wondering what would happen in situations where the generations could not be as clearly identified as in the presented experimental system (e.g., due to variability in cell-cycle length/stage). In this case, it seems to be challenging to identify generations using a Gaussian Mixture Model. Can the authors comment on how to deal with such situations? In the abstract, the authors motivate their work by arguing that detecting cell divisions from microscopy is difficult, but doesn't their flow cytometry-based approach have a similar problem?

      The point raised is an important one, as it highlights the fundamental role of the gating strategy. The ability to identify the distribution of different generations using the Gaussian Mixture Model (GMM) strongly depends on the degree of overlap between distributions. The more the distributions overlap, the less capable we are of accurately separating them.

      The extent of overlap is influenced by the coefficients of variation (CV) of both the partitioning distribution function and the initial component distribution. Specifically, the component distribution at time t results from the convolution of the component distribution itself at time t−1 and the partitioning distribution function. Therefore, starting with a narrow initial component distribution allows for better separation of the generation peaks. The balance between partitioning asymmetry and the width of the initial component distribution is thus crucial.

      As shown in Supplementary Information - Figure 5, increasing the CV of either distribution reduces the ability to distinguish between different generations.

      However, the variance of the initial distribution cannot be reduced arbitrarily. While selecting a narrow distribution facilitates a better reconstruction of the distributions, it simultaneously limits the number of cells available for the experiment. Therefore, for components exhibiting a high level of asymmetry, further narrowing of the initial distribution becomes experimentally impractical.

      In such cases, an approach previously tested on liquid tumors [1] involves applying the Gaussian Mixture Model (GMM) in two dimensions by co-staining another cellular component with lower division asymmetry.

      Regarding time-lapse fluorescence microscopy, the main challenge lies not in disentangling the interplay of different noise sources, but rather in obtaining sufficient statistical power from experimental data. While microscopy provides detailed insights into the division process and component partitioning, its low throughput limits large-scale statistical analyses. Current segmentation algorithms still perform poorly in crowded environments and with complex cell shapes, requiring a substantial portion of the image analysis pipeline to be performed manually, a process that is time-consuming and difficult to scale. In contrast, our cytometry-based approach bypasses this analysis bottleneck, as it enables a direct population-wide measurement of the system's evolution. We have added a detailed discussion of this argument in the Supplementary Material of the manuscript and added a clarification of the role of the gating strategy in the main text.

      (1) Peruzzi, Giovanna, et al. "Asymmetric binomial statistics explains organelle partitioning variance in cancer cell proliferation." Communications Physics 4.1 (2021): 188.

      (3) I could not find any formal definition of division asymmetry. Since this is the most important quantity of this paper, it should be defined clearly.

      We thank the Reviewer for the note. With division asymmetry we refer to a quantity that reflects how similar two daughter cells are likely to be in terms of inherited components after a division process. We opted to measure it via the coefficient of variation (root squared variance divided by the mean) of the partitioning fraction distribution. We have amended this lack of definition in the reviewed version of the manuscript.

      (4) The description of the model is unclear/imprecise in several parts. For instance, it seems to me that the index "i" does not really refer to a cell in the population, but rather a subpopulation of cells that has undergone a certain number of divisions. Furthermore, why is the argument of Equation 11 suddenly the fraction f as opposed to the component number? I strongly recommend carefully rewriting and streamlining the model description and clearly defining all quantities and how they relate to each other.

      We have amending the text carefully to avoid double naming of variables and clarifying each computation passage. In equation 11 the variable f refers to the fluorescent intensity, but the notation will be changed to increase clarity.

      (5) Similarly, I was not able to follow the logic of Section D. I recommend carefully rewriting this section to make the rationale, logic, and conclusions clear to the reader.

      We have updated the manuscript clarifying the scope of section D and its results. In brief, Section A presents a general model to derive the variance of the partitioning distribution from flow cytometry time-course data without making any assumptions about the shape of the distribution itself. In Section D, our goal is to interpret the origin of asymmetry and propose a possible form for the partitioning distribution. Since the dyes used bind non-specifically to cytoplasmic amines, the tagged proteins are expected to be uniformly distributed throughout the cytoplasm and present in large numbers. Given these assumptions the least complex model for division follows the binomial distribution, with a parameter that measures the bias in the process. Therefore, we performed a similar computation to that in Section A, which allows us to estimate not only the variance but also the degree of biased asymmetry. Finally, we fitted the data to this new model and proposed an experimental interpretation of the results.

      (6) Much theoretical work has been done recently to couple cell-cycle variability to intracellular dynamics. While the authors neglect the latter for simplicity, it would be important to further discuss these approaches and why their simplified model is suitable for their particular experiments.

      We agree with the Reviewer, we have added a discussion on this topic in the Introduction and Discussion sections of the main text.

      (7) In the discussion the authors note that the microscopy-based estimates may lead to an overestimation of the fluctuations due to limited statistics. I could not follow that reasoning. Due to the gating in the flow cytometry measurements, I could imagine that the resulting populations are more stringently selected as compared to microscopy. Could that also be an explanation? More generally, it would be interesting to see how robust the results are in terms of different gating diameters.

      The Reviewer is right on the importance of the sorting procedure. As already discussed in a previous point, the gating strategy we employed plays a fundamental role: it reduces the overlap of fluorescence distributions as generations progress, enables the selection of an initial distribution distinct from the fluorescence background, allowing for longer tracking of proliferation, and synchronizes the initial population. The narrower the initial distribution, the more separated the peaks of different generations will be. However, this also results in a smaller number of cells available for the experiment, requiring a careful balance between precision and experimental feasibility. A similar procedure, although it would certainly limit the estimation error, would be impracticable In the case of microscopy. Indeed, the primary limitation and source of error is the number of recorded events. Our pipeline allowed us to track on the order of hundreds of division dynamics, but the analysis time scales non-linearly with the number of events. Significantly increasing the dataset would have been extremely time-consuming. Reducing the analysis to cells with similar fluorescence, although theoretically true, would have reduced the statistics to a level where the sampling error would drastically dominate the measure. Moreover, different experiments would have been hardly comparable, since different fluorescences could map in equally sized cells. In light of these factors, we expect higher CV for the microscopy measure than for flow cytometry’s ones. In the plots below, we show the behaviour of the mean and the standard deviation of N numbers sampled from a gaussian distribution N(0,1) as a function of the sampling number N. The higher is N the closer the sampled distribution will be to the true one. The region in the hundreds of samples is still very noisy, but to do much better we would have to reach the order of thousands. We have added a discussion on these aspects in the reviewed version of the manuscript, with a deeper description of the importance of the sorting procedure in the Supplementary Material. .

      Author response image 1.

      Standard deviation and mean value of a distribution of points sampled from a Gaussian distribution with mean 0 and standard deviation 1, versus the number of samples, N. Increasing N leads to a closer approximation of the expected values. In orange is highlighted the Microscopy Working Region (Microscopy WR) which corresponds to the number of samples we are able to reach with microscopy experiments. In yellow the region we would have to reach to lower the estimating error, which is although very expensive in terms of analysis time.

      (7) It would be helpful to show flow cytometry plots including the identified subpopulations for all cell lines, currently, they are shown only for HCT116 cells. More generally, very little raw data is shown.

      We have provided the requested plots for the other cell lines together with additional raw data coming from simulations in the Supplementary Material.

      (8) The title of the manuscript could be tailored more to the considered problem. At the moment it is very generic.

      We see the Reviewer point. The proposed title aims at conveying the wide applicability of the presented approach, which ultimately allows for the assessment of the levels of fluctuations in the levels of the cellular components at division. This in turn reflects the asymmetricity in the division.

      Reviewer #1 (Recommendations for the authors):

      (1) I am quite concerned about the fact that the theory only considers fluctuations due to cellular division events since intrinsic and extrinsic noise sources are often dominant. I suggest that the authors simulate a full model of cell growth and division (that accounts for fluctuations in gene expression, cell-cycle dynamics, and cell division to generate a controlled synthetic dataset and then use this as input to their method to understand how robust are their results to the influence of noise sources other than partitioning.

      We thank the reviewer for the suggestions and following his advice we performed two sets of simulations in which we took into account the effect of the other noise sources. A detailed description of the results and the methods has been added to the Supplementary Material, while the topic has also been assessed in the main text. A cell proliferation cycle is affected by different sources of variability: (i) production and degradation processes of molecules; (ii) variability in length of the cell cycle; (iii) partitioning noise, which identifies asymmetric inheritance of components between the two daughter cells. However, the experimental approach and the model have been formulated to specifically address the effects of partitioning noise. Indeed, since we are dealing with components tagged via live fluorescent markers, production of new fluorophores is impossible and can therefore be discarded. Instead, the degradation process is a global effect that influences the behavior of the mean of the distribution in a time-dependent manner. However, by looking at the experimental data in Figure 1 of the main text, no significant depletion of fluorescence is observed, or at least it is hidden by the experimental fluctuations of the measure. Instead, a more careful evaluation has to be done for what concerns fluctuation in cell cycle length. We conducted two sets of simulations. In the first, we assumed the independence between fluctuations in cell cycle length and partitioning noise.

      Cell’s division time was extracted from an Erlang distribution (mean = 18 , k = 4) and the results, showing the behavior of the mean and variance of the component distributions across generations, are presented in Supplementary Information - Figure 1. Under the assumption of independence between different noise sources, no significant effects were observed even for high asymmetries of the partitioning distribution. The second set of simulations considered a situation in which the cell’s components and division time are coupled. We assumed a sizer-like division strategy for which bigger cells have a shorter division time and the results of the simulations are shown in Supplementary Information - Figure 2.

      As can be observed, higher levels of division asymmetry increase the fluctuations of the system relative to the analytically expected behavior, particularly in later generations.

      The result in Supplementary Information - Figure 3 demonstrates the robustness of our method, as the estimates remain within the pre-established experimental error margin. However, a detailed description of this topic has been provided in the Supplementary Information and into the main text.

      (2) I find the use of the Cauchy distribution somewhat odd since this does not have a finite mean or a variance and I suspect it is unlikely this mimics a naturally measurable distribution in their experiments. This should either be justified biologically or else replaced by a more realistic distribution.

      Following the reviewer’s suggestion, we have changed the distribution to Gaussian one.

      (3) There is a large body of literature on gene expression models that incorporate a large amount of detail including cell-cycle dynamics and cell division which are relevant to their discussion but not referenced. I suggest they read the following and see how to incorporate at least some of them in their discussion:

      Frequency domain analysis of fluctuations of mRNA and protein copy numbers within a cell lineage: theory and experimental validation., Physical Review X, 11.2 (2021): 021032.

      Exact solution of stochastic gene expression models with bursting, cell cycle and replication dynamics., Physical Review E, 101.3 (2020): 032403.

      Coupling gene expression dynamics to cell size dynamics and cell cycle events: Exact and approximate solutions of the extended telegraph model., Iscience, 26.1 (2023).

      Models of protein production along the cell cycle: An investigation of possible sources of noise., Plos one, 15.1 (2020): e0226016.

      Sources, propagation and consequences of stochasticity in cellular growth., Nature communications, 9(1), 4528

      Intrinsic and extrinsic noise of gene expression in lineage trees., Scientific Reports, 9.1 (2019): 474.

      We thank the Reviewer for the provided articles. We enlarged both introduction and discussion commenting on them, also in response to the second Reviewer comments.

      Reviewer #2 (Recommendations for the authors):

      (1) Even when it is used only during simulation for the sake of illustration, the Cauchy distribution is a somewhat unfortunate choice as its moments do not exist and hence, the authors' approach would not apply. I would recommend using another distribution instead.

      Following the Reviewer’s suggestion we have changed the distribution to Gaussian ones.

      (2) "cells population" should be "cell population".

      We have amended this mistake in the text.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Reviewer #1:

      Concerns Public Review:

      1)The framing of 'infinite possible types of conflict' feels like a strawman. While they might be true across stimuli (which may motivate a feature-based account of control), the authors explore the interpolation between two stimuli. Instead, this work provides confirmatory evidence that task difficulty is represented parametrically (e.g., consistent with literatures like n-back, multiple object tracking, and random dot motion). This parametric encoding is standard in feature-based attention, and it's not clear what the cognitive map framing is contributing.

      Suggestion:

      1) 'infinite combinations'. I'm frankly confused by the authors response. I don't feel like the framing has changed very much, besides a few minor replacements. Previous work in MSIT (e.g., by the author Zhongzheng Fu) has looked at whether conflict levels are represented similarly across conflict types using multivariate analyses. In the paper mentioned by Ritz & Shenhav (2023), the authors looked at whether conflict levels are represented similarly across conflict types using multivariate analyses. It's not clear what this paper contributes theoretically beyond the connections to cognitive maps, which feel like an interpretative framework rather than a testable hypothesis (i.e., these previous paper could have framed their work as cognitive maps).

      Response: We acknowledge the limitations inherent in our experimental design, which prevents us from conducting a strict test of the cognitive space view. In our previous revision, we took steps to soften our conclusions and emphasize these limitations. However, we still believe that our study offers valuable and novel insights into the cognitive space, and the tests we conducted are not merely strawman arguments.

      Specifically, our study aimed to investigate the fundamental principles of the cognitive space view, as we stated in our manuscript that “the representations of different abstract information are organized continuously and the representational geometry in the cognitive space is determined by the similarity among the represented information (Bellmund et al., 2018)”. While previous research has applied multivariate analyses to understand cognitive control representation, no prior studies had directedly tested the two key hypotheses associated with cognitive space: (1) that cognitive control representation across conflict types is continuous, and (2) that the similarity among representations of different conflict types is determined by their external similarity.

      Our study makes a unique contribute by directly testing these properties through a parametric manipulation of different conflict types. This approach differs significantly from previous studies in two ways. First, our parametric manipulation involves more than two levels of conflict similarity, enabling us to directly test the two critical hypotheses mentioned above. Unlike studies such as Fu et al. (2022) and other that have treated different conflict types categorically, we introduced a gradient change in conflict similarity. This differentiation allowed us to employ representational similarity analysis (RSA) over the conflict similarity, which goes beyond mere decoding as utilized in prior work (see more explanation below for the difference between Fu et al., 2022 and our study [1]).

      Second, our parametric manipulation of conflict types differs from previous studies that have manipulated task difficulty, and the modulation of multivariate pattern similarity observed in our study could not be attributed by task difficulty. Previous research, including the Ritz & Shenhav (2023) (see below explanation[2]), has primarily shown that task difficulty modulates univoxel brain activation. A recent work by Wen & Egner (2023) reported a gradual change in the multivariate pattern of brain activations across a wide range of frontoparietal areas, supporting the reviewer’s idea that “task difficulty is represented parametrically”. However, we do not believe that our results reflect the task difficulty representation. For instance, in our study, the spatial Stroop-only and Simon-only conditions exhibited similar levels of difficulty, as indicated by their relatively comparable congruency effects (Fig. S1). Despite this similarity in difficulty, we found that the representational similarity between these two conditions was the lowest (see revised Fig. S4, the most off-diagonal value). This observation aligns more closely with our hypothesis that these two conditions are most dissimilar in terms of their conflict types.

      [1] Fu et al. (2022) offers important insights into the geometry of cognitive space for conflict processing. They demonstrated that Simon and flanker conflicts could be distinguished by a decoder that leverages the representational geometry within a multidimensional space. However, their model of cognitive space primarily relies on categorical definitions of conflict types (i.e., Simon versus flanker), rather than exploring a parametric manipulation of these conflict types. The categorical manipulations make it difficult to quantify conceptual similarity between conflict types and hence limit the ability to test whether neural representations of conflict capture conceptual similarity. To the best of our knowledge, no previous studies have manipulated the conflict types parametrically. This gap highlights a broader challenge within cognitive science: effectively manipulating and measuring similarity levels for conflicts, as well as other high-level cognitive processes, which are inherently abstract. We therefore believe our parametric manipulation of conflict types, despite its inevitable limitations, is an important contribution to the literature.

      We have incorporated the above statements into our revised manuscript: Methodological implications. Previous studies with mixed conflicts have applied mainly categorical manipulations of conflict types, such as the multi-source interference task (Fu et al., 2022) and color Stroop-Simon task (Liu et al., 2010). The categorical manipulations make it difficult to quantify conceptual similarity between conflict types and hence limit the ability to test whether neural representations of conflict capture conceptual similarity. To the best of our knowledge, no previous studies have manipulated the conflict types parametrically. This gap highlights a broader challenge within cognitive science: effectively manipulating and measuring similarity levels for conflicts, as well as other high-level cognitive processes, which are inherently abstract. The use of an experimental paradigm that permits parametric manipulation of conflict similarity provides a way to systematically investigate the organization of cognitive control, as well as its influence on adaptive behaviors.

      [2] The work by Ritz & Shenhav (2023) indeed applied multivariate analyses, but they did not test the representational similarity across different levels of task difficulty in a similar way as our investigation into different levels of conflict types, neither did they manipulated conflict types as our study. They first estimated univariate brain activations that were parametrically scaled by task difficulty (e.g., target coherence), yielding one map of parameter estimates (i.e., encoding subspace) for each of the target coherence and distractor congruence. The multivoxel patterns from the above maps were correlated to test whether the target coherence and distractor congruence share the similar neural encoding. It is noteworthy that the encoding of task difficulty in their study is estimated at the univariate level, like the univariate parametric modulation analysis in our study. The representational similarity across target coherence and distractor congruence was the second-order test and did not reflect the similarity across different difficulty levels. Though, we have found another study (Wen & Egner, 2023) that has directly tested the representational similarity across different levels of task difficulty, and they observed a higher representational similarity between conditions with similar difficulty levels within a wide range of brain regions.

      Reference:

      Wen, T., & Egner, T. (2023). Context-independent scaling of neural responses to task difficulty in the multiple-demand network. Cerebral Cortex, 33(10), 6013-6027. https://doi.org/10.1093/cercor/bhac479

      Fu, Z., Beam, D., Chung, J. M., Reed, C. M., Mamelak, A. N., Adolphs, R., & Rutishauser, U. (2022). The geometry of domain-general performance monitoring in the human medial frontal cortex. Science (New York, N.Y.), 376(6593), eabm9922. https://doi.org/10.1126/science.abm9922

      Ritz, H., & Shenhav, A. (2023). Orthogonal neural encoding of targets and distractors supports multivariate cognitive control. https://doi.org/10.1101/2022.12.01.518771 Another issue is suggesting mixtures between two types of conflict may be many independent sources of conflict. Again, this feels like the strawman. There's a difference between infinite combinations of stimuli on the one hand, and levels of feature on the other hand. The issue of infinite stimuli is why people have proposed feature-based accounts, which are often parametric, eg color, size, orientation, spatial frequency. Mixing two forms of conflict is interesting, but the task limitations (i.e., highly correlated features) prevent an analysis of whether these are truly mixed (or eg reflect variations on just one of the conflict types). Without being able to compare a mixture between types vs levels of only one type, it's not clear what you can draw from these results re: how these are combined (and not clear how it reconciles the debate between general and specific).

      Response: As the reviewer pointed out, a feature (or a parameterization) is an efficient way to encode potentially infinite stimuli. This is the same idea as our hypothesis: different conflict types are represented in a cognitive space akin to concrete features such as a color spectrum. This concept can be illustrated in the figure below.

      Author response image 1.

      We would like to clarify that in our study we have manipulated five levels of conflict types, but they all originated from two fundamental sources: vertically spatial Stroop and horizontally Simon conflicts. We agree that the mixture of these two sources does not inherently generate additional conflict sources. However, this mixture does influence the similarity among different conflict conditions, which provides essential variability that is crucial for testing the core hypotheses (i.e., continuity and similarity modulation, see the response above) of the cognitive space view. This clarification is crucial as the reviewer’s impression might have been influenced by our introduction, where we repeatedly emphasized multiple sources of conflicts. Our aim in the introduction was to outline a broader conceptual framework, which might not directly reflect the specific design of our current study. Recognizing the possibility of misinterpretation, we have adjusted our introduction and discussion to place less emphasis on the variety of possible conflict sources. For example, we have removed the expression “The large variety of conflict sources implies that there may be innumerable number of conflict conditions” from the introduction. As we have addressed in the previous response, the observed conflict similarity effect could not be attributed to merely task difficulty. Similarly, the mixture of spatial Stroop and Simon conflicts should not be attributed to one conflict source only; doing so would oversimplify it to an issue of task difficulty, as it would imply that our manipulation of conflict types merely represented varying levels of a single conflict, akin to manipulating task difficulty when everything else being equal. Importantly, the mixed conditions differ from variations along a single conflict source in that they also incorporate components of the other conflict source, thereby introducing difference beyond that would be found within variances of a single conflict source. There are a few additional evidence challenging the single dimension assumption. In our previous revisions, we compared model fittings between the Cognitive-Space model and the Stroop-/Simon-only models, and results showed that the CognitiveSpace model (BIC = 5377093) outperformed the Stroop-Only (BIC = 5377122) and Simon-Only (BIC = 5377096) models. This suggests that mixed conflicts might not be solely reflective of either Stroop or Simon sources, although we did not include these results due to concerns raised by reviewers about the validity of such comparisons, given the high anticorrelation between the two dimensions. Furthermore, Fu et al. (2022) demonstrated that the mixture of Simon and Flanker conflicts (the sf condition) is represented as the vector sum of the Flanker and Simon dimensions within their space model, indicating a compositional nature. Similarly, our mixed conditions are combinations of Stroop and Simon conflicts, and it is plausible that these mixtures represent a fusion of both Stroop and Simon components, rather than just one. Thus, we disagree that the mixture of conflicts is a strawman. In response to this concern, we have included a statement in our limitation section: “Another limitation is that in our design, the spatial Stroop and Simon effects are highly anticorrelated. This constraint may make the five conflict types represented in a unidimensional space (e.g., a circle) embedded in a 2D space. This limitation also means we cannot conclusively rule out the possibility of a real unidimensional space driven solely by spatial Stroop or Simon conflicts. However, this appears unlikely, as it would imply that our manipulation of conflict types merely represented varying levels of a single conflict, akin to manipulating task difficulty when everything else being equal. If task difficulty were the primary variable, we would expect to see greater representational similarity between task conditions of similar difficulty, such as the Stroop and Simon conditions, which demonstrates comparable congruency effects (see Fig. S1). Contrary to this, our findings reveal that the Stroop-only and Simon-only conditions exhibit the lowest representational similarity (Fig. S4). Furthermore, Fu et al. (2022) has shown that the representation of mixtures of Simon and Flanker conflicts was compositional, rather than reflecting single dimension, which also applies to our cases.”

      My recommendation would be to dramatically rewrite to reduce the framing of this providing critical evidence in favor of cognitive maps, and being more overt about the limitations of this task. However, the authors are not required to make further revisions in eLife's new model, and it's not clear how my scores would change if they made those revisions (ie the conceptual limitations would remain, the claims would just now match the more limited scope).

      Response: With the above rationales and the adjustments we have made in the manuscripts, we believe that we have thoroughly acknowledged and articulated the limitations of our study. Therefore, we have decided against a complete rewrite of the manuscript.

      Public Review:

      2) The representations within DLPFC appear to treat 100% Stoop and (to a lesser extent) 100% Simon differently than mixed trials. Within mixed trials, the RDM within this region don't strongly match the predictions of the conflict similarity model. It appears that there may be a more complex relationship encoded in this region.

      Suggestion:

      2) RSMs in the key region of interest. I don't really understand the authors response here either. e.g,. 'It is essential to clarify that our conclusions were based on the significant similarity modulation effect identified in our statistical analysis using the cosine similarity model, where we did not distinguish between the within-Stroop condition and the other four within-conflict conditions (Fig. 7A, now Fig. 8A). This means that the representation of conflict type was not biased by the seemingly disparities in the values shown here'. In Figure 1C, it does look like they are testing this model.

      It seems like a stronger validation would test just the mixture trials (i.e., ignoring Simon-only and stroop-only). However, simon/stroop-only conditions being qualitatively different does beg the question of whether these are being represented parametrically vs categorically.

      Response: We apologize for the confusion caused by our previous response. To clarify, our conclusions have been drawn based on the robust conflict similarity effect.

      The conflict similarity regressor is defined by higher values in the diagonal cells (representing within-conflict similarity) than the off-diagonal cells (indicating between-conflict similarity), as illustrated in Fig. 1C and Fig. 8A (now Fig. 4B). It is important to note that this regressor may not be particularly sensitive to the variations within the diagonal cells. Our previous response aimed to emphasize that the inconsistencies observed along the diagonal do not contradict our core hypothesis regarding the conflict similarity effect.

      We recognized that since the visualization in Fig. S4, based on the raw RSM (i.e., Pearson correlation), may have been influenced by other regressors in our model than the conflict similarity effect. To reflect pattern similarity with confounding factors controlled for, we have visualized the RSM by including only the fixed effect of the conflict similarity and the residual while excluding all other factors. As shown in the revised Figure S4, the difference between the within-Stroop and other diagonal cells was greatly reduced. Instead, it revealed a clear pattern where that the diagonal values were higher than the off-diagonal values in the incongruent condition, aligning with our hypothesis regarding the conflict similarity modulator. Although some visual distinctions persist within the five diagonal cells (e.g., in the incongruent condition, the Stroop, Simon, and StMSmM conditions appear slightly lower than StHSmL and StLSmM conditions), follow-up one-way ANOVAs among these five diagonal conditions showed no significant differences. This held true for both incongruent and congruent conditions, with Fs < 1. Thus, we conclude that there is no strong evidence supporting the notion that Simon- and spatial Stroop-only conditions are systematically different from other conflict types. As a result, we decided not to exclude these two conflict types from analysis.

      Author response image 2.

      The stronger conflict type similarity effect in incongruent versus congruent conditions. Shown are the summary representational similarity matrices for the right 8C region in incongruent (left) and congruent (right) conditions, respectively. Each cell represents the averaged Pearson correlation (after regressing out all factors except the conflict similarity) of cells with the same conflict type and congruency in the 1400×1400 matrix. Note that the seemingly disparities in the values of withinconflict cells (i.e., the diagonal) did not reach significance for either incongruent or congruent trials, Fs < 1.

      Public Review:

      3) To orthogonalized their variables, the authors need to employ a complex linear mixed effects analysis, with a potential influence of implementation details (e.g., high-level interactions and inflated degrees of freedom).

      Suggestion:

      3) The DF for a mixed model should not be the number of observations minus the number of fixed effects. The gold standard is to use satterthwaite correction (e.g. in Matlab, fixedEffects(lme,'DFMethod','satterthwaite')), or number of subjects - number of fixed effects (i.e. you want to generalize to new subjects, not just new samples from the same subjects). Honestly, running a 4-way interaction probably is probably using more degrees of freedom than are appropriate given the number of subjects.

      Response: We concur with the reviewer’s comment that our previous estimation of degrees of freedom (DFs) was inaccurate. Following your suggestion, we have now applied the “Satterthwaite” approach to approximate the DFs for all our linear mixed effect model analyses. This adjustment has led to the correction of both DFs and p values. In the Methods section, we have mentioned this revision.

      “We adjusted the t and p values with the degrees of freedom calculated through the Satterthwaite approximation method (Satterthwaite, 1946). Of note, this approach was applied to all the mixed-effect model analyses in this study.”

      The application of this method has indeed resulted in a reduction of our statistical significance. However, our overall conclusions remained robust. Instead of the highly stringent threshold used in our previous version (Bonferonni corrected p < .0001), we have now adopted a relatively more lenient threshold of Bonferonni correction at p < 0.05, which is commonly employed in the literature. Furthermore, it is worth noting that the follow-up criteria 2 and 3 are inherently second-order analyses. Criterion 2 involves examining the interaction effect (conflict similarity effect difference between incongruent and congruent conditions), and criterion 3 involves individual correlation analyses. Due to their second-order nature, these criteria inherently have lower statistical power compared to criterion 1 (Blake & Gangestad, 2020). We thus have applied a more lenient but still typically acceptable false discovery rate (FDR) correction to criteria 2 and 3. This adjustment helps maintain the rigor of our analysis while considering the inherent differences in statistical power across the various criteria. We have mentioned this revision in our manuscript:

      “We next tested whether these regions were related to cognitive control by comparing the strength of conflict similarity effect between incongruent and congruent conditions (criterion 2) and correlating the strength to behavioral similarity modulation effect (criterion 3). Given these two criteria pertain to second-order analyses (interaction or individual analyses) and thus might have lower statistical power (Blake & Gangestad, 2020), we applied a more lenient threshold using false discovery rate (FDR) correction (Benjamini & Hochberg, 1995) on the above-mentioned regions.”

      With these adjustments, we consistently identified similar brain regions as observed in our previous version. Specifically, we found that only the right 8C region met the three criteria in the conflict similarity analysis. In addition, the regions meeting the criteria for the orientation effect included the FEF and IP2 in left hemisphere, and V1, V2, POS1, and PF in the right hemisphere. We have thoroughly revised the description of our results, updated the figures and tables in both the revised manuscript and supplementary material to accurately reflect these outcomes.

      Reference:

      Blake, K. R., & Gangestad, S. (2020). On Attenuated Interactions, Measurement Error, and Statistical Power: Guidelines for Social and Personality Psychologists. Pers Soc Psychol Bull, 46(12), 1702-1711. https://doi.org/10.1177/0146167220913363

      Minor:

      1. Figure 8 should come much earlier (e.g, incorporated into Figure 1), and there should be consistent terms for 'cognitive map' and 'conflict similarity'.

      Response: We appreciate this suggestion. Considering that Figure 7 (“The crosssubject RSA model and the rationale”) also describes the models, we have merged Figure 7 and 8 and moved the new figure ahead, before we report the RSA results. Now you could find it in the new Figure 4, see below. We did not incorporate them into Figure 1 since Figure 1 is already too crowded.

      Author response image 3.

      Fig. 4. Rationale of the cross-subject RSA model and the schematic of key RSMs. A) The RSM is calculated as the Pearson’s correlation between each pair of conditions across the 35 subjects. For 17 subjects, the stimuli were displayed on the top-left and bottom-right quadrants, and they were asked to respond with left hand to the upward arrow and right hand to the downward arrow. For the other 18 subjects, the stimuli were displayed on the top-right and bottom-left quadrants, and they were asked to respond with left hand to the downward arrow and right hand to the upward arrow. Within each subject, the conflict type and orientation regressors were perfectly covaried. For instance, the same conflict type will always be on the same orientation. To de-correlate conflict type and orientation effects, we conducted the RSA across subjects from different groups. For example, the bottom-right panel highlights the example conditions that are orthogonal to each other on the orientation, response, and Simon distractor, whereas their conflict type, target and spatial Stroop distractor are the same. The dashed boxes show the possible target locations for different conditions. (B) and (C) show the orthogonality between conflict similarity and orientation RSMs. The within-subject RSMs (e.g., Group1-Group1) for conflict similarity and orientation are all the same, but the cross-group correlations (e.g., Group2-Group1) are different. Therefore, we can separate the contribution of these two effects when including them as different regressors in the same linear regression model. (D) and (E) show the two alternative models. Like the cosine model (B), within-group trial pairs resemble betweengroup trial pairs in these two models. The domain-specific model is an identity matrix. The domaingeneral model is estimated from the absolute difference of behavioral congruency effect, but scaled to 0 (lowest similarity) – 1 (highest similarity) to aid comparison. The plotted matrices in B-E include only one subject each from Group 1 and Group 2. Numbers 1-5 indicate the conflict type conditions, for spatial Stroop, StHSmL, StMSmM, StLSmH, and Simon, respectively. The thin lines separate four different sub-conditions, i.e., target arrow (up, down) × congruency (incongruent, congruent), within each conflict type.

      In our manuscript, the term “cognitive map/space” was used when explaining the results in a theoretical perspective, whereas the “conflict similarity” was used to describe the regressor within the RSA. These terms serve distinct purposes in our study and cannot be interchangeably substituted. Therefore, we have retained them in their current format. However, we recognize that the initial introduction of the “Cognitive-Space model” may have appeared somewhat abrupt. To address this, we have included a brief explanatory note: “The model described above employs the cosine similarity measure to define conflict similarity and will be referred to as the Cognitive-Space model.”

    2. Author Response

      The following is the authors’ response to the previous reviews.

      Thank you and the reviewers for further providing constructive comments and suggestions on our manuscript. On behalf of all the co-authors, I have enclosed a revised version of the above referenced paper. Below, I have merged similar public reviews and recommendations (if applicable) from each reviewer and provided point-by-point responses.

      Reviewer #1:

      People can perform a wide variety of different tasks, and a long-standing question in cognitive neuroscience is how the properties of different tasks are represented in the brain. The authors develop an interesting task that mixes two different sources of difficulty, and find that the brain appears to represent this mixture on a continuum, in the prefrontal areas involved in resolving task difficulty. While these results are interesting and in several ways compelling, they overlap with previous findings and rely on novel statistical analyses that may require further validation.

      Strengths

      1. The authors present an interesting and novel task for combining the contributions of stimulus-stimulus and stimulus-response conflict. While this mixture has been measured in the multi-source interference task (MSIT), this task provides a more graded mixture between these two sources of difficulty.

      2. The authors do a good job triangulating regions that encoding conflict similarity, looking for the conjunction across several different measures of conflict encoding. These conflict measures use several best-practice approaches towards estimating representational similarity.

      3. The authors quantify several salient alternative hypothesis and systematically distinguish their core results from these alternatives.

      4. The question that the authors tackle is important to cognitive control, and they make a solid contribution.

      The authors have addressed several of my concerns. I appreciate the authors implementing best practices in their neuroimaging stats.

      I think that the concerns that remain in my public review reflect the inherent limitations of the current work. The authors have done a good job working with the dataset they've collected.

      Response: We would like to thank the reviewer for the positive evaluation of our manuscript and the constructive comments and suggestions. In response to your suggestions and concerns, we have removed the Stroop/Simon-only and the Stroop+Simon models, revised our conclusion and modified the misleading phrases.

      We have provided detailed responses to your comments below.

      1. The evidence from this previous work for mixtures between different conflict sources makes the framing of 'infinite possible types of conflict' feel like a strawman. The authors cite classic work (e.g., Kornblum et al., 1990) that develops a typology for conflict which is far from infinite. I think few people would argue that every possible source and level of difficulty will have to be learned separately. This work provides confirmatory evidence that task difficulty is represented parametrically (e.g., consistent with the n-back, MOT, and random dot motion literature).

      notes for my public concerns.

      In their response, the authors say:

      'If each combination of the Stroop-Simon combination is regarded as a conflict condition, there would be infinite combinations, and it is our major goal to investigate how these infinite conflict conditions are represented effectively in a space with finite dimensions.'

      I do think that this is a strawman. The paper doesn't make a strong case that this position ('infinite combinations') is widely held in the field. There is previous work (e.g., n-back, multiple object tracking, MSIT, dot motion) that has already shown parametric encoding of task difficulty. This paper provides confirmatory evidence, using an interesting new task, that demand are parametric, but does not provide a major theoretical advance.

      Response: We agree that the previous expression may have seemed somewhat exaggerative. While it is not “infinite”, recent research indeed suggests that the cognitive control shows domain-specificity across various “domains”, including conflict types (Egner, 2008), sensory modalities (Yang et al., 2017), task-irrelevant stimuli (Spape et al., 2008), and task sets (Hazeltine et al., 2011), to name a few.

      These findings collectively support the notion that cognitive control is contextspecific (Bream et al., 2014). That is, cognitive control can be tuned and associated with different (and potentially large numbers of) contexts. Recently, Kikumoto and Mayr (2020) demonstrated that combinations of stimulus, rule and response in the same task formed separatable, conjunctive representations. They further showed that these conjunctive representations facilitate performance. This is in line with the idea that each stimulus-location combination in the present task may be represented separately in a domain-specific manner. Moreover, domain-general task representation can also become domain-specific with learning, which further increases the number of domain-specific conjunctive representations (Mill et al., 2023). In line with the domain-specific account of cognitive control, we referred to the “infinite combinations” in our previous response to emphasize the extreme case of domainspecificity. However, recognizing that the term “infinite” may lead to ambiguity, we have replaced it with phrases such as “a large number of”, “hugely varied”, in our revised manuscript.

      We appreciate the reviewer for highlighting the potential connection of our work to existing literature that showed the parametric encoding of task difficulty (e.g., Dagher et al., 1999; Ritz & Shenhav, 2023). For instance, in Ritz et al.’s (2023) study, they parametrically manipulated target difficulty based on consistent ratios of dot color, and found that the difficulty was encoded in the caudal part of dorsal anterior cingulate cortex. Analogically, in our study, the “difficulty” pertains to the behavioral congruency effect that we modulated within the spatial Stroop and Simon dimensions. Notably, we did identify univariate effects in the right dmPFC and IPS associated with the difficulty in the Simon dimension. This parametric effect may lend support to our cognitive space hypothesis, although we exercised caution in interpreting their significance due to the absence of a clear brain-behavioral relevance in these regions. We have added the connection of our work to prior literature in the discussion. The parametric encoding of conflict also mirrors prior research showing the parametric encoding of task demands (Dagher et al., 1999; Ritz & Shenhav, 2023).

      However, our analyses extend beyond solely testing the parametric encoding of difficulty. Instead, we focused on the multivariate representation of different conflict types, which we believe is independent from the univariate parametric encoding. Unlike the univariate encoding that relies on the strength within one dimension, the multivariate representation of conflict types incorporates both the spatial Stroop and Simon dimensions. Furthermore, we found that similar difficulty levels did not yield similar conflict representation, as indicated by the low similarity between the spatial Stroop and Simon conditions, despite both showing a similar level of congruency effect (Fig. S1). Additionally, we also observed an interaction between conflict similarity and difficulty (i.e., congruency, Fig. 4B/D), such that the conflict similarity effect was more pronounced when conflict was present. Therefore, we believe that our findings make contribution to the literature beyond the difficulty effect.

      Reference:

      Egner, T. (2008). Multiple conflict-driven control mechanisms in the human brain. Trends in Cognitive Sciences, 12(10), 374-380. https://doi.org/10.1016/j.tics.2008.07.001

      Yang, G., Nan, W., Zheng, Y., Wu, H., Li, Q., & Liu, X. (2017). Distinct cognitive control mechanisms as revealed by modality-specific conflict adaptation effects. Journal of Experimental Psychology: Human Perception and Performance, 43(4), 807-818. https://doi.org/10.1037/xhp0000351

      Spapé MM, Hommel B (2008). He said, she said: episodic retrieval induces conflict adaptation in an auditory Stroop task. Psychonomic Bulletin Review,15(6):1117-21. https://doi.org/10.3758/PBR.15.6.1117

      Hazeltine E, Lightman E, Schwarb H, Schumacher EH (2011). The boundaries of sequential modulations: evidence for set-level control. Journal of Experimental Psychology: Human Perception & Performance. 2011 Dec;37(6):1898-914. https://doi.org/10.1037/a0024662

      Braem, S., Abrahamse, E. L., Duthoo, W., & Notebaert, W. (2014). What determines the specificity of conflict adaptation? A review, critical analysis, and proposed synthesis. Frontiers in Psychology, 5, 1134. https://doi.org/10.3389/fpsyg.2014.01134

      Kikumoto A, Mayr U. (2020). Conjunctive representations that integrate stimuli, responses, and rules are critical for action selection. Proceedings of the National Academy of Sciences, 117(19):10603-10608. https://doi.org/10.1073/pnas.1922166117.

      Mill, R. D., & Cole, M. W. (2023). Neural representation dynamics reveal computational principles of cognitive task learning. bioRxiv. https://doi.org/10.1101/2023.06.27.546751

      Dagher, A., Owen, A. M., Boecker, H., & Brooks, D. J. (1999). Mapping the network for planning: a correlational PET activation study with the Tower of London task. Brain, 122 ( Pt 10), 1973-1987. https://doi.org/10.1093/brain/122.10.1973

      Ritz, H., & Shenhav, A. (2023). Orthogonal neural encoding of targets and distractors supports multivariate cognitive control. https://doi.org/10.1101/2022.12.01.518771

      1. (Public Reviews) The degree of Stroop vs Simon conflict is perfectly negatively correlated across conditions. This limits their interpretation of an integrated cognitive space, as they cannot separately measure Stroop and Simon effects. The author's control analyses have limited ability to overcome this task limitation. While these results are consistent with parametric encoding, they cannot adjudicate between combined vs separated representations.

      (Recommendations) I think that it is still an issue that the task's two features (stroop and simon conflict) are perfectly correlated. This fundamentally limits their ability to measure the similarity in these features. The authors provide several control analyses, but I think these are limited.

      Response: We need to acknowledge that the spatial Stroop and Simon components in the five conflict conditions were not “perfectly” correlated, with r = –0.89. This leaves some room for the preliminary model comparison to adjudicate between these models. However, it’s essential to note that conclusions based on these results must be tempered. In line with the reviewer’s observation, we agree that the high correlation between the two conflict sources posed a potential limitation on our ability to independently investigate the contribution of spatial Stroop and Simon conflicts. Therefore, in addition to the limitation we have previously acknowledged, we have now further revised our conclusion and adjusted our expressions accordingly.

      Specifically, we now regard the parametric encoding of cognitive control not as direct evidence of the cognitive space view but as preliminary evidence that led us to propose this hypothesis, which requires further testing. Notably, we have also modified the title from “Conflicts are represented in a cognitive space to reconcile domain-general and domain-specific cognitive control” to “Conflicts are parametrically encoded: initial evidence for a cognitive space view to reconcile the debate of domain-general and domain-specific cognitive control”. Also, we revised the conclusion as: In sum, we showed that the cognitive control can be parametrically encoded in the right dlPFC and guides cognitive control to adjust goal-directed behavior. This finding suggests that different cognitive control states may be encoded in an abstract cognitive space, which reconciles the long-standing debate between the domain-general and domain-specific views of cognitive control and provides a parsimonious and more broadly applicable framework for understanding how our brains efficiently and flexibly represents multiple task settings.

      From Recommendations The authors perform control analyses that test stroop-only and simon-only models. However, these analyses use a totally different similarity metric, that's based on set intersection rather than geometry. This metric had limited justification or explanation, and it's not clear whether these models fit worse because of the similarity metric. Even here, Simon-only model fit better than Stroop+Simon model. The dimensionality analyses may reflect the 1d manipulation by the authors (i.e. perfectly corrected stroop and simon effects).

      Response: The Jaccard measure is the most suitable method we can conceive of for assessing the similarity between two conflicts when establishing the Stroop-only and Simon-only models, achieved by projecting them onto the vertical or horizontal axes, respectively (Author response image 1A). This approach offers two advantages. First, the Jaccard similarity combines both similarity (as reflected by the numerator) and distance (reflected by the difference between denominator and numerator) without bias towards either. Second, the Jaccard similarity in our design is equivalent to the cosine similarity because the denominator in the cosine similarity is identical to the denominator in the Jaccard similarity (both are the radius of the circle, Author response image 1B).

      Author response image 1.

      Definition of Jaccard similarity. A) Two conflicts (1 and 2) are projected onto the spatial Stroop/Simon axis in the Stroop/Simon-only model, respectively. The Jaccard similarity for Stroop-only and Simon-only model are and respectively. Letters a-d are the projected vectors from the two conflicts to the two axes. Blue and red colors indicate the conflict conditions. Shorter vectors are the intersection and longer vectors are the union. B) According to the cosine similarity model, the similarity is defined as , where e is the projected vector from conflict 1 to conflict 2, and g is the vector of conflict 1. The Jaccard similarity for this case is defined by , where f is the projector vector from conflict 2 to itself. Because f = g in our design, the Jaccard similarity is equivalent to the cosine similarity.

      Therefore, we believe that the model comparisons between cosine similarity model and the Stroop/Simon-Only models were equitable. However, we acknowledge the reviewer’s and other reviewers’ concerns about the correlation between spatial Stroop and Simon conflicts, which reduces the space to one dimension (1d) and limits our ability to distinguish between the Stroop-only and Simon-only models, as well as between Stroop+Simon and cosine similarity models. While these distinctions are undoubtedly important for understanding the geometry of the cognitive space, we recognize that they go beyond the major objective of this study, that is, to differentiate the cosine similarity model from domain-general/specific models. Therefore, we have chosen to exclude the Stroop-only, Simon-only and Stroop+Simon models in our revised manuscript.

      Something that raised additional concerns are the RSMs in the key region of interest (Fig S5). The pure stroop task appears to be represented very differently from all of the conditions that include simon conflict.

      Together, I think these limitations reflect the structure of the task and research goals, not the statistical approach (which has been meaningfully improved).

      Response: We appreciate the reviewer for pointing this out. It is essential to clarify that our conclusions were based on the significant similarity modulation effect identified in our statistical analysis using the cosine similarity model, where we did not distinguish between the within-Stroop condition and the other four within-conflict conditions (Fig. 7A, now Fig. 8A). This means that the representation of conflict type was not biased by the seemingly disparities in the values shown here. Moreover, to specifically test the differences between the within-Stroop condition and the other within-conflict conditions, we conducted a mixed-effect model analysis only including trial pairs from the same conflict type. In this analysis, the primary predictor was the cross-condition difference (0 for within-Stroop condition and 1 for other within-conflict conditions). The results showed no significant cross-condition difference in either the incongruent (t = 1.22, p = .23) or the congruent (t = 1.06, p = .29) trials. Thus, we believe the evidence for different similarities is inconclusive in our data and decided not to interpret this numerical difference. We have added this note in the revised figure caption for Figure S5.

      Author response image 2.

      Fig. S5. The stronger conflict type similarity effect in incongruent versus congruent conditions. (A) Summary representational similarity matrices for the right 8C region in incongruent (left) and congruent (right) conditions, respectively. Each cell represents the averaged Pearson correlation of cells with the same conflict type and congruency in the 1400×1400 matrix. Note that the seemingly disparities in the values of Stroop and other within-conflict cells (i.e., the diagonal) did not reach significance for either incongruent (t = 1.22, p = .23) or congruent (t = 1.06, p = .29) trials. (2) Scatter plot showing the averaged neural similarity (Pearson correlation) as a function of conflict type similarity in both conditions. The values in both A and B are calculated from raw Pearson correlation values, in contrast to the z-scored values in Fig. 4D.

      Minor:

      • In the analysis of similarity_orientation, the df is very large (~14000). Here, and throughout, the df should be reflective of the population of subjects (ie be less than the sample size).

      Response: The large degrees of freedom (df) in our analysis stem from the fact that we utilized a mixed-effect linear model, incorporating all data points (a total of 400×35=14000). In mixed-effect models, the df is determined by subtracting the number of fixed effects (in our case, 7) from the total number of observations. Notably, we are in line with the literature that have reported the df in this manner (e.g., Iravani et al., 2021; Schmidt & Weissman, 2015; Natraj et al., 2022).

      Reference:

      Iravani B, Schaefer M, Wilson DA, Arshamian A, Lundström JN. The human olfactory bulb processes odor valence representation and cues motor avoidance behavior. Proc Natl Acad Sci U S A. 2021 Oct 19;118(42):e2101209118. https://doi.org/10.1073/pnas.2101209118.

      Schmidt, J.R., Weissman, D.H. Congruency sequence effects and previous response times: conflict adaptation or temporal learning?. Psychological Research 80, 590–607 (2016). https://doi.org/10.1007/s00426-015-0681-x.

      Natraj, N., Silversmith, D. B., Chang, E. F., & Ganguly, K. (2022). Compartmentalized dynamics within a common multi-area mesoscale manifold represent a repertoire of human hand movements. Neuron, 110(1), 154-174. https://doi.org/10.1016/j.neuron.2021.10.002.

      • it would improve the readability if there was more didactic justification for why analyses are done a certain way (eg justifying the jaccard metric). This will help less technically-savvy readers.

      Response: We appreciate the reviewer’s suggestion. However, considering the Stroop/Simon-only models in our design may not be a valid approach for distinguishing the contributions of the Stroop/Simon components, we have decided not to include the Jaccard metrics in our revised manuscript.

      Besides, to improve the readability, we have moved Figure S4 to the main text (labeled as Figure 7), and added the domain-general/domain-specific schematics in Figure 8.

      Author response image 3.

      Figure 8. Schematic of key RSMs. (A) and (B) show the orthogonality between conflict similarity and orientation RSMs. The within-subject RSMs (e.g., Group1-Group1) for conflict similarity and orientation are all the same, but the cross-group correlations (e.g., Group2-Group1) are different. Therefore, we can separate the contribution of these two effects when including them as different regressors in the same linear regression model. (C) and (D) show the two alternative models. Like the cosine model (A), within-group trial pairs resemble between-group trial pairs in these two models. The domain-specific model is an identity matrix. The domain-general model is estimated from the absolute difference of behavioral congruency effect, but scaled to 0(lowest similarity)-1(highest similarity) to aid comparison. The plotted matrices here include only one subject each from Group 1 and Group 2. Numbers 1-5 indicate the conflict type conditions, for spatial Stroop, StHSmL, StMSmM, StLSmH, and Simon, respectively. The thin lines separate four different sub-conditions, i.e., target arrow (up, down) × congruency (incongruent, congruent), within each conflict type.

      Reviewer #2:

      This study examines the construct of "cognitive spaces" as they relate to neural coding schemes present in response conflict tasks. The authors use a novel experimental design in which different types of response conflict (spatial Stroop, Simon) are parametrically manipulated. These conflict types are hypothesized to be encoded jointly, within an abstract "cognitive space", in which distances between task conditions depend only on the similarity of conflict types (i.e., where conditions with similar relative proportions of spatial-Stroop versus Simon conflicts are represented with similar activity patterns). Authors contrast such a representational scheme for conflict with several other conceptually distinct schemes, including a domain-general, domain-specific, and two task-specific schemes. The authors conduct a behavioral and fMRI study to test which of these coding schemes is used by prefrontal cortex. Replicating the authors' prior work, this study demonstrates that sequential behavioral adjustments (the congruency sequence effect) are modulated as a function of the similarity between conflict types. In fMRI data, univariate analyses identified activation in left prefrontal and dorsomedial frontal cortex that was modulated by the amount of Stroop or Simon conflict present, and representational similarity analyses (RSA) that identified coding of conflict similarity, as predicted under the cognitive space model, in right lateral prefrontal cortex.

      This study tackles an important question regarding how distinct types of conflict might be encoded in the brain within a computationally efficient representational format. The ideas postulated by the authors are interesting ones and the statistical methods are generally rigorous.

      Response: We would like to express our sincere appreciation for the reviewer’s positive evaluation of our manuscript and the constructive comments and suggestions. In response to your suggestions and concerns, we excluded the StroopOnly, SimonOnly and Stroop+Simon models, and added the schematic of domain-general/specific model RSMs. We have provided detailed responses to your comments below.

      The evidence supporting the authors claims, however, is limited by confounds in the experimental design and by lack of clarity in reporting the testing of alternative hypotheses within the method and results.

      1. Model comparison

      The authors commendably performed a model comparison within their study, in which they formalized alternative hypotheses to their cognitive space hypothesis. We greatly appreciate the motivation for this idea and think that it strengthened the manuscript. Nevertheless, some details of this model comparison were difficult for us to understand, which in turn has limited our understanding of the strength of the findings.

      The text indicates the domain-general model was computed by taking the difference in congruency effects per conflict condition. Does this refer to the "absolute difference" between congruency effects? In the rest of this review, we assume that the absolute difference was indeed used, as using a signed difference would not make sense in this setting. Nevertheless, it may help readers to add this information to the text.

      Response: We apologize for any confusion. The “difference” here indeed refers to the “absolute difference” between congruency effects. We have now clarified this by adding the word “absolute” accordingly.

      "Therefore, we defined the domain-general matrix as the absolute difference in their congruency effects indexed by the group-averaged RT in Experiment 2."

      Regarding the Stroop-Only and Simon-Only models, the motivation for using the Jaccard metric was unclear. From our reading, it seems that all of the other models --- the cognitive space model, the domain-general model, and the domain-specific model --- effectively use a Euclidean distance metric. (Although the cognitive space model is parameterized with cosine similarities, these similarity values are proportional to Euclidean distances because the points all lie on a circle. And, although the domain-general model is parameterized with absolute differences, the absolute difference is equivalent to Euclidean distance in 1D.) Given these considerations, the use of Jaccard seems to differ from the other models, in terms of parameterization, and thus potentially also in terms of underlying assumptions. Could authors help us understand why this distance metric was used instead of Euclidean distance? Additionally, if Jaccard must be used because this metric seems to be non-standard in the use of RSA, it would likely be helpful for many readers to give a little more explanation about how it was calculated.

      Response: We believe that the Jaccard similarity measure is consistent with the Cosine similarity measure. The Jaccard similarity is calculated as the intersection divided by the union. To define the similarity of two conflicts in the Stroop-only and Simon-only models, we first project them onto the vertical or horizontal axes, respectively (as shown in Author response image 1A). The Jaccard similarity in our design is equivalent to the cosine similarity because the denominator in the Jaccard similarity is identical to the denominator in the cosine similarity (both are the radius of the circle, Author response image 1B).

      However, it is important to note that a cosine similarity cannot be defined when conflicts are projected onto spatial Stroop or Simon axis simultaneously. Therefore, we used the Jaccard similarity in the previous version of our manuscript.

      Author response image 4.

      Definition of Jaccard similarity. A) Two conflicts (1 and 2) are projected onto the spatial Stroop/Simon axis in the Stroop/Simon-only model, respectively. The Jaccard similarity for Stroop-only and Simon-only model are and respectively. Letters a-d are the projected vectors from the two conflicts to the two axes. Blue and red colors indicate the conflict conditions. Shorter vectors are the intersection and longer vectors are the union. B) According to the cosine similarity model, the similarity is defined as , where e is the projected vector from conflict 1 to conflict 2, and g is the vector of conflict 1. The Jaccard similarity for this case is defined by , where f is the projector vector from conflict 2 to itself. Because f = g in our design, the Jaccard similarity is equivalent to the cosine similarity.

      However, we agree with the reviewer’s and other reviewers’ concern that the correlation between spatial Stroop and Simon conflicts makes it less likely to distinguish the Stroop+Simon from cosine similarity models. While distinguishing them is essential to understand the detailed geometry of the cognitive space, it is beyond our major purpose, that is, to distinguish the cosine similarity model with the domain-general/specific models. Therefore, we have chosen to exclude the Stroop-only, Simon-only and Stroop+Simon models from our revised manuscript.

      When considering parameterizing the Stroop-Only and Simon-Only models with Euclidean distances, one concern we had is that the joint inclusion of these models might render the cognitive space model unidentifiable due to collinearity (i.e., the sum of the Stroop-Only and Simon-Only models could be collinear with the cognitive space model). Could the authors determine whether this is the case? This issue seems to be important, as the presence of such collinearity would suggest to us that the design is incapable of discriminating those hypotheses as parameterized.

      Response: We acknowledge that our design does not allow for a complete differentiation between the parallel encoding (StroopOnly+SimonOnly) model and the cognitive space model, given their high correlation (r = 0.85). However, it is important to note that the StroopOnly+SimonOnly model introduces more free parameters, making the model fitting poorer than the cognitive space model.

      Additionally, the cognitive space model also shows high correlations with the StroopOnly and SimonOnly models (both rs = 0.66). It is crucial to emphasize that our study’s primary goal does not involve testing the parallel encoding hypothesis (through the StroopOnly+SimonOnly model). As a result, we have chosen to remove the model comparison results with the StroopOnly, SimonOnly and StroopOnly+SimonOnly models. Instead, the cognitive space model shows lower correlation with the purely domain-general (r = −0.16) and domain-specific (r = 0.46) models.

      1. Issue of uniquely identifying conflict coding

      We certainly appreciate the efforts that authors have taken to address potential confounders for encoding of conflict in their original submission. We broach this question not because we wish authors to conduct additional control analyses, but because this issue seems to be central to the thesis of the manuscript and we would value reading the authors' thoughts on this issue in the discussion.

      To summarize our concerns, conflict seems to be a difficult variable to isolate within aggregate neural activity, at least relative to other variables typically studied in cognitive control, such as task-set or rule coding. This is because it seems reasonable to expect that many more nuisance factors covary with conflict -- such as univariate activation, level of cortical recruitment, performance measures, arousal --- than in comparison with, for example, a well-designed rule manipulation. Controlling for some of these factors post-hoc through regression is commendable (as authors have done here), but such a method will likely be incomplete and can provide no guarantees on the false positive rate.

      Relatedly, the neural correlates of conflict coding in fMRI and other aggregate measures of neural activity are likely of heterogeneous provenance, potentially including rate coding (Fu et al., 2022), temporal coding (Smith et al., 2019), modulation of coding of other more concrete variables (Ebitz et al., 2020, 10.1101/2020.03.14.991745; see also discussion and reviews of Tang et al., 2016, 10.7554/eLife.12352), or neuromodulatory effects (e.g., Aston-Jones & Cohen, 2005). Some of these origins would seem to be consistent with "explicit" coding of conflict (conflict as a representation), but others would seem to be more consistent with epiphenomenal coding of conflict (i.e., conflict as an emergent process). Again, these concerns could apply to many variables as measured via fMRI, but at the same time, they seem to be more pernicious in the case of conflict. So, if authors consider these issues to be germane, perhaps they could explicitly state in the discussion whether adopting their cognitive space perspective implies a particular stance on these issues, how they interpret their results with respect to these issues, and if relevant, qualify their conclusions with uncertainty on these issues.

      Response: We appreciate the reviewer’s insightful comments regarding the representation and process of conflict.

      First, we agree that the conflict is not simply a pure feature like a stimulus but often arises from the interaction (e.g., dimension overlap) between two or more aspects. For example, in the manual Stroop, conflict emerges from the inconsistent semantic information between color naming and word reading. Similarly, other higher-order cognitive processes such as task-set also underlie the relationship between concrete aspects. For instance, in a face/house categorization task, the taskset is the association between face/house and the responses. When studying these higher-order processes, it is often impossible to completely isolate them from bottomup features. Therefore, methods like the representational similarity analysis and regression models are among the limited tools available to attempt to dissociate these concrete factors from conflict representation. While not perfect, this approach has been suggested and utilized in practice (Freund et al., 2021).

      Second, we agree that conflict can be both a representation and an emerging process. These two perspectives are not necessarily contradictory. According to David Marr’s influential three-level theory (Marr, 1982), representation is the algorithm of the process to achieve a goal based on the input. Therefore, a representation can refer to not only a static stimulus (e.g., the visual representation of an image), but also a dynamic process. Building on this perspective, we posit that the representation of cognitive control consists of an array of dynamic representations embedded within the overall process. A similar idea has been proposed that the abstract task profiles can be progressively constructed as a representation in our brain (Kikumoto & Mayr, 2020).

      We have incorporated this discussion into the manuscript:

      "Recently an interesting debate has arisen concerning whether cognitive control should be considered as a process or a representation (Freund, Etzel, et al., 2021). Traditionally, cognitive control has been predominantly viewed as a process. However, the study of its representation has gained more and more attention. While it may not be as straightforward as the visual representation (e.g., creating a mental image from a real image in the visual area), cognitive control can have its own form of representation. An influential theory, Marr’s (1982) three-level model proposed that representation serves as the algorithm of the process to achieve a goal based on the input. In other words, representation can encompass a dynamic process rather than being limited to static stimuli. Building on this perspective, we posit that the representation of cognitive control consists of an array of dynamic representations embedded within the overall process. A similar idea has been proposed that the representation of task profiles can be progressively constructed with time in the brain (Kikumoto & Mayr, 2020)."

      Reference:

      Freund, M. C., Etzel, J. A., & Braver, T. S. (2021). Neural Coding of Cognitive Control: The Representational Similarity Analysis Approach. Trends in Cognitive Sciences, 25(7), 622-638. https://doi.org/10.1016/j.tics.2021.03.011

      Marr, D. C. (1982). Vision: A computational investigation into human representation and information processing. New York: W.H. Freeman.

      Kikumoto A, Mayr U. (2020). Conjunctive representations that integrate stimuli, responses, and rules are critical for action selection. Proceedings of the National Academy of Sciences, 117(19):10603-10608. https://doi.org/10.1073/pnas.1922166117.

      1. Interpretation of measured geometry in 8C

      We appreciate the inclusion of the measured similarity matrices of area 8C, the key area the results focus on, to the supplemental, as this allows for a relatively model-agnostic look at a portion of the data. Interestingly, the measured similarity matrix seems to mismatch the cognitive space model in a potentially substantive way. Although the model predicts that the "pure" Stroop and Simon conditions will have maximal self-similarity (i.e., the Stroop-Stroop and Simon-Simon cells on the diagonal), these correlations actually seem to be the lowest, by what appears to be a substantial margin (particularly the Stroop-Stroop similarities). What should readers make of this apparent mismatch? Perhaps authors could offer their interpretation on how this mismatch could fit with their conclusions.

      Response: We appreciate the reviewer for bringing this to our attention. It is essential to clarify that our conclusions were based on the significant similarity modulation effect observed in our statistical analysis using the cosine similarity model, where we did not distinguish between the within-Stroop condition and the other four withinconflict conditions (Fig. 7A). This means that the representation of conflict type was not biased by the seemingly disparities in the values shown here. Moreover, to specifically address the potential differences between the within-Stroop condition and the other within-conflict conditions, we conducted a mixed-effect model. In this analysis, the primary predictor was the cross-condition difference (0 for within-Stroop condition and 1 for other within-conflict conditions). The results showed no significant cross-condition difference in either the incongruent trials (t = 1.22, p = .23) or the congruent (t = 1.06, p = .29) trials. Thus, we believe the evidence for different similarities is inconclusive in our data and decided not to interpret this numerical difference.

      We have added this note in the revised figure caption for Figure S5.

      Author response image 5.

      Fig. S5. The stronger conflict type similarity effect in incongruent versus congruent conditions. (A) Summary representational similarity matrices for the right 8C region in incongruent (left) and congruent (right) conditions, respectively. Each cell represents the averaged Pearson correlation of cells with the same conflict type and congruency in the 1400×1400 matrix. Note that the seemingly disparities in the values of Stroop and other within-conflict cells (i.e., the diagonal) did not reach significance for either incongruent (t = 1.22, p = .23) or congruent (t = 1.06, p = .29) trials. (2) Scatter plot showing the averaged neural similarity (Pearson correlation) as a function of conflict type similarity in both conditions. The values in both A and B are calculated from raw Pearson correlation values, in contrast to the z-scored values in Fig. 4D.

      1. It would likely improve clarity if all of the competing models were displayed as summarized RSA matrices in a single figure, similar to (or perhaps combined with) Figure 7.

      Response: We appreciate the reviewer’s suggestion. We now have incorporated the domain-general and domain-specific models into the Figure 7 (now Figure 8).

      Author response image 6.

      Figure 8. Schematic of key RSMs. (A) and (B) show the orthogonality between conflict similarity and orientation RSMs. The within-subject RSMs (e.g., Group1-Group1) for conflict similarity and orientation are all the same, but the cross-group correlations (e.g., Group2-Group1) are different. Therefore, we can separate the contribution of these two effects when including them as different regressors in the same linear regression model. (C) and (D) show the two alternative models. Like the cosine model (A), within-group trial pairs resemble between-group trial pairs in these two models. The domain-specific model is an identity matrix. The domain-general model is estimated from the absolute difference of behavioral congruency effect, but scaled to 0(lowest similarity)-1(highest similarity) to aid comparison. The plotted matrices here include only one subject each from Group 1 and Group 2. Numbers 1-5 indicate the conflict type conditions, for spatial Stroop, StHSmL, StMSmM, StLSmH, and Simon, respectively. The thin lines separate four different sub-conditions, i.e., target arrow (up, down) × congruency (incongruent, congruent), within each conflict type.

      1. Because this model comparison is key to the main inferences in the study, it might also be helpful for most readers to move all of these RSA model matrices to the main text, instead of in the supplemental.

      Response: We thank the reviewer for this suggestion. We have moved the Fig. S4 to the main text, labeled as the new Figure 7.

      1. It may be worthwhile to check how robust the observed brain-behavior association (Fig 4C) is to the exclusion of the two datapoints with the lowest neural representation strength measure, as these points look like they have high leverage.

      Response: We calculated the Pearson correlation after excluding the two points and found it does not affect the results too much, with the r = 0.50, p = .003 (compared to the original r = 0.52, p = .001).

      Additionally, we found the two axes were mistakenly shifted in Fig 4C. Therefore, we corrected this error in the revised manuscript. The correlation results would not be influenced.

      Author response image 7.

      Fig. 4. The conflict type effect. (A) Brain regions surviving the Bonferroni correction (p < 0.0001) across the regions (criterion 1). Labeled regions are those meeting the criterion 2. (B) Different encoding of conflict type in the incongruent with congruent conditions. * Bonferroni corrected p < .05. (C) The brain-behavior correlation of the right 8C (criterion 3). The x-axis shows the beta coefficient of the conflict type effect from the RSA, and the y-axis shows the beta coefficient obtained from the behavioral linear model using the conflict similarity to predict the CSE in Experiment 2. (D) Illustration of the different encoding strength of conflict type similarity in incongruent versus congruent conditions of right 8C. The y-axis is derived from the z-scored Pearson correlation coefficient, consistent with the RSA methodology. See Fig. S4B for a plot with the raw Pearson correlation measurement. l = left; r = right.

      Reviewer #3:

      Yang and colleagues investigated whether information on two task-irrelevant features that induce response conflict is represented in a common cognitive space. To test this, the authors used a task that combines the spatial Stroop conflict and the Simon effect. This task reliably produces a beautiful graded congruency sequence effect (CSE), where the cost of congruency is reduced after incongruent trials. The authors measured fMRI to identify brain regions that represent the graded similarity of conflict types, the congruency of responses, and the visual features that induce conflicts. They applied univariate, multivariate, and connectivity analyses to fMRI data to identify brain regions that represent the graded similarity of conflict types, the congruency of responses, and the visual features that induce conflicts. They further directly assessed the dimensionality of represented conflict space.

      The authors identified the right dlPFC (right 8C), which shows 1) stronger encoding of graded similarity of conflicts in incongruent trials and 2) a positive correlation between the strength of conflict similarity type and the CSE on behavior. The dlPFC has been shown to be important for cognitive control tasks. As the dlPFC did not show a univariate parametric modulation based on the higher or lower component of one type of conflict (e.g., having more spatial Stroop conflict or less Simon conflict), it implies that dissimilarity of conflicts is represented by a linear increase or decrease of neural responses. Therefore, the similarity of conflict is represented in multivariate neural responses that combine two sources of conflict.

      The strength of the current approach lies in the clear effect of parametric modulation of conflict similarity across different conflict types. The authors employed a clever cross-subject RSA that counterbalanced and isolated the targeted effect of conflict similarity, decorrelating orientation similarity of stimulus positions that would otherwise be correlated with conflict similarity. A pattern of neural response seems to exist that maps different types of conflict, where each type is defined by the parametric gradation of the yoked spatial Stroop conflict and the Simon conflict on a similarity scale. The similarity of patterns increases in incongruent trials and is correlated with CSE modulation of behavior.

      The main significance of the paper lies in the evidence supporting the use of an organized "cognitive space" to represent conflict information as a general control strategy. The authors thoroughly test this idea using multiple approaches and provide convincing support for their findings. However, the universality of this cognitive strategy remains an open question.

      (Public Reviews) Taken together, this study presents an exciting possibility that information requiring high levels of cognitive control could be flexibly mapped into cognitive map-like representations that both benefit and bias our behavior. Further characterization of the representational geometry and generalization of the current results look promising ways to understand representations for cognitive control.

      Response: We would like to thank the reviewer for the positive evaluation of our manuscript and for providing constructive comments. In response to your suggestions, we have acknowledged the potential limitation of the design and the cross-subject RSA approach, and incorporated the open questions to the discussions. Please find our detailed responses below.

      The task presented in the study involved two sources of conflict information through a single salient visual input, which might have encouraged the utilization of a common space.

      Response: We agree that the unified visual input in our design may have facilitated the utilization of a common space. However, we believe the stimuli are not necessarily unified in the construction of the common space. To further test the potential interaction between the concrete stimulus setting and the cognitive space representation, it is necessary to use varied stimuli in future research. We have left this as an open question in the discussion:

      Can we effectively map any sources of conflict with completely different stimuli into a single space?

      The similarity space was analyzed at the level of between-individuals (i.e., crosssubject RSA) to mitigate potential confounds in the design, such as congruency and the orientation of stimulus positions. This approach makes it challenging to establish a direct link between the quality of conflict space representation and the patterns of behavioral adaptations within individuals.

      Response: By setting the variables as random effects at the subject level, we have extracted the individual effects that incorporate both the group-level fixed effects and individual-level random effects. We believe this approach yields results that are as reliable, if not more, than effects calculated from individual data only. First, the mixed effect linear (LME) model has included all the individual data, forming the basis for establishing random effects. Therefore, the individual effects derived from this approach inherently reflect the individual-specific effects. To support this notion, we have included a simulation script (accessible in the online file “simulation_LME.mlx” at https://osf.io/rcq8w) to demonstrate the strong consistency between the two approaches (see Author response image 8). In this simulation, we generated random data (Y) for 35 subjects, each containing 20 repeated measurements across 5 conditions. To streamline the simulation, we only included one predictor (X), which was treated as both fixed and random effects at the subject level. We applied two methods to calculate the individual beta coefficient. The first involved extracting individual beta coefficients from the LME model by summing the fixed effect with the subject-specific random effect. The second method was entailed conducting a regression analysis using data from each subject to obtain the slope. We tested their consistency by calculating the Pearson correlation between the derived beta coefficients. This simulation was repeated 100 times.

      Author response image 8.

      The consistent individual beta coefficients between the mixed effect model and the individual regression analysis. A) The distribution of Pearson correlation between the two methods for 100 times. B) An example from the simulation showing the highly correlated results from the two methods. Each data point indicates a subject (n=35).

      Second, the potential difference between the two methods lies in that the LME model have also taken the group-level variance into account, such as the dissociable variances of the conflict similarity and orientation across subject groups. This enabled us to extract relatively cleaner conflict similarity effects for each subject, which we believe can be better linked to the individual behavioral adaptations. Moreover, we have extracted the behavioral adaptations scores (i.e., the similarity modulation effect on CSE) using a similar LME approach. Conducting behavioral analysis solely using individual data would have been less reliable, given the limited sample size of individual data (~32 points per subject). This also motivated us to maintain consistency by extracting individual neural effects using LME models.

      Furthermore, it remains unclear at which cognitive stages during response selection such a unified space is recruited. Can we effectively map any sources of conflict into a single scale? Is this unified space adaptively adjusted within the same brain region? Additionally, does the amount of conflict solely define the dimensions of this unified space across many conflict-inducing tasks? These questions remain open for future studies to address.

      Response: We appreciate the reviewer’s constructive open questions. We respond to each of them based on our current understanding.

      1) It remains unclear at which cognitive stages during response selection such a unified space is recruited.

      We anticipate that the cognitive space is recruited to guide the transference of behavioral CSE at two critical stages. The first stage involves the evaluation of control demands, where the representational distance/similarity between previous and current trials influences the adjustment of cognitive control. The second stage pertains to is control execution, where the switch from one control state to another follows a path within the cognitive space. It is worth noting that future studies aiming to address this question may benefit from methodologies with higher temporal resolutions, such as EEG and MEG, to provide more precise insights into the temporal dynamics of the process of cognitive space recruitment.

      2) Can we effectively map any sources of conflict into a single scale?

      It is possible that various sources of conflict can be mapped onto the same space based on their similarity, even if finding such an operational defined similarity may be challenging. However, our results may offer an approach to infer the similarity between two conflicts. One way is to examine their congruency sequence effect (CSE), with a stronger CSE suggesting greater similarity. The other way is to test their representational similarity within the dorsolateral prefrontal cortex.

      3) Is this unified space adaptively adjusted within the same brain region? We do not have an answer to this question. We showed that the cognitive space does not change with time (Note. S3). What have adjusted is the control demand to resolve the quickly changing conflict conditions from trial to trial. Though, it is an interesting question whether the cognitive space may be altered, for example, when the mental state changes significantly. And if yes, we can further test whether the change of cognitive space is also within the right dlPFC.

      4) Additionally, does the amount of conflict solely define the dimensions of this unified space across many conflict-inducing tasks?

      Our understanding of this comment is that the amount of conflict refers to the number of conflict sources. Based on our current finding, the dimensions of the space are indeed defined by how many different conflict sources are included. However, this would require the different conflict sources are orthogonal. If some sources share some aspects, the cognitive space may collapse to a lower dimension. We have incorporated the first question into the discussion:

      Moreover, we anticipate that the representation of cognitive space is most prominently involved at two critical stages to guide the transference of behavioral CSE. The first stage involves the evaluation of control demands, where the representational distance/similarity between previous and current trials influences the adjustment of cognitive control. The second stage pertains to control execution, where the switch from one control state to another follows a path within the cognitive space. However, we were unable to fully distinguish between these two stages due to the low temporal resolution of fMRI signals in our study. Future research seeking to delve deeper into this question may benefit from methodologies with higher temporal resolutions, such as EEG and MEG.

      We have included the other questions into the manuscript as open questions, calling for future research.

      Several interesting questions remains to be answered. For example, is the dimension of the unified space across conflict-inducing tasks solely determined by the number of conflict sources? Can we effectively map any sources of conflict with completely different stimuli into a single space? Is the cognitive space geometry modulated by the mental state? If yes, what brain regions mediate the change of cognitive space?

      Minor comments:

      • The original comment about out-of-sample predictions to examine the continuity of the space was a suggestion for testing neural representations, not behavior (I apologize for the lack of clarity). Given the low dimensionality of the conflict space shown by the participation ratio, we expect that linear separability exists only among specific combinations of conditions. For example, the pair of conflicts 1 and 5 together is not linearly separable from conflicts 2 and 3. But combined with other results, this is already implied.

      Response: We apologize for the misunderstanding. In fact, performing a prediction analysis using the extensive RSM in our study does presents certain challenges, primarily due to its substantial size (1400x1400) and the intricate nature of the mixed-effect linear model. In our efforts to simplify the prediction process by excluding random effects, we did observe a correlation between the predicted and original values, albeit a relatively small Pearson correlation coefficient of r = 0.024, p < .001. This small correlation can be attributed to two key factors. First, the exclusion of data points impacts not only the conflict similarity regressor but also other regressors within the model, thereby diminishing the predictive power. Secondly, the large amount of data points in the model heightens the risk of overfitting, subsequently reducing the model’s capacity for generalization and increasing the likelihood of unreliable predictions. Given these potential problems, we have opted not to include this prediction in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Editor's note:

      Thank you for taking time and efforts to improve this study. After re-review, two reviewers have a consensus that the connections the fatty acids and sperm motility is still ambiguous. Thus, I recommend to further tone down this conclusion consistently in the title and the text pointed out by reviewers before making a final version of record.

      We sincerely appreciate the considerable time and effort you and the reviewers devoted to evaluating our manuscript. We have revised the title and text to express the relationship between fatty acids and sperm motility more consistently and toned down. With these revisions, we would like to proceed with publishing the manuscript as the Version of Record (VoR). Thank you very much for your guidance in improving our study.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this revised report, Yamanaka and colleagues investigate a proposed mechanism by which testosterone modulates seminal plasma metabolites in mice. Based on limited evidence in previous versions of the report, the authors softened the claim that oleic acid derived from seminal vesicle epithelium strongly affects linear progressive motility in isolated cauda epididymal sperm in vitro. Though the report still contains somewhat ambiguous references to the strength of the relationship between fatty acids and sperm motility.

      Strengths:

      Often, reported epidydimal sperm from mice have lower percent progressive motility compared with sperm retrieved from the uterus or by comparison with human ejaculated sperm. The findings in this report may improve in vitro conditions to overcome this problem, as well as add important physiological context to the role of reproductive tract glandular secretions in modulating sperm behaviors. The strongest observations are related to the sensitivity of seminal vesicle epithelial cells to testosterone. The revisions include the addition of methodological detail, modified language to reflect the nuance of some of the measurements, as well as re-performed experiments with more appropriate control groups. The findings are likely to be of general interest to the field by providing context for follow-on studies regarding the relationship between fatty acid beta oxidation and sperm motility pattern.

      Weaknesses:

      The connection between media fatty acids and sperm motility pattern remains inconclusive.

      We would like to express our sincere gratitude to the judges for their cooperation in reviewing the manuscript and for your helpful comments, which were instrumental in improving manuscript.

      Reviewer #2 (Public review):

      Using a combination of in vivo studies with testosterone-inhibited and aged mice with lower testosterone levels as well as isolated mouse and human seminal vesicle epithelial cells the authors show that testosterone induces an increase in glucose uptake. They find that testosterone induces a difference in gene expression with a focus on metabolic enzymes. Specifically, they identify increased expression of enzymes regulating cholesterol and fatty acid synthesis, leading to increased production of 18:1 oleic acid. The revised version strengthens the role of ACLY as the main regulator of seminal vesicle epithelial cell metabolic programming. The authors propose that fatty acids are secreted by seminal vesicle epithelial cells and are taken up by sperm, positively affecting sperm function. A lipid mixture mimicking the lipids secreted by seminal vesicle epithelial cells, however, only has a small and mostly non-significant effect on sperm motility, suggesting the authors were not apply to pinpoint the seminal vesicle fluid component that positively affects sperm function.

      We greatly appreciate the reviewer’s thoughtful comments and time spent reviewing this manuscript. The relationship between lipids such as fatty acids and sperm motility remains unclear in the current dataset. Therefore, before finalizing the manuscript, we revised the title and text, as suggested by the reviewers, to express this conclusion more cautiously and consistently.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Some additional comments are provided below to aid the authors in improving the quality of the work:

      Major Comments:

      (1) In the newly added supplemental figure 5, the authors note that the percentage data were arcisine transformed prior to statistical analysis without providing any other justification. This seems strange, especially for such a small sample size. It seems more appropriate for the authors to use a nonparametric test. Forcing symmetry without knowing what the shape of the true distribution is makes the ANOVA hard to interpret. Additionally, why use pairwise comparisons rather than comparing each group to the control (LM 0%). Also, note that the graphs are not individually labeled to distinguish them in the legend (A, B, C, etc.). Ultimately, the treatment differences don't seem that meaningful, even if the authors were able to observe statistical significance with the somewhat over-manipulated method of analysis.

      Ultimately, the conclusion of this experiment (Supplemental figure 5) remains unchanged, but we agree that the relationship between fatty acids and sperm motility remains unclear. Therefore, before finalizing the manuscript, we revised the title and text as pointed out by the reviewers to express this conclusion more cautiously and consistently throughout the manuscript.

      Arcsin transform is commonly used for percentage data [Zar, J.H. 2010. Biostatistical analysis., McDonald, J.H. 2014. Handbook of biological statistics.]. If the values are low or high, such as 0 to 30% or 70 to 100%, without arcsine transformation will result in a large deviation from the normality of the data. However, even if such a conversion is performed, it does not necessarily mean that the assumptions of normality and homogeneity of variance, which are prerequisites for parametric statistical analysis methods, are satisfied.

      Given the small sample size and the possibility of non-normal data, we performed Shapiro–Wilk tests for each group (n = 6) and found no departure from normality (all p > 0.1). Q–Q plots and Levene’s test (p > 0.1) likewise supported the assumptions of ANOVA. Following the reviewer’s recommendation, we repeated the analysis with a Kruskal–Wallis test followed by Dunn’s post-hoc comparisons (Bonferroni corrected). Both approaches led to the same conclusions, with non-parametric p-values equal to or smaller than the parametric ones. In the revised manuscript we now report ANOVA as the primary analysis. The author response image includes effect sizes with 95 % confidence intervals, and provide the non-parametric results for transparency.

      Author response image 1.

      Results of reanalysis of supplementary Figure 5 using nonparametric tests and effect sizes with 95% confidence intervals. Upper part; Differences between groups were assessed by Kruskal–Wallis test, differences among values were analyzed by Dunn’s post-hoc comparisons (Bonferroni corrected) for multiple comparisons. Different letters represent significantly different groups. Lower part; The effect sizes with 95 % confidence intervals. For example, Cliff's Δ = -1 (95% CI ~ -0.6) in VSL's “LM 0 vs LM1” means that LM 1% values exceed LM 0 %values in all pairs.

      (2) I appreciate that the authors toned down the interpretation of the effects of seminal plasma metabolites on sperm motility with a cautionary statement on Lines 397-405 and Line 259. However, they send mixed signals with the title of the report: "Testosterone-Induced Metabolic Changes in Seminal Vesicle Epithelial cells Alter Plasma Components to Enhance Sperm Motility", and on line 265 when the say "ACLY expression is upregulated by testosterone and is essential for the metabolic shift of seminal vesicle epithelial cells that mediates sperm linear motility".

      The wording has been softened overall. The title has been changed to “Testosterone-Induced Metabolic Changes in Seminal Vesicle Epithelium Modify Seminal Plasma Components with Potential to Improve Sperm Motility” In the results (lines 265-266), we have stated that “ACLY expression is upregulated by testosterone and is essential for the metabolic shift that is associated with increased linear motility” without implying a causal relationship.

      Minor Comments:

      (1) Typo on line 31: "understanding the male fertility mechanisms and may perspective for the development of potential biomarkers of male fertility and advance in the treatment of male infertility."

      We have made the following corrections. “These findings suggest that testosterone-dependent lipid remodeling may contribute to sperm straight-line motility, and further functional verification is required.”

      (2) Line 193: the statement is confusing "Therefore, we analyzed mitochondrial metabolism using a flux analyzer, predicting that more glucose is metabolized, pyruvate is metabolized from phosphoenolpyruvic acid through glycolysis in response to testosterone, and is further metabolized in the mitochondria." For example, 'Metabolized through glycolysis' is an ambiguous way to describe the pyruvate kinase reaction. Additionally, phosphoenolpyruvate has three acid ionizable groups, two of which have pKa's well below physiological pH, so phosphoenolpyruvate is the correct intermediate rather than phosphoenolpyruvic acid. The authors make similar mistakes with other organic acids such as citric acid.

      Rewritten as “We therefore examined cellular energy metabolism with a flux analyzer, anticipating that testosterone would elevate glycolytic flux, thereby producing more pyruvate from phosphoenolpyruvate. Because extracellular pyruvate levels simultaneously declined, we inferred that the cells had an increased pyruvate demand and, at that time, hypothesized that the excess pyruvate would enter the mitochondria to support enhanced oxidative metabolism.” (lines 193-198)

      The organic acids are now referenced in their appropriate forms (e.g., citrate, phosphoenolpyruvate).

      (3) Line: 271: "Acly" should be all capitalized to "ACLY". The report mixes capitalizing through out and could be more consistent.

      We appreciate the reviewer’s attention to nomenclature and have standardized the manuscript accordingly. Proteins are written in Roman letters, all in capital letters. Mouse gene symbols: italics, first letter capitalize.

      Reviewer #2 (Recommendations for the authors):

      Major comments:

      (1) 'Once capacitation is complete, sperm cannot maintain that state for a long time'. The publications cited by the author do not support that statement and this reviewer also does not agree. Lower fertilization efficiency from in vitro capacitated epidydimal sperm does not have to mean capacitation is reversed, it can simply mean in vitro capacitation conditions not accurately mimic capacitation in vivo.

      We thank the reviewer for pointing this out and would like to clarify our position. Our statement does not suggest a "reversal" of active capacitation. Rather, it reflects the well-documented fact that capacitation is a transient process. Sperm that undergo capacitation too early cannot maintain that state for long enough to retain their ability to fertilize at the moment and location of fertilization in vivo.

      (2) How do the authors explain the discrepancy between the results shown in Fig. S1E, the increase in sperm motility upon mixing of sperm with SVF and the results reported in Li et al 2025. Mentioning decapacitating factors without further explanation is insufficient.

      We appreciate the reviewer's feedback pointing out the need for a clearer explanation.

      Seminal plasma is inherently binary, containing both decapacitation factors that delay or inhibit capacitation and nutrient substrates that promote sperm motility.

      In vivo, it is believed that the coating of sperm by decapacitation factors is removed by uterine fluid and albumin as it passes through the female reproductive tract [PMID: 22827391, PMID: 24274412]. In contrast, standard fertilization culture media lack a clearance pathway, so decapacitating factors are retained throughout the culture period. As a result, the cleavage rate after in vitro fertilization using sperm exposed to seminal vesicle fluid decreased dramatically.

      Lipids, such as fatty acids, increased sperm motility without directly inducing markers of fertilization. These results suggest that the enhancement of motility by lipids is functionally distinct from the capacitation-inhibiting function of seminal plasma proteins. The data from this study are consistent with the biphasic model. Specifically, decapacitation factors temporarily stabilize the sperm membrane, preventing early capacitation. Meanwhile, lipids enhance sperm motility, enabling them to rapidly pass through the hostile uterine environment.

      (3) This reviewer does not see the merit in including a lipid mixture motility experiment compared to using OA alone. The increase in motility is still small and far from comparable to the motility increase with seminal vesicle fluid. In this reviewer's opinion the experiment is still inconclusive and should not be highlighted in the manuscript title.

      The wording has been softened overall. The title has been changed to “Testosterone-Induced Metabolic Changes in Seminal Vesicle Epithelium Modify Seminal Plasma Components with Potential to Improve Sperm Motility”. (Please see also Reviewer 1's main comment 1)

      Minor comments:

      (1) 'This change includes a large amplitude of flagella' does not make sense. Please correct.

      The following corrections have been made. “This change is characterized by large-amplitude flagellar beating.” (lines 44-45)

    1. Author response:

      The following is the authors’ response to the previous reviews.

      To the Senior Editor and the Reviewing Editor:

      We sincerely appreciate the valuable comments provided by the reviewers, the reviewing editor, and the senior editor. Based on our last response and revision, we are confused by the two limitations noted in the eLife assessment. 

      (1) benchmarking against comparable methods is limited.

      In our last revision, we added the comparison experiments with TNDM, as the reviewers requested. Additionally, it is crucial to emphasize that our evaluation of decoding capabilities of behaviorally relevant signals has been benchmarked against the performance of the ANN on raw signals, which, as Reviewer #1 previously noted, nearly represents the upper limit of performance. Consequently, we believe that our benchmarking methods are sufficiently strong.

      (2) some observations may be a byproduct of their method, and may not constitute new scientific observations.

      We believe that our experimental results are sufficient to demonstrate that our conclusions are not byproducts of d-VAE based on three reasons:

      (1) The d-VAE, as a latent variable model, adheres to the population doctrine, which posits that latent variables are responsible for generating the activities of individual neurons. The goal of such models is to maximize the explanation of the raw signals. At the signal level, the only criterion we can rely on is neural reconstruction performance, in which we have achieved unparalleled results. Thus, it is inappropriate to focus on the mixing process during the model's inference stage while overlooking the crucial de-mixing process during the generation stage and dismissing the significance of our neural reconstruction results. For more details, please refer to the first point in our response to Q4 from Reviewer #4.

      (2) The criterion that irrelevant signals should contain minimal information can effectively demonstrate that our conclusions are not by-products of d-VAE. Unfortunately, the reviewers seem to have overlooked this criterion. For more details, please refer to the third point in our response to Q4 from Reviewer #4

      (3) Our synthetic experimental results also substantiate that our conclusions are not byproducts of d-VAE. However, it appears the reviewers did not give these results adequate consideration. For more details, please refer to the fourth point in our response to Q4 from Reviewer #4.

      Furthermore, our work presents not just "a useful method" but a comprehensive framework. Our study proposes, for the first time, a framework for defining, extracting, and validating behaviorally relevant signals. In our current revision, to clearly distinguish between d-VAE and other methods, we have formalized the extraction of behaviorally relevant signals into a mathematical optimization problem. To our knowledge, current methods have not explicitly proposed extracting behaviorally relevant signals, nor have they identified and addressed the key challenges of extracting relevant signals. Similarly, existing research has not yet defined and validated behaviorally relevant signals. For more details, please refer to our response to Q1 from Reviewer #4.

      Based on these considerations, we respectfully request that you reconsider the eLife assessment of our work. We greatly appreciate your time and attention to this matter.

      The main revisions made to the manuscript are as follows:

      (1) We have formalized the extraction of behaviorally relevant signals into a mathematical optimization problem, enabling a clearer distinction between d-VAE and other models.

      (2) We have moderated the assertion about linear readout to highlight its conjectural nature and have broadened the discussion regarding this conclusion. 

      (3) We have elaborated on the model details of d-VAE and have removed the identifiability claim.

      To Reviewer #1

      Q1: “As reviewer 3 also points out, I would, however, caution to interpret this as evidence for linear read-out of the motor system - your model performs a non-linear transformation, and while this is indeed linearly decodable, the motor system would need to do something similar first to achieve the same. In fact to me it seems to show the opposite, that behaviour-related information may not be generally accessible to linear decoders (including to down-stream brain areas).”

      Thank you for your comments. It's important to note that the conclusions we draw are speculative and not definitive. We use terms like "suggest" to reflect this uncertainty. To further emphasize the conjectural nature of our conclusions, we have deliberately moderated our tone.

      The question of whether behaviorally-relevant signals can be accessed by linear decoders or downstream brain regions hinges on the debate over whether the brain employs a strategy of filtering before decoding. If the brain employs such a strategy, the brain can probably access these signals. In our opinion, it is likely that the brain utilizes this strategy.

      Given the existence of behaviorally relevant signals, it is reasonable to assume that the brain has intrinsic mechanisms to differentiate between relevant and irrelevant signals. There is growing evidence suggesting that the brain utilizes various mechanisms, such as attention and specialized filtering, to suppress irrelevant signals and enhance relevant signals [1-3]. Therefore, it is plausible that the brain filters before decoding, thereby effectively accessing behaviorally relevant signals.

      Thank you for your valuable feedback.

      (1) Sreenivasan, Sameet, and Ila Fiete. "Grid cells generate an analog error-correcting code for singularly precise neural computation." Nature neuroscience 14.10 (2011): 1330-1337.

      (2) Schneider, David M., Janani Sundararajan, and Richard Mooney. "A cortical filter that learns to suppress the acoustic consequences of movement." Nature 561.7723 (2018): 391-395.

      (3) Nakajima, Miho, L. Ian Schmitt, and Michael M. Halassa. "Prefrontal cortex regulates sensory filtering through a basal ganglia-to-thalamus pathway." Neuron 103.3 (2019): 445-458.

      Q2: “As in my initial review, I would also caution against making strong claims about identifiability although this work and TNDM seem to show that in practise such methods work quite well. CEBRA, in contrast, offers some theoretical guarantees, but it is not a generative model, so would not allow the type of analysis done in this paper. In your model there is a para,eter \alpha to balance between neural and behaviour reconstruction. This seems very similar to TNDM and has to be optimised - if this is correct, then there is manual intervention required to identify a good model.”

      Thank you for your comments. 

      Considering your concerns about our identifiability claims and the fact that identifiability is not directly relevant to the core of our paper, we have removed content related to identifiability.

      Firstly, our model is based on the pi-VAE, which also has theoretical guarantees. However, it is important to note that all such theoretical guarantees (including pi-VAE and CEBRA) are based on certain assumptions that cannot be validated as the true distribution of latent variables remains unknown.

      Secondly, it is important to clarify that the identifiability of latent variables does not impact the conclusions of this paper, nor does this paper make specific conclusions about the model's latent variables. Identifiability means that distinct latent variables correspond to distinct observations. If multiple latent variables can generate the same observation, it becomes impossible to determine which one is correct given the observation, which leads to the issue of nonidentifiability. Notably, our analysis focuses on the generated signals, not the latent variables themselves, and thus the identifiability of these variables does not affect our findings. 

      Our approach, dedicated to extracting these signals, distinctly differs from methods such as TNDM, which focuses on extracting behaviorally relevant latent dynamics. To clearly set apart d-VAE from other models, we have framed the extraction of behaviorally relevant signals as the following mathematical optimization problem:

      where 𝑥# denotes generated behaviorally-relevant signals, 𝑥 denotes raw noisy signals, 𝐸(⋅,⋅) demotes reconstruction loss, and 𝑅(⋅) denotes regularization loss. It is important to note that while both d-VAE and TNDM employ reconstruction loss, relying solely on this term is insufficient for determining the optimal degree of similarity between the generated and raw noisy signals. The key to accurately extracting behaviorally relevant signals lies in leveraging prior knowledge about these signals to determine the optimal similarity degree, encapsulated by 𝑅(𝒙𝒓).  Other studies have not explicitly proposed extracting behaviorally-relevant signals, nor have they identified and addressed the key challenges involved in extracting relevant signals. Consequently, our approach is distinct from other methods.

      Thank you for your valuable feedback.

      Q3: “Somewhat related, I also found that the now comprehensive comparison with related models shows that the using decoding performance (R2) as a metric for model comparison may be problematic: the R2 values reported in Figure 2 (e.g. the MC_RTT dataset) should be compared to the values reported in the neural latent benchmark, which represent well-tuned models (e.g. AutoLFADS). The numbers (difficult to see, a table with numbers in the appendix would be useful, see: https://eval.ai/web/challenges/challenge-page/1256/leaderboard) seem lower than what can be obtained with models without latent space disentanglement. While this does not necessarily invalidate the conclusions drawn here, it shows that decoding performance can depend on a variety of model choices, and may not be ideal to discriminate between models. I'm also surprised by the low neural R2 for LFADS I assume this is condition-averaged) - LFADS tends to perform very well on this metric.”

      Thank you for your comments. The dataset we utilized is not from the same day as the neural latent benchmark dataset. Notably, there is considerable variation in the length of trials within the RTT paradigm, and the dataset lacks explicit trial information, rendering trial-averaging unsuitable. Furthermore, behaviorally relevant signals are not static averages devoid of variability; even behavioral data exhibits variability. We computed the neural R2 using individual trials rather than condition-averaged responses. 

      Thank you for your valuable feedback.

      Q4: “One statement I still cannot follow is how the prior of the variational distribution is modelled. You say you depart from the usual Gaussian prior, but equation 7 seems to suggest there is a normal prior. Are the parameters of this distribution learned? As I pointed out earlier, I however suspect this may not matter much as you give the prior a very low weight. I also still am not sure how you generate a sample from the variational distribution, do you just draw one for each pass?”

      Thank you for your questions.

      The conditional distribution of prior latent variables 𝑝%(𝒛|𝒚) is a Gaussian distribution, but the distribution of prior latent variables 𝑝(𝒛) is a mixture Gaussian distribution. The distribution of prior latent variables 𝑝(𝒛) is:

      where denotes the empirical distribution of behavioral variables

      𝒚, and 𝑁 denotes the number of samples, 𝒚(𝒊) denotes the 𝒊th sample, δ(⋅) denotes the Dirac delta function, and 𝑝%(𝒛|𝒚) denotes the conditional distribution of prior latent variables given the behavioral variables parameterized by network 𝑚. Based on the above equation, we can see that 𝑝(𝒛) is not a Gaussian distribution, it is a Gaussian mixture model with 𝑁 components, which is theoretically a universal approximator of continuous probability densities.

      Learning this prior is important, as illustrated by our latent variable visualizations, which are not a Gaussian distribution. Upon conducting hypothesis testing for both latent variables and behavioral variables, neither conforms to Gaussian distribution (Lilliefors test and Kolmogorov-Smirnov test). Consequently, imposing a constraint on the latent variables towards N(0,1) is expected to affect performance adversely.

      Regarding sampling, during training process, we draw only one sample from the approximate posterior distribution . It is worth noting that drawing multiple samples or one sample for each pass does not affect the experimental results. After training, we can generate a sample from the prior by providing input behavioral data 𝒚(𝒊) and then generating corresponding samples via and . To extract behaviorally-relevant signals from raw signals, we use and .

      Thank you for your valuable feedback.

      Q5: “(1) I found the figures good and useful, but the text is, in places, not easy to follow. I think the manuscript could be shortened somewhat, and in some places more concise focussed explanations would improve readability.

      (2) I would not call the encoding "complex non-linear" - non-linear is a clear term, but complex can mean many things (e.g. is a quadratic function complex?) ”

      Thank you for your recommendation. We have revised the manuscript for enhanced clarity.  We call the encoding “complex nonlinear” because neurons encode information with varying degrees of nonlinearity, as illustrated in Fig. 3b, f, and Fig. S3b.

      Thank you for your valuable feedback.

      To Reviewer #2

      Q1: “I still remain unconvinced that the core findings of the paper are "unexpected". In the response to my previous Specific Comment #1, they say "We use the term 'unexpected' due to the disparity between our findings and the prior understanding concerning neural encoding and decoding." However, they provide no citations or grounding for why they make those claims. What prior understanding makes it unexpected that encoding is more complex than decoding given the entropy, sparseness, and high dimensionality of neural signals (the "encoding") compared to the smoothness and low dimensionality of typical behavioural signals (the "decoding")?” 

      Thank you for your comments. We believe that both the complexity of neural encoding and the simplicity of neural decoding in motor cortex are unexpected.

      The Complexity of Neural Encoding: As noted in the Introduction, neurons with small R2 values were traditionally considered noise and consequently disregarded, as detailed in references [1-3]. However, after filtering out irrelevant signals, we discovered that these neurons actually contain substantial amounts of behavioral information, previously unrecognized. Similarly, in population-level analyses, neural signals composed of small principal components (PCs) are often dismissed as noise, with analyses typically utilizing only between 6 and 18 PCs [4-10]. Yet, the discarded PC signals nonlinearly encode significant amounts of information, with practically useful dimensions found to range between 30 and 40—far exceeding the usual number analyzed. These findings underscore the complexity of neural encoding and are unexpected.

      The Simplicity of Neural Decoding: In the motor cortex, nonlinear decoding of raw signals has been shown to significantly outperform linear decoding, as evidenced in references [11,12]. Interestingly, after separating behaviorally relevant and irrelevant signals, we observed that the linear decoding performance of behaviorally relevant signals is nearly equivalent to that of nonlinear decoding—a phenomenon previously undocumented in the motor cortex. This discovery is also unexpected.

      Thank you for your valuable feedback.

      (1) Georgopoulos, Apostolos P., Andrew B. Schwartz, and Ronald E. Kettner. "Neuronal population coding of movement direction." Science 233.4771 (1986): 1416-1419.

      (2) Hochberg, Leigh R., et al. "Reach and grasp by people with tetraplegia using a neurally controlled robotic arm." Nature 485.7398 (2012): 372-375. 

      (3) Inoue, Yoh, et al. "Decoding arm speed during reaching." Nature communications 9.1 (2018): 5243.

      (4) Churchland, Mark M., et al. "Neural population dynamics during reaching." Nature 487.7405 (2012): 51-56.

      (5) Kaufman, Matthew T., et al. "Cortical activity in the null space: permitting preparation without movement." Nature neuroscience 17.3 (2014): 440-448.

      (6) Elsayed, Gamaleldin F., et al. "Reorganization between preparatory and movement population responses in motor cortex." Nature communications 7.1 (2016): 13239.

      (7) Sadtler, Patrick T., et al. "Neural constraints on learning." Nature 512.7515 (2014): 423426.

      (8) Golub, Matthew D., et al. "Learning by neural reassociation." Nature neuroscience 21.4 (2018): 607-616.

      (9) Gallego, Juan A., et al. "Cortical population activity within a preserved neural manifold underlies multiple motor behaviors." Nature communications 9.1 (2018): 4233.

      (10) Gallego, Juan A., et al. "Long-term stability of cortical population dynamics underlying consistent behavior." Nature neuroscience 23.2 (2020): 260-270.

      (11) Glaser, Joshua I., et al. "Machine learning for neural decoding." Eneuro 7.4 (2020).

      (12) Willsey, Matthew S., et al. "Real-time brain-machine interface in non-human primates achieves high-velocity prosthetic finger movements using a shallow feedforward neural network decoder." Nature Communications 13.1 (2022): 6899.

      Q2: “I still take issue with the premise that signals in the brain are "irrelevant" simply because they do not correlate with a fixed temporal lag with a particular behavioural feature handchosen by the experimenter. In the response to my previous review, the authors say "we employ terms like 'behaviorally-relevant' and 'behaviorally-irrelevant' only regarding behavioral variables of interest measured within a given task, such as arm kinematics during a motor control task.". This is just a restatement of their definition, not a response to my concern, and does not address my concern that the method requires a fixed temporal lag and continual decoding/encoding. My example of reward signals remains. There is a huge body of literature dating back to the 70s on the linear relationships between neural and activity and arm kinematics; in a sense, the authors have chosen the "variable of interest" that proves their point. This all ties back to the previous comment: this is mostly expected, not unexpected, when relating apparently-stochastic, discrete action potential events to smoothly varying limb kinematics.”

      Thank you for your comments. 

      Regarding the experimenter's specification of behavioral variables of interest, we followed common practice in existing studies [1, 2]. Regarding the use of fixed temporal lags, we followed the same practice as papers related to the dataset we use, which assume fixed temporal lags [3-5]. Furthermore, many studies in the motor cortex similarly use fixed temporal lags [68].

      Concerning the issue of rewards, in the paper you mentioned [9], the impact of rewards occurs after the reaching phase. It's important to note that in our experiments, we analyze only the reaching phase, without any post-movement phase. 

      If the impact of rewards can be stably reflected in the signals in the reaching phase of the subsequent trial, and if the reward-induced signals do not interfere with decoding—since these signals are harmless for decoding and beneficial for reconstruction—our model is likely to capture these signals. If the signals induced by rewards during the reaching phase are randomly unstable, our model will likely be unable to capture them.

      If the goal is to extract post-movement neural activity from both rewarded and unrewarded trials, and if the neural patterns differ between these conditions, one could replace the d-VAE's regression loss, used for continuous kinematics decoding, with a classification loss tailored to distinguish between rewarded and unrewarded conditions.

      To clarify the definition, we have revised it in the manuscript. Specifically, before a specific definition, we briefly introduce the relevant signals and irrelevant signals. Behaviorally irrelevant signals refer to those not directly associated with the behavioral variables of interest and may include noise or signals from variables of no interest. In contrast, behaviorally relevant signals refer to those directly related to the behavioral variables of interest. For instance, rewards in the post-movement phase are not directly related to behavioral variables (kinematics) in the reaching movement phase.

      It is important to note that our definition of behaviorally relevant signals not only includes decoding capabilities but also specific requirement at the signal level, based on two key requirements:

      (1) they should closely resemble raw signals to preserve the underlying neuronal properties without becoming so similar that they include irrelevant signals. (encoding requirement), and  (2) they should contain behavioral information as much as possible (decoding requirement). Signals that meet both requirements are considered effective behaviorally relevant signals. In our study, we assume raw signals are additively composed of behaviorally-relevant and irrelevant signals. We define irrelevant signals as those remaining after subtracting relevant signals from raw signals. Therefore, we believe our definition is clearly articulated. 

      Thank you for your valuable feedback.

      (1) Sani, Omid G., et al. "Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification." Nature Neuroscience 24.1 (2021): 140-149.

      (2) Buetfering, Christina, et al. "Behaviorally relevant decision coding in primary somatosensory cortex neurons." Nature neuroscience 25.9 (2022): 1225-1236.

      (3) Wang, Fang, et al. "Quantized attention-gated kernel reinforcement learning for brain– machine interface decoding." IEEE transactions on neural networks and learning systems 28.4 (2015): 873-886.

      (4) Dyer, Eva L., et al. "A cryptography-based approach for movement decoding." Nature biomedical engineering 1.12 (2017): 967-976.

      (5) Ahmadi, Nur, Timothy G. Constandinou, and Christos-Savvas Bouganis. "Robust and accurate decoding of hand kinematics from entire spiking activity using deep learning." Journal of Neural Engineering 18.2 (2021): 026011.

      (6) Churchland, Mark M., et al. "Neural population dynamics during reaching." Nature 487.7405 (2012): 51-56.

      (7) Kaufman, Matthew T., et al. "Cortical activity in the null space: permitting preparation without movement." Nature neuroscience 17.3 (2014): 440-448.

      (8) Elsayed, Gamaleldin F., et al. "Reorganization between preparatory and movement population responses in motor cortex." Nature communications 7.1 (2016): 13239.

      (9) Ramkumar, Pavan, et al. "Premotor and motor cortices encode reward." PloS one 11.8 (2016): e0160851.

      Q3: “The authors seem to have missed the spirit of my critique: to say "linear readout is performed in motor cortex" is an over-interpretation of what their model can show.”

      Thank you for your comments. It's important to note that the conclusions we draw are speculative and not definitive. We use terms like "suggest" to reflect this uncertainty. To further emphasize the conjectural nature of our conclusions, we have deliberately moderated our tone.

      The question of whether behaviorally-relevant signals can be accessed by downstream brain regions hinges on the debate over whether the brain employs a strategy of filtering before decoding. If the brain employs such a strategy, the brain can probably access these signals. In our view, it is likely that the brain utilizes this strategy.

      Given the existence of behaviorally relevant signals, it is reasonable to assume that the brain has intrinsic mechanisms to differentiate between relevant and irrelevant signals. There is growing evidence suggesting that the brain utilizes various mechanisms, such as attention and specialized filtering, to suppress irrelevant signals and enhance relevant signals [1-3]. Therefore, it is plausible that the brain filters before decoding, thereby effectively accessing behaviorally relevant signals.

      Regarding the question of whether the brain employs linear readout, given the limitations of current observational methods and our incomplete understanding of brain mechanisms, it is challenging to ascertain whether the brain employs a linear readout. In many cortical areas, linear decoders have proven to be sufficiently accurate. Consequently, numerous studies [4, 5, 6], including the one you referenced [4], directly employ linear decoders to extract information and formulate conclusions based on the decoding results. Contrary to these approaches, our research has compared the performance of linear and nonlinear decoders on behaviorally relevant signals and found their decoding performance is comparable. Considering both the decoding accuracy and model complexity, our results suggest that the motor cortex may utilize linear readout to decode information from relevant signals. Given the current technological limitations, we consider it reasonable to analyze collected data to speculate on the potential workings of the brain, an approach that many studies have also embraced [7-10]. For instance, a study [7] deduces strategies the brain might employ to overcome noise by analyzing the structure of recorded data and decoding outcomes for new stimuli.

      Thank you for your valuable feedback.

      (1) Sreenivasan, Sameet, and Ila Fiete. "Grid cells generate an analog error-correcting code for singularly precise neural computation." Nature neuroscience 14.10 (2011): 1330-1337.

      (2) Schneider, David M., Janani Sundararajan, and Richard Mooney. "A cortical filter that learns to suppress the acoustic consequences of movement." Nature 561.7723 (2018): 391-395.

      (3) Nakajima, Miho, L. Ian Schmitt, and Michael M. Halassa. "Prefrontal cortex regulates sensory filtering through a basal ganglia-to-thalamus pathway." Neuron 103.3 (2019): 445-458.

      (4) Jurewicz, Katarzyna, et al. "Irrational choices via a curvilinear representational geometry for value." bioRxiv (2022): 2022-03.

      (5) Hong, Ha, et al. "Explicit information for category-orthogonal object properties increases along the ventral stream." Nature neuroscience 19.4 (2016): 613-622.

      (6) Chang, Le, and Doris Y. Tsao. "The code for facial identity in the primate brain." Cell 169.6 (2017): 1013-1028.

      (7) Ganmor, Elad, Ronen Segev, and Elad Schneidman. "A thesaurus for a neural population code." Elife 4 (2015): e06134.

      (8) Churchland, Mark M., et al. "Neural population dynamics during reaching." Nature 487.7405 (2012): 51-56.

      (9) Gallego, Juan A., et al. "Cortical population activity within a preserved neural manifold underlies multiple motor behaviors." Nature communications 9.1 (2018): 4233.

      (10) Gallego, Juan A., et al. "Long-term stability of cortical population dynamics underlying consistent behavior." Nature neuroscience 23.2 (2020): 260-270.

      Q4: “Agreeing with my critique is not sufficient; please provide the data or simulations that provides the context for the reference in the fano factor. I believe my critique is still valid.”

      Thank you for your comments. As we previously replied, Churchland's research examines the variability of neural signals across different stages, including the preparation and execution phases, as well as before and after the target appears. Our study, however, focuses exclusively on the movement execution phase. Consequently, we are unable to produce comparative displays similar to those in his research. Intuitively, one might expect that the variability of behaviorally relevant signals would be lower; however, since no prior studies have accurately extracted such signals, the specific FF values of behaviorally relevant signals remain unknown. Therefore, presenting these values is meaningful, and can provide a reference for future research. While we cannot compare FF across different stages, we can numerically compare the values to the Poisson count process. An FF of 1 indicates a Poisson firing process, and our experimental data reveals that most neurons have an FF less than 1, indicating that the variance in firing counts is below the mean.  Thank you for your valuable feedback.

      To Reviewer #4

      Q1: “Overall, studying neural computations that are behaviorally relevant or not is an important problem, which several previous studies have explored (for example PSID in (Sani et al. 2021), TNDM in (Hurwitz et al. 2021), TAME-GP in (Balzani et al. 2023), pi-VAE in (Zhou and Wei 2020), and dPCA in (Kobak et al. 2016), etc). However, this manuscript does not properly put their work in the context of such prior works. For example, the abstract states "One solution is to accurately separate behaviorally-relevant and irrelevant signals, but this approach remains elusive", which is not the case given that these prior works have done that. The same is true for various claims in the main text, for example "Furthermore, we found that the dimensionality of primary subspace of raw signals (26, 64, and 45 for datasets A, B, and C) is significantly higher than that of behaviorally-relevant signals (7, 13, and 9), indicating that using raw signals to estimate the neural dimensionality of behaviors leads to an overestimation" (line 321). This finding was presented in (Sani et al. 2021) and (Hurwitz et al. 2021), which is not clarified here. This issue of putting the work in context has been brought up by other reviewers previously but seems to remain largely unaddressed. The introduction is inaccurate also in that it mixes up methods that were designed for separation of behaviorally relevant information with those that are unsupervised and do not aim to do so (e.g., LFADS). The introduction should be significantly revised to explicitly discuss prior models/works that specifically formulated this behavior separation and what these prior studies found, and how this study differs.”  

      Thank you for your comments. Our statement about “One solution is to accurately separate behaviorally-relevant and irrelevant signals, but this approach remains elusive” is accurate. To our best knowledge, there is no prior works to do this work--- separating accurate behaviorally relevant neural signals at both single-neuron and single-trial resolution. The works you mentioned have not explicitly proposed extracting behaviorally relevant signals, nor have they identified and addressed the key challenges of extracting relevant signals, namely determining the optimal degree of similarity between the generated relevant signals and raw signals. Those works focus on the latent neural dynamics, rather than signal level.

      To clearly set apart d-VAE from other models, we have framed the extraction of behaviorally relevant signals as the following mathematical optimization problem:

      where 𝒙𝒓 denotes generated behaviorally-relevant signals, 𝒙 denotes raw noisy signals, 𝐸(⋅,⋅) demotes reconstruction loss, and 𝑅(⋅) denotes regularization loss. It is important to note that while both d-VAE and TNDM employ reconstruction loss, relying solely on this term is insufficient for determining the optimal degree of similarity between the generated and raw noisy signals. The key to accurately extracting behaviorally relevant signals lies in leveraging prior knowledge about these signals to determine the optimal similarity degree, encapsulated by 𝑅(𝒙𝒓). All the works you mentioned did not have the key part 𝑅(𝒙𝒓).

      Regarding the dimensionality estimation, the dimensionality of neural manifolds quantifies the degrees of freedom required to describe population activity without significant information loss.

      There are two differences between our work and PSID and TNDM. 

      First, the dimensions they refer to are fundamentally different from ours. The dimensionality we describe pertains to a linear subspace, where a neural dimension or neural mode or principal component basis, , with N representing the number of neurons. However, the vector length of a neural mode of PSID and our approach differs; PSID requires concatenating multiple time steps T, essentially making , TNDM, on the other hand, involves nonlinear dimensionality reduction, which is different from linear dimensionality reduction.

      Second, we estimate neural dimensionality by explaining the variance of neural signals, whereas PSID and TNDM determine dimensionality through decoding performance saturation. It is important to note that the dimensionality at which decoding performance saturates may not accurately reflect the true dimensionality of neural manifolds, as some dimensions may contain redundant information that does not enhance decoding performance.

      We acknowledge that while LFADS can generate signals that contain some behavioral information, it was not specifically designed to do so. Following your suggestion, we have removed this reference from the Introduction.

      Thank you for your valuable feedback.

      Q2: “Claims about linearity of "motor cortex" readout are not supported by results yet stated even in the abstract. Instead, what the results support is that for decoding behavior from the output of the dVAE model -- that is trained specifically to have a linear behavior readout from its embedding -- a nonlinear readout does not help. This result can be biased by the very construction of the dVAE's loss that encourages a linear readout/decoding from embeddings, and thus does not imply a finding about motor cortex.”

      Thank you for your comments. We respectfully disagree with the notion that the ability of relevant signals to be linearly decoded is due to constraints that allow embedding to be linearly decoded. Embedding involves reorganizing or transforming the structure of original signals, and they can be linearly decoded does not mean the corresponding signals can be decoded linearly.

      Let's clarify this with three intuitive examples:

      Example 1: Image denoising is a well-established field. Whether employing supervised or blind denoising methods [1, 2], both can effectively recover the original image. This denoising process closely resembles the extraction of behaviorally relevant signals from raw signals. Consider if noisy images are not amenable to linear decoding (classification); would removing the noise enable linear decoding? The answer is no. Typically, the noise in images captured under normal conditions is minimal, yet even the clear images remain challenging to decode linearly.

      Example 2: Consider the task of face recognition, where face images are set against various backgrounds, in this context, the pixels representing the face corresponds to relevant signals, while the background pixels are considered irrelevant. Suppose a network is capable of extracting the face pixels and the resulting embedding can be linearly decoded. Can the face pixels themselves be linearly decoded? The answer is no. If linear decoding of face pixels were feasible, the challenging task of face recognition could be easily resolved by merely extracting the face from the background and training a linear classifier.

      Example 3: In the MNIST dataset, the background is uniformly black, and its impact is minimal. However, linear SVM classifiers used directly on the original pixels significantly underperform compared to non-linear SVMs.

      In summary, embedding involves reorganizing the structure of the original signals through a feature transformation function. However, the reconstruction process can recover the structure of the original signals from the embedding. The fact that the structure of the embedding can be linearly decoded does not imply that the structure of the original signals can be linearly decoded in the same way. It is inappropriate to focus on the compression process without equally considering the reconstruction process.

      Thank you for your valuable feedback.

      (1) Mao, Xiao-Jiao, Chunhua Shen, and Yu-Bin Yang. "Image restoration using convolutional auto-encoders with symmetric skip connections." arXiv preprint arXiv:1606.08921 (2016).

      (2) Lehtinen, Jaakko, et al. "Noise2Noise: Learning image restoration without clean data." International Conference on Machine Learning. International Machine Learning Society, 2018.

      Q3: “Related to the above, it is unclear what the manuscript means by readout from motor cortex. A clearer definition of "readout" (a mapping from what to what?) in general is needed. The mapping that the linearity/nonlinearity claims refer to is from the *inferred* behaviorally relevant neural signals, which themselves are inferred nonlinearly using the VAE. This should be explicitly clarified in all claims, i.e., that only the mapping from distilled signals to behavior is linear, not the whole mapping from neural data to behavior. Again, to say the readout from motor cortex is linear is not supported, including in the abstract.” 

      Thank you for your comments. We have revised the manuscript to make it more clearly. Thank you for your valuable feedback.

      Q4: “Claims about individual neurons are also confounded. The d-VAE distilling processing is a population level embedding so the individual distilled neurons are not obtainable on their own without using the population data. This population level approach also raises the possibility that information can leak from one neuron to another during distillation, which is indeed what the authors hope would recover true information about individual neurons that wasn't there in the recording (the pixel denoising example). The authors acknowledge the possibility that information could leak to a neuron that didn't truly have that information and try to rule it out to some extent with some simulations and by comparing the distilled behaviorally relevant signals to the original neural signals. But ultimately, the distilled signals are different enough from the original signals to substantially improve decoding of low information neurons, and one cannot be sure if all of the information in distilled signals from any individual neuron truly belongs to that neuron. It is still quite likely that some of the improved behavior prediction of the distilled version of low-information neurons is due to leakage of behaviorally relevant information from other neurons, not the former's inherent behavioral information. This should be explicitly acknowledged in the manuscript.”

      Thank you for your comments. We value your insights regarding the mixing process. However, we are confident in the robustness of our conclusions. We respectfully disagree with the notion that the small R2 values containing significant information are primarily due to leakage, and we base our disagreement on four key reasons.

      (1) Neural reconstruction performance is a reliable and valid criterion.

      The purpose of latent variable models is to explain neuronal activity as much as possible. Given the fact that the ground truth of behaviorally-relevant signals, the latent variables, and the generative model is unknow, it becomes evident that the only reliable reference at the signal level is the raw signals. A crucial criterion for evaluating the reliability of latent variable models (including latent variables and generated relevant signals) is their capability to effectively explain the raw signals [1]. Consequently, we firmly maintain the belief that if the generated signals closely resemble the raw signals to the greatest extent possible, in accordance with an equivalence principle, we can claim that these obtained signals faithfully retain the inherent properties of single neurons. 

      Reviewer #4 appears to focus on the compression (mixing) process without giving equal consideration to the reconstruction (de-mixing) process. Numerous studies have demonstrated that deep autoencoders can reconstruct the original signal very effectively. For example, in the field of image denoising, autoencoders are capable of accurately restoring the original image [2, 3]. If one persistently focuses on the fact of mixing and ignores the reconstruction (demix) process, even if the only criterion that we can rely on at the signal level is high, one still won't acknowledge it. If this were the case, many problems would become unsolvable. For instance, a fundamental criterion for latent variable models is their ability to explain the original data. If the ground truth of the latent variables remains unknown and the reconstruction criterion is disregarded, how can we validate the effectiveness of the model, the validity of the latent variables, or ensure that findings related to latent variables are not merely by-products of the model? Therefore, we disagree with the aforementioned notion. We believe that as long as the reconstruction performance is satisfactory, the extracted signals have successfully retained the characteristics of individual neurons.

      In our paper, we have shown in various ways that our generated signals sufficiently resemble the raw signals, including visualizing neuronal activity (Fig. 2m, Fig. 3i, and Fig. S5), achieving the highest performance among competitors (Fig. 2d, h, l), and conducting control analyses. Therefore, we believe our results are reliable. 

      (1) Cunningham, J.P. and Yu, B.M., 2014. Dimensionality reduction for large-scale neural recordings. Nature neuroscience, 17(11), pp.1500-1509.

      (2) Mao, Xiao-Jiao, Chunhua Shen, and Yu-Bin Yang. "Image restoration using convolutional auto-encoders with symmetric skip connections." arXiv preprint arXiv:1606.08921 (2016).

      (3) Lehtinen, Jaakko, et al. "Noise2Noise: Learning image restoration without clean data." International Conference on Machine Learning. International Machine Learning Society, 2018.

      (2) There is no reason for d-VAE to add signals that do not exist in the original signals.

      (1) Adding signals that does not exist in the small R2 neurons would decrease the reconstruction performance. This is because if the added signals contain significant information, they will not resemble the irrelevant signals which contain no information, and thus, the generated signals will not resemble the raw signals. The model optimizes towards reducing the reconstruction loss, and this scenario deviates from the model's optimization direction. It is worth mentioning that when the model only has reconstruction loss without the interference of decoding loss, we believe that information leakage does not happen. Because the model can only be optimized in a direction that is similar to the raw signals; adding non-existent signals to the generated signals would increase the reconstruction loss, which is contrary to the objective of optimization. 

      (2) Information carried by these additional signals is redundant for larger R2 neurons, thus they do not introduce new information that can enhance the decoding performance of the neural population, which does not benefit the decoding loss.

      Based on these two points, we believe the model would not perform such counterproductive and harmful operations.

      (3) The criterion that irrelevant signals should contain minimal information can effectively rule out the leakage scenario.

      The criterion that irrelevant signals should contain minimal information is very important, but it seems that reviewer #4 has continuously overlooked their significance. If the model's reconstruction is insufficient, or if additional information is added (which we do not believe will happen), the residuals would decode a large amount of information, and this criterion would exclude selecting such signals. To clarify, if we assume that x, y, and z denote the raw, relevant, and irrelevant signals of smaller R2 neurons, with x=y+z, and the extracted relevant signals become y+m, the irrelevant signals become z-m in this case. Consequently, the irrelevant signals contain a significant amount of information.

      We presented the decoding R2 for irrelevant signals in real datasets under three distillation scenarios: a bias towards reconstruction (alpha=0, an extreme case where the model only has reconstruction loss without decoding loss), a balanced trade-off, and a bias towards decoding (alpha=0.9), as detailed in Table 1. If significant information from small R2 neurons leaks from large R2 neurons, the irrelevant signals should contain a large amount of information. However, our results indicate that the irrelevant signals contain only minimal information, and their performance closely resembles that of the model training solely with reconstruction loss, showing no significant differences (P > 0.05, Wilcoxon rank-sum test). When the model leans towards decoding, some useful information will be left in the residuals, and irrelevant signals will contain a substantial amount of information, as observed in Table 1, alpha=0.9. Therefore, we will not choose these signals for analysis.

      In conclusion, the criterion that irrelevant signals should contain minimal information is a very effective measure to exclude undesirable signals.

      Author response table 1.

      Decoding R2 of irrelevant signals

      (4) Synthetic experiments can effectively rule out the leakage scenario.

      In the absence of ground truth data, synthetic experiments serve as an effective method for validating models and are commonly employed [1-3]. 

      Our experimental results demonstrate that d-VAE can effectively extract neural signals that more closely resemble actual behaviorally relevant signals (Fig. S2g).  If there were information leakage, it would decrease the similarity to the ground truth signals, hence we have ruled out this possibility. Moreover, in synthetic experiments with small R2 neurons (Fig. S10), results also demonstrate that our model could make these neurons more closely resemble ground truth relevant signals and recover their information. 

      In summary, synthetic experiments strongly demonstrate that our model can recover obscured neuronal information, rather than adding signals that do not exist.

      (1) Pnevmatikakis, Eftychios A., et al. "Simultaneous denoising, deconvolution, and demixing of calcium imaging data." Neuron 89.2 (2016): 285-299.

      (2) Schneider, Steffen, Jin Hwa Lee, and Mackenzie Weygandt Mathis. "Learnable latent embeddings for joint behavioural and neural analysis." Nature 617.7960 (2023): 360-368.

      (3) Zhou, Ding, and Xue-Xin Wei. "Learning identifiable and interpretable latent models of high-dimensional neural activity using pi-VAE." Advances in Neural Information Processing Systems 33 (2020): 7234-7247.

      Based on these four points, we are confident in the reliability of our results. If Reviewer #4 considers these points insufficient, we would highly appreciate it if specific concerns regarding any of these aspects could be detailed.

      Thank you for your valuable feedback.

      Q5: “Given the nuances involved in appropriate comparisons across methods and since two of the datasets are public, the authors should provide their complete code (not just the dVAE method code), including the code for data loading, data preprocessing, model fitting and model evaluation for all methods and public datasets. This will alleviate concerns and allow readers to confirm conclusions (e.g., figure 2) for themselves down the line.”

      Thanks for your suggestion.

      Our codes are now available on GitHub at https://github.com/eric0li/d-VAE. Thank you for your valuable feedback.

      Q6: “Related to 1) above, the authors should explore the results if the affine network h(.) (from embedding to behavior) was replaced with a nonlinear ANN. Perhaps linear decoders would no longer be as close to nonlinear decoders. Regardless, the claim of linearity should be revised as described in 1) and 2) above, and all caveats should be discussed.”

      Thank you for your suggestion. We appreciate your feasible proposal that can be empirically tested. Following your suggestion, we have replaced the decoding of the latent variable z to behavior y with a nonlinear neural network, specifically a neural network with a single hidden layer. The modified model is termed d-VAE2. We applied the d-VAE2 to the real data, and selected the optimal alpha through the validation set. As shown in Table 1, results demonstrate that the performance of KF and ANN remains comparable. Therefore, the capacity to linearly decode behaviorally relevant signals does not stem from the linear decoding of embeddings.

      Author response table 2.

      Decoding R2 of behaviorally relevant signals obtained by d-VAE2

      Additionally, it is worth noting that this approach is uncommon and is considered somewhat inappropriate according to the Information Bottleneck theory [1]. According to the Information Bottleneck theory, information is progressively compressed in multilayer neural networks, discarding what is irrelevant to the output and retaining what is relevant. This means that as the number of layers increases, the mutual information between each layer's embedding and the model input gradually decreases, while the mutual information between each layer's embedding and the model output gradually increases. For the decoding part, if the embeddings that is not closest to the output (behaviors) is used, then these embeddings might contain behaviorally irrelevant signals. Using these embeddings to generate behaviorally relevant signals could lead to the inclusion of irrelevant signals in the behaviorally relevant signals.

      To demonstrate the above statement, we conducted experiments on the synthetic data. As shown in Table 2, we present the performance (neural R2 between the generated signals and the ground truth signals) of both models at several alpha values around the optimal alpha of dVAE (alpha=0.9) selected by the validation set. The experimental results show that at the same alpha value, the performance of d-VAE2 is consistently inferior to that of d-VAE, and d-VAE2 requires a higher alpha value to achieve performance comparable to d-VAE, and the best performance of d-VAE2 is inferior to that of d-VAE.

      Author response table 3.

      Neural R2 between generated signals and real behaviorally relevant signals

      Thank you for your valuable feedback.

      (1) Shwartz-Ziv, Ravid, and Naftali Tishby. "Opening the black box of deep neural networks via information." arXiv preprint arXiv:1703.00810 (2017).

      Q7: “The beginning of the section on the "smaller R2 neurons" should clearly define what R2 is being discussed. Based on the response to previous reviewers, this R2 "signifies the proportion of neuronal activity variance explained by the linear encoding model, calculated using raw signals". This should be mentioned and made clear in the main text whenever this R2 is referred to.”

      Thank you for your suggestion. We have made the modifications in the main text. Thank you for your valuable feedback.

      Q8: “Various terms require clear definitions. The authors sometimes use vague terminology (e.g., "useless") without a clear definition. Similarly, discussions regarding dimensionality could benefit from more precise definitions. How is neural dimensionality defined? For example, how is "neural dimensionality of specific behaviors" (line 590) defined? Related to this, I agree with Reviewer 2 that a clear definition of irrelevant should be mentioned that clarifies that relevance is roughly taken as "correlated or predictive with a fixed time lag". The analyses do not explore relevance with arbitrary time lags between neural and behavior data.”

      Thanks for your suggestion. We have removed the “useless” statements and have revised the statement of “the neural dimensionality of specific behaviors” in our revised manuscripts.

      Regarding the use of fixed temporal lags, we followed the same practice as papers related to the dataset we use, which assume fixed temporal lags [1-3]. Furthermore, many studies in the motor cortex similarly use fixed temporal lags [4-6]. To clarify the definition, we have revised the definition in our manuscript. For details, please refer to the response to Q2 of reviewer #2 and our revised manuscript. We believe our definition is clearly articulated.

      Thank you for your valuable feedback.

      (1) Wang, Fang, et al. "Quantized attention-gated kernel reinforcement learning for brain– machine interface decoding." IEEE transactions on neural networks and learning systems 28.4 (2015): 873-886.

      (2) Dyer, Eva L., et al. "A cryptography-based approach for movement decoding." Nature biomedical engineering 1.12 (2017): 967-976.

      (3) Ahmadi, Nur, Timothy G. Constandinou, and Christos-Savvas Bouganis. "Robust and accurate decoding of hand kinematics from entire spiking activity using deep learning." Journal of Neural Engineering 18.2 (2021): 026011.

      (4) Churchland, Mark M., et al. "Neural population dynamics during reaching." Nature 487.7405 (2012): 51-56.

      (5) Kaufman, Matthew T., et al. "Cortical activity in the null space: permitting preparation without movement." Nature neuroscience 17.3 (2014): 440-448.

      (6) Elsayed, Gamaleldin F., et al. "Reorganization between preparatory and movement population responses in motor cortex." Nature communications 7.1 (2016): 13239. 

      Q9: “CEBRA itself doesn't provide a neural reconstruction from its embeddings, but one could obtain one via a regression from extracted CEBRA embeddings to neural data. In addition to decoding results of CEBRA (figure S3), the neural reconstruction of CEBRA should be computed and CEBRA should be added to Figure 2 to see how the behaviorally relevant and irrelevant signals from CEBRA compare to other methods.”

      Thank you for your question. Modifying CEBRA is beyond the scope of our work. As CEBRA is not a generative model, it cannot obtain behaviorally relevant and irrelevant signals, and therefore it lacks the results presented in Fig. 2. To avoid the same confusion encountered by reviewers #3 and #4 among our readers, we have opted to exclude the comparison with CEBRA. It is crucial to note, as previously stated, that our assessment of decoding capabilities has been benchmarked against the performance of the ANN on raw signals, which almost represents the upper limit of performance. Consequently, omitting CEBRA does not affect our conclusions.

      Thank you for your valuable feedback.

      Q10: “Line 923: "The optimal hyperparameter is selected based on the lowest averaged loss of five-fold training data." => why is this explained specifically under CEBRA? Isn't the same criteria used for hyperparameters of other methods? If so, clarify.”

      Thank you for your question. The hyperparameter selection for CEBRA follows the practice of the original CEBRA paper. The hyperparameter selection for generative models is detailed in the Section “The strategy for selecting effective behaviorally-relevant signals”.  Thank you for your valuable feedback.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      In this paper, the authors evaluate the utility of brain age derived metrics for predicting cognitive decline by performing a 'commonality' analysis in a downstream regression that enables the different contribution of different predictors to be assessed. The main conclusion is that brain age derived metrics do not explain much additional variation in cognition over and above what is already explained by age. The authors propose to use a regression model trained to predict cognition ('brain cognition') as an alternative suited to applications of cognitive decline. While this is less accurate overall than brain age, it explains more unique variance in the downstream regression.  

      Importantly, in this revision, we clarified that we did not intend to use Brain Cognition as an alternative approach. This is because, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. 

      REVISED VERSION: while the authors have partially addressed my concerns, I do not feel they have addressed them all. I do not feel they have addressed the weight instability and concerns about the stacked regression models satisfactorily.

      Please see our responses to Reviewer #1 Public Review #3 below

      I also must say that I agree with Reviewer 3 about the limitations of the brain age and brain cognition methods conceptually. In particular that the regression model used to predict fluid cognition will by construction explain more variance in cognition than a brain age model that is trained to predict age. This suffers from the same problem the authors raise with brain age and would indeed disappear if the authors had a separate measure of cognition against which to validate and were then to regress this out as they do for age correction. I am aware that these conceptual problems are more widespread than this paper alone (in fact throughout the brain age literature), so I do not believe the authors should be penalised for that. However, I do think they can make these concerns more explicit and further tone down the comments they make about the utility of brain cognition. I have indicated the main considerations about these points in the recommendations section below. 

      Thank you so much for raising this point. We now have the following statement in the introduction and discussion to address this concern (see below). 

      Briefly, we made it explicit that, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. That is, the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. More importantly, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And this is the third goal of this present study. 

      From Introduction:

      “Third and finally, certain variation in fluid cognition is related to brain MRI, but to what extent does Brain Age not capture this variation? To estimate the variation in fluid cognition that is related to the brain MRI, we could build prediction models that directly predict fluid cognition (i.e., as opposed to chronological age) from brain MRI data. Previous studies found reasonable predictive performances of these cognition-prediction models, built from certain MRI modalities (Dubois et al., 2018; Pat, Wang, Anney, et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). Analogous to Brain Age, we called the predicted values from these cognition-prediction models, Brain Cognition. The strength of an out-of-sample relationship between Brain Cognition and fluid cognition reflects variation in fluid cognition that is related to the brain MRI and, therefore, indicates the upper limit of Brain Age’s capability in capturing fluid cognition. This is, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Consequently, if we included Brain Cognition, Brain Age and chronological age in the same model to explain fluid cognition, we would be able to examine the unique effects of Brain Cognition that explain fluid cognition beyond Brain Age and chronological age. These unique effects of Brain Cognition, in turn, would indicate the amount of co-variation between brain MRI and fluid cognition that is missed by Brain Age.”

      From Discussion:

      “Third, by introducing Brain Cognition,  we showed the extent to which Brain Age indices were not able to capture the variation in fluid cognition that is related to brain MRI. More specifically, using Brain Cognition allowed us to gauge the variation in fluid cognition that is related to the brain MRI, and thereby, to estimate the upper limit of what Brain Age can do. Moreover, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      From our results, Brain Cognition, especially from certain cognition-prediction models such as the stacked models, has relatively good predictive performance, consistent with previous studies (Dubois et al., 2018; Pat, Wang, Anney, et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). We then examined Brain Cognition using commonality analyses (Nimon et al., 2008) in multiple regression models having a Brain Age index, chronological age and Brain Cognition as regressors to explain fluid cognition. Similar to Brain Age indices, Brain Cognition exhibited large common effects with chronological age. But more importantly, unlike Brain Age indices, Brain Cognition showed large unique effects, up to around 11%. As explained above, the unique effects of Brain Cognition indicated the amount of co-variation between brain MRI and fluid cognition that was missed by a Brain Age index and chronological age. This missing amount was relatively high, considering that Brain Age and chronological age together explained around 32% of the total variation in fluid cognition. Accordingly, if a Brain Age index was used as a biomarker along with chronological age, we would have missed an opportunity to improve the performance of the model by around one-third of the variation explained.” 

      This is a reasonably good paper and the use of a commonality analysis is a nice contribution to understanding variance partitioning across different covariates. I have some comments that I believe the authors ought to address, which mostly relate to clarity and interpretation 

      Reviewer #1 Public Review #1

      First, from a conceptual point of view, the authors focus exclusively on cognition as a downstream outcome. I would suggest the authors nuance their discussion to provide broader considerations of the utility of their method and on the limits of interpretation of brain age models more generally. 

      Thank you for your comments on this issue. 

      We now discussed the broader consideration in detail:

      (1) the consistency between our findings on fluid cognition and other recent works on brain disorders, 

      (2) the difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie, Kaufmann, et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021)

      and 

      (3) suggested solutions we and others made to optimise the utility of Brain Age for both cognitive functioning and brain disorders.

      From Discussion:

      “This discrepancy between the predictive performance of age-prediction models and the utility of Brain Age indices as a biomarker is consistent with recent findings (for review, see Jirsaraie, Gorelik, et al., 2023), both in the context of cognitive functioning (Jirsaraie, Kaufmann, et al., 2023) and neurological/psychological disorders (Bashyam et al., 2020; Rokicki et al., 2021). For instance,  combining different MRI modalities into the prediction models, similar to our stacked models, ocen leads to the highest performance of age prediction models, but does not likely explain the highest variance across different phenotypes, including cognitive functioning and beyond (Jirsaraie, Gorelik, et al., 2023).”

      “There is a notable difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie, Kaufmann, et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021). We consider the former as a normative type of study and the lader as a case-control type of study (Insel et al., 2010; Marquand et al., 2016). Those case-control Brain Age studies focusing on neurological/psychological disorders often build age-prediction models from MRI data of largely healthy participants (e.g., controls in a case-control design or large samples in a population-based design), apply the built age-prediction models to participants without vs. with neurological/psychological disorders and compare Brain Age indices between the two groups. On the one hand, this means that case-control studies treat Brain Age as a method to detect anomalies in the neurological/psychological group (Hahn et al., 2021). On the other hand, this also means that case-control studies have to ignore underfided models when applied prediction models built from largely healthy participants to participants with neurological/psychological disorders (i.e., Brain Age may predict chronological age well for the controls, but not for those with a disorder). On the contrary, our study and other normative studies focusing on cognitive functioning often build age prediction models from MRI data of largely healthy participants and apply the built age prediction models to participants who are also largely healthy. Accordingly, the age prediction models for explaining cognitive functioning in normative studies, while not allowing us to detect group-level anomalies, do not suffer from being under-fided. This unfortunately might limit the generalisability of our study into just the normative type of study. Future work is still needed to test the utility of brain age in the case-control case.”

      “Next, researchers should not select age-prediction models based solely on age-prediction performance. Instead, researchers could select age-prediction models that explained phenotypes of interest the best. Here we selected age-prediction models based on a set of features (i.e., modalities) of brain MRI. This strategy was found effective not only for fluid cognition as we demonstrated here, but also for neurological and psychological disorders as shown elsewhere (Jirsaraie, Gorelik, et al., 2023; Rokicki et al., 2021). Rokicki and colleagues (2021), for instance, found that, while integrating across MRI modalities led to age prediction models with the highest age-prediction performance, using only T1 structural MRI gave age-prediction models that were better at classifying Alzheimer’s disease. Similarly, using only cerebral blood flow gave age-prediction models that were better at classifying mild/subjective cognitive impairment, schizophrenia and bipolar disorder. 

      As opposed to selecting age-prediction models based on a set of features, researchers could also select age-prediction models based on modelling methods. For instance, Jirsaraie and colleagues (2023) compared gradient tree boosting (GTB) and deep-learning brain network (DBN) algorithms in building age-prediction models. They found GTB to have higher age prediction performance but DBN to have better utility in explaining cognitive functioning. In this case, an algorithm with better utility (e.g., DBN) should be used for explaining a phenotype of interest. Similarly, Bashyam and colleagues (2020) built different DBN-based age-prediction models, varying in age-prediction performance. The DBN models with a higher number of epochs corresponded to higher age-prediction performance. However, DBN-based age-prediction models with a moderate (as opposed to higher or lower) number of epochs were better at classifying Alzheimer’s disease, mild cognitive impairment and schizophrenia. In this case, a model from the same algorithm with better utility (e.g., those DBN with a moderate epoch number) should be used for explaining a phenotype of interest.

      Accordingly, this calls for a change in research practice, as recently pointed out by Jirasarie and colleagues (2023, p7), “Despite mounting evidence, there is a persisting assumption across several studies that the most accurate brain age models will have the most potential for detecting differences in a given phenotype of interest”. Future neuroimaging research should aim to build age-prediction models that are not necessarily good at predicting age, but at capturing phenotypes of interest.”

      Reviewer #1 Public Review #2

      Second, from a methods perspective, there is not a sufficient explanation of the methodological procedures in the current manuscript to fully understand how the stacked regression models were constructed. I would request that the authors provide more information to enable the reader to beUer understand the stacked regression models used to ensure that these models are not overfit. 

      Thank you for allowing us an opportunity to clarify our stacked model. We made additional clarification to make this clearer (see below). We wanted to confirm that we did not use test sets to build a stacked model in both lower and higher levels of the Elastic Net models. Test sets were there just for testing the performance of the models.  

      From Methods:

      “We used nested cross-validation (CV) to build these prediction models (see Figure 7). We first split the data into five outer folds, leaving each outer fold with around 100 participants. This number of participants in each fold is to ensure the stability of the test performance across folds. In each outer-fold CV loop, one of the outer folds was treated as an outer-fold test set, and the rest was treated as an outer-fold training set. Ultimately, looping through the nested CV resulted in a) prediction models from each of the 18 sets of features as well as b) prediction models that drew information across different combinations of the 18 separate sets, known as “stacked models.” We specified eight stacked models: “All” (i.e., including all 18 sets of features),  “All excluding Task FC”, “All excluding Task Contrast”, “Non-Task” (i.e., including only Rest FC and sMRI), “Resting and Task FC”, “Task Contrast and FC”, “Task Contrast” and “Task FC”. Accordingly, there were 26 prediction models in total for both Brain Age and Brain Cognition.

      To create these 26 prediction models, we applied three steps for each outer-fold loop. The first step aimed at tuning prediction models for each of 18 sets of features. This step only involved the outer-fold training set and did not involve the outer-fold test set. Here, we divided the outer-fold training set into five inner folds and applied inner-fold CV to tune hyperparameters with grid search. Specifically, in each inner-fold CV, one of the inner folds was treated as an inner-fold validation set, and the rest was treated as an inner-fold training set. Within each inner-fold CV loop, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters and applied the estimated model to the inner-fold validation set. Acer looping through the inner-fold CV, we, then, chose the prediction models that led to the highest performance, reflected by coefficient of determination (R2), on average across the inner-fold validation sets. This led to 18 tuned models, one for each of the 18 sets of features, for each outer fold.

      The second step aimed at tuning stacked models. Same as the first step, the second step only involved the outer-fold training set and did not involve the outer-fold test set. Here, using the same outer-fold training set as the first step, we applied tuned models, created from the first step, one from each of the 18 sets of features, resulting in 18 predicted values for each participant. We, then, re-divided this outer-fold training set into new five inner folds. In each inner fold, we treated different combinations of the 18 predicted values from separate sets of features as features to predict the targets in separate “stacked” models. Same as the first step, in each inner-fold CV loop, we treated one out of five inner folds as an inner-fold validation set, and the rest as an inner-fold training set. Also as in the first step, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters from our grid. We tuned the hyperparameters of stacked models using grid search by selecting the models with the highest R2 on average across the inner-fold validation sets. This led to eight tuned stacked models.

      The third step aimed at testing the predictive performance of the 18 tuned prediction models from each of the set of features, built from the first step, and eight tuned stacked models, built from the second step. Unlike the first two steps, here we applied the already tuned models to the outer-fold test set. We started by applying the 18 tuned prediction models from each of the sets of features to each observation in the outer-fold test set, resulting in 18 predicted values. We then applied the tuned stacked models to these predicted values from separate sets of features, resulting in eight predicted values. 

      To demonstrate the predictive performance, we assessed the similarity between the observed values and the predicted values of each model across outer-fold test sets, using Pearson’s r, coefficient of determination (R2) and mean absolute error (MAE). Note that for R2, we used the sum of squares definition (i.e., R2 \= 1 – (sum of squares residuals/total sum of squares)) per a previous recommendation (Poldrack et al., 2020). We considered the predicted values from the outer-fold test sets of models predicting age or fluid cognition, as Brain Age and Brain Cognition, respectively.”

      Author response image 1.

      Diagram of the nested cross-validation used for creating predictions for models of each set of features as well as predictions for stacked models. 

      Note some previous research, including ours (Tetereva et al., 2022), splits the observations in the outer-fold training set into layer 1 and layer 2 and applies the first and second steps to layers 1 and 2, respectively. Here we decided against this approach and used the same outer-fold training set for both first and second steps in order to avoid potential bias toward the stacked models. This is because, when the data are split into two layers, predictive models built for each separate set of features only use the data from layer 1, while the stacked models use the data from both layers 1 and 2. In practice with large enough data, these two approaches might not differ much, as we demonstrated previously (Tetereva et al., 2022).

      Reviewer #1 Public Review #3

      Please also provide an indication of the different regression strengths that were estimated across the different models and cross-validation splits. Also, how stable were the weights across splits? 

      The focus of this article is on the predictions. Still, it is informative for readers to understand how stable the feature importance (i.e., Elastic Net coefficients) is. To demonstrate the stability of feature importance, we now examined the rank stability of feature importance using Spearman’s ρ (see Figure 4). Specifically, we correlated the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, we computed 10 Spearman’s ρ for each prediction model of the same features.  We found Spearman’s ρ to be varied dramatically in both age-prediction (range\=.31-.94) and fluid cognition-prediction (range\=.16-.84) models. This means that some prediction models were much more stable in their feature importance than others. This is probably due to various factors such as a) the collinearity of features in the model, b) the number of features (e.g., 71,631 features in functional connectivity, which were further reduced to 75 PCAs, as compared to 19 features in subcortical volume based on the ASEG atlas), c) the penalisation of coefficients either with ‘Ridge’ or ‘Lasso’ methods, which resulted in reduction as a group of features or selection of a feature among correlated features, respectively, and d) the predictive performance of the models. Understanding the stability of feature importance is beyond the scope of the current article. As mentioned by Reviewer 1, “The predictions can be stable when the coefficients are not,” and we chose to focus on the prediction in the current article.   

      Author response image 2.

      Stability of feature importance (i.e., Elastic Net Coefficients) of prediction models. Each dot represents rank stability (reflected by Spearman’s ρ) in the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, there were 10 Spearman’s ρs for each prediction model.  The numbers to the right of the plots indicate the mean of Spearman’s ρ for each prediction model.  

      Reviewer #1 Public Review #4

      Please provide more details about the task designs, MRI processing procedures that were employed on this sample in addition to the regression methods and bias correction methods used. For example, there are several different parameterisations of the elastic net, please provide equations to describe the method used here so that readers can easily determine how the regularisation parameters should be interpreted.  

      Thank you for the opportunity for us to provide more methodical details.

      First, for the task design, we included the following statements:

      From Methods:

      “HCP-A collected fMRI data from three tasks: Face Name (Sperling et al., 2001), Conditioned Approach Response Inhibition Task (CARIT) (Somerville et al., 2018) and VISual MOTOR (VISMOTOR) (Ances et al., 2009). 

      First, the Face Name task (Sperling et al., 2001) taps into episodic memory. The task had three blocks. In the encoding block [Encoding], participants were asked to memorise the names of faces shown. These faces were then shown again in the recall block [Recall] when the participants were asked if they could remember the names of the previously shown faces. There was also the distractor block [Distractor] occurring between the encoding and recall blocks. Here participants were distracted by a Go/NoGo task. We computed six contrasts for this Face Name task: [Encode], [Recall], [Distractor], [Encode vs. Distractor], [Recall vs. Distractor] and [Encode vs. Recall].

      Second, the CARIT task (Somerville et al., 2018) was adapted from the classic Go/NoGo task and taps into inhibitory control. Participants were asked to press a budon to all [Go] but not to two [NoGo] shapes. We computed three contrasts for the CARIT task: [NoGo], [Go] and [NoGo vs. Go]. 

      Third, the VISMOTOR task (Ances et al., 2009) was designed to test simple activation of the motor and visual cortices. Participants saw a checkerboard with a red square either on the lec or right. They needed to press a corresponding key to indicate the location of the red square. We computed just one contrast for the VISMOTOR task: [Vismotor], which indicates the presence of the checkerboard vs. baseline.” 

      Second, for MRI processing procedures, we included the following statements.

      From Methods:

      “HCP-A provides details of parameters for brain MRI elsewhere (Bookheimer et al., 2019; Harms et al., 2018). Here we used MRI data that were pre-processed by the HCP-A with recommended methods, including the MSMALL alignment (Glasser et al., 2016; Robinson et al., 2018) and ICA-FIX (Glasser et al., 2016) for functional MRI. We used multiple brain MRI modalities, covering task functional MRI (task fMRI), resting-state functional MRI (rsfMRI) and structural MRI (sMRI), and organised them into 19 sets of features.”

      “Sets of Features 1-10: Task fMRI contrast (Task Contrast)

      Task contrasts reflect fMRI activation relevant to events in each task. Bookheimer and colleagues (2019) provided detailed information about the fMRI in HCP-A. Here we focused on the pre-processed task fMRI Connectivity Informatics Technology Initiative (CIFTI) files with a suffix, “_PA_Atlas_MSMAll_hp0_clean.dtseries.nii.” These CIFTI files encompassed both the cortical mesh surface and subcortical volume (Glasser et al., 2013). Collected using the posterior-to-anterior (PA) phase, these files were aligned using MSMALL (Glasser et al., 2016; Robinson et al., 2018), linear detrended (see hdps://groups.google.com/a/humanconnectome.org/g/hcp-users/c/ZLJc092h980/m/GiihzQAUAwAJ) and cleaned from potential artifacts using ICA-FIX (Glasser et al., 2016). 

      To extract Task Contrasts, we regressed the fMRI time series on the convolved task events using a double-gamma canonical hemodynamic response function via FMRIB Software Library (FSL)’s FMRI Expert Analysis Tool (FEAT) (Woolrich et al., 2001). We kept FSL’s default high pass cutoff at 200s (i.e., .005 Hz). We then parcellated the contrast ‘cope’ files, using the Glasser atlas (Gordon et al., 2016) for cortical surface regions and the Freesurfer’s automatic segmentation (aseg) (Fischl et al., 2002) for subcortical regions. This resulted in 379 regions, whose number was, in turn, the number of features for each Task Contrast set of features. “ 

      “Sets of Features 11-13: Task fMRI functional connectivity (Task FC)

      Task FC reflects functional connectivity (FC ) among the brain regions during each task, which is considered an important source of individual differences (Elliod et al., 2019; Fair et al., 2007; Gradon et al., 2018). We used the same CIFTI file “_PA_Atlas_MSMAll_hp0_clean.dtseries.nii.” as the task contrasts. Unlike Task Contrasts, here we treated the double-gamma, convolved task events as regressors of no interest and focused on the residuals of the regression from each task (Fair et al., 2007). We computed these regressors on FSL, and regressed them in nilearn (Abraham et al., 2014). Following previous work on task FC (Elliod et al., 2019), we applied a highpass at .008 Hz. For parcellation, we used the same atlases as Task Contrast (Fischl et al., 2002; Glasser et al., 2016). We computed Pearson’s correlations of each pair of 379 regions, resulting in a table of 71,631 non-overlapping FC indices for each task. We then applied r-to-z transformation and principal component analysis (PCA) of 75 components (Rasero et al., 2021; Sripada et al., 2019, 2020). Note to avoid data leakage, we conducted the PCA on each training set and applied its definition to the corresponding test set. Accordingly, there were three sets of 75 features for Task FC, one for each task. 

      Set of Features 14: Resting-state functional MRI functional connectivity (Rest FC) Similar to Task FC, Rest FC reflects functional connectivity (FC ) among the brain regions, except that Rest FC occurred during the resting (as opposed to task-performing) period. HCPA collected Rest FC from four 6.42-min (488 frames) runs across two days, leading to 26-min long data (Harms et al., 2018). On each day, the study scanned two runs of Rest FC, starting with anterior-to-posterior (AP) and then with posterior-to-anterior (PA) phase encoding polarity. We used the “rfMRI_REST_Atlas_MSMAll_hp0_clean.dscalar.nii” file that was preprocessed and concatenated across the four runs.  We applied the same computations (i.e., highpass filter, parcellation, Pearson’s correlations, r-to-z transformation and PCA) with the Task FC. 

      Sets of Features 15-18: Structural MRI (sMRI)

      sMRI reflects individual differences in brain anatomy. The HCP-A used an established preprocessing pipeline for sMRI (Glasser et al., 2013). We focused on four sets of features: cortical thickness, cortical surface area, subcortical volume and total brain volume. For cortical thickness and cortical surface area, we used Destrieux’s atlas (Destrieux et al., 2010; Fischl, 2012) from FreeSurfer’s “aparc.stats” file, resulting in 148 regions for each set of features. For subcortical volume, we used the aseg atlas (Fischl et al., 2002) from FreeSurfer’s “aseg.stats” file, resulting in 19 regions. For total brain volume, we had five FreeSurfer-based features: “FS_IntraCranial_Vol” or estimated intra-cranial volume, “FS_TotCort_GM_Vol” or total cortical grey mader volume, “FS_Tot_WM_Vol” or total cortical white mader volume, “FS_SubCort_GM_Vol” or total subcortical grey mader volume and “FS_BrainSegVol_eTIV_Ratio” or ratio of brain segmentation volume to estimated total intracranial volume.”

      Third, for regression methods and bias correction methods used, we included the following statements:

      From Methods:

      “For the machine learning algorithm, we used Elastic Net (Zou & Hastie, 2005). Elastic Net is a general form of penalised regressions (including Lasso and Ridge regression), allowing us to simultaneously draw information across different brain indices to predict one target variable. Penalised regressions are commonly used for building age-prediction models (Jirsaraie, Gorelik, et al., 2023). Previously we showed that the performance of Elastic Net in predicting cognitive abilities is on par, if not better than, many non-linear and morecomplicated algorithms (Pat, Wang, Bartonicek, et al., 2022; Tetereva et al., 2022). Moreover, Elastic Net coefficients are readily explainable, allowing us the ability to explain how our age-prediction and cognition-prediction models made the prediction from each brain feature (Molnar, 2019; Pat, Wang, Bartonicek, et al., 2022) (see below). 

      Elastic Net simultaneously minimises the weighted sum of the features’ coefficients. The degree of penalty to the sum of the feature’s coefficients is determined by a shrinkage hyperparameter ‘a’: the greater the a, the more the coefficients shrink, and the more regularised the model becomes. Elastic Net also includes another hyperparameter, ‘ℓ! ratio’, which determines the degree to which the sum of either the squared (known as ‘Ridge’; ℓ! ratio=0) or absolute (known as ‘Lasso’; ℓ! ratio=1) coefficients is penalised (Zou & Hastie, 2005). The objective function of Elastic Net as implemented by sklearn (Pedregosa et al., 2011) is defined as:

      where X is the features, y is the target, and b is the coefficient. In our grid search, we tuned two Elastic Net hyperparameters: a using 70 numbers in log space, ranging from .1 and 100, and ℓ!-ratio using 25 numbers in linear space, ranging from 0 and 1.

      To understand how Elastic Net made a prediction based on different brain features, we examined the coefficients of the tuned model. Elastic Net coefficients can be considered as feature importance, such that more positive Elastic Net coefficients lead to more positive predicted values and, similarly, more negative Elastic Net coefficients lead to more negative predicted values (Molnar, 2019; Pat, Wang, Bartonicek, et al., 2022). While the magnitude of Elastic Net coefficients is regularised (thus making it difficult for us to interpret the magnitude itself directly), we could still indicate that a brain feature with a higher magnitude weights relatively stronger in making a prediction. Another benefit of Elastic Net as a penalised regression is that the coefficients are less susceptible to collinearity among features as they have already been regularised (Dormann et al., 2013; Pat, Wang, Bartonicek, et al., 2022).

      Given that we used five-fold nested cross validation, different outer folds may have different degrees of ‘a’ and ‘ℓ! ratio’, making the final coefficients from different folds to be different. For instance, for certain sets of features, penalisation may not play a big part (i.e., higher or lower ‘a’ leads to similar predictive performance), resulting in different ‘a’ for different folds. To remedy this in the visualisation of Elastic Net feature importance, we refitted the Elastic Net model to the full dataset without spli{ng them into five folds and visualised the coefficients on brain images using Brainspace (Vos De Wael et al., 2020) and Nilern (Abraham et al., 2014) packages. Note, unlike other sets of features, Task FC and Rest FC were modelled acer data reduction via PCA. Thus, for Task FC and Rest FC, we, first, multiplied the absolute PCA scores (extracted from the ‘components_’ attribute of ‘sklearn.decomposition.PCA’) with Elastic Net coefficients and, then, summed the multiplied values across the 75 components, leaving 71,631 ROI-pair indices.

      References

      Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., & Varoquaux, G. (2014). Machine learning for neuroimaging with scikitlearn. Frontiers in Neuroinformatics, 8, 14. hdps://doi.org/10.3389/fninf.2014.00014

      Ances, B. M., Liang, C. L., Leontiev, O., Perthen, J. E., Fleisher, A. S., Lansing, A. E., & Buxton, R. B. (2009). Effects of aging on cerebral blood flow, oxygen metabolism, and blood oxygenation level dependent responses to visual stimulation. Human Brain Mapping, 30(4), 1120–1132. hdps://doi.org/10.1002/hbm.20574

      Bashyam, V. M., Erus, G., Doshi, J., Habes, M., Nasrallah, I. M., Truelove-Hill, M., Srinivasan, D., Mamourian, L., Pomponio, R., Fan, Y., Launer, L. J., Masters, C. L., Maruff, P., Zhuo, C., Völzke, H., Johnson, S. C., Fripp, J., Koutsouleris, N., Saderthwaite, T. D., … on behalf of the ISTAGING Consortium,  the P. A. disease C., ADNI, and CARDIA studies. (2020). MRI signatures of brain age and disease over the lifespan based on a deep brain network and 14 468 individuals worldwide. Brain, 143(7), 2312–2324. hdps://doi.org/10.1093/brain/awaa160

      Bookheimer, S. Y., Salat, D. H., Terpstra, M., Ances, B. M., Barch, D. M., Buckner, R. L., Burgess, G. C., Curtiss, S. W., Diaz-Santos, M., Elam, J. S., Fischl, B., Greve, D. N., Hagy, H. A., Harms, M. P., Hatch, O. M., Hedden, T., Hodge, C., Japardi, K. C., Kuhn, T. P., … Yacoub, E. (2019). The Lifespan Human Connectome Project in Aging: An overview. NeuroImage, 185, 335–348. hdps://doi.org/10.1016/j.neuroimage.2018.10.009

      Butler, E. R., Chen, A., Ramadan, R., Le, T. T., Ruparel, K., Moore, T. M., Saderthwaite, T. D., Zhang, F., Shou, H., Gur, R. C., Nichols, T. E., & Shinohara, R. T. (2021). Pi alls in brain age analyses. Human Brain Mapping, 42(13), 4092–4101. hdps://doi.org/10.1002/hbm.25533

      Cole, J. H. (2020). Multimodality neuroimaging brain-age in UK biobank: Relationship to biomedical, lifestyle, and cognitive factors. Neurobiology of Aging, 92, 34–42. hdps://doi.org/10.1016/j.neurobiolaging.2020.03.014

      Destrieux, C., Fischl, B., Dale, A., & Halgren, E. (2010). Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage, 53(1), 1–15. hdps://doi.org/10.1016/j.neuroimage.2010.06.010

      Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D., & Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27–46. hdps://doi.org/10.1111/j.16000587.2012.07348.x

      Dubois, J., Galdi, P., Paul, L. K., & Adolphs, R. (2018). A distributed brain network predicts general intelligence from resting-state human neuroimaging data. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1756), 20170284. hdps://doi.org/10.1098/rstb.2017.0284

      Elliod, M. L., Knodt, A. R., Cooke, M., Kim, M. J., Melzer, T. R., Keenan, R., Ireland, D., Ramrakha, S., Poulton, R., Caspi, A., Moffid, T. E., & Hariri, A. R. (2019). General functional connectivity: Shared features of resting-state and task fMRI drive reliable and heritable individual differences in functional brain networks. NeuroImage, 189, 516–532. hdps://doi.org/10.1016/j.neuroimage.2019.01.068

      Fair, D. A., Schlaggar, B. L., Cohen, A. L., Miezin, F. M., Dosenbach, N. U. F., Wenger, K. K., Fox, M. D., Snyder, A. Z., Raichle, M. E., & Petersen, S. E. (2007). A method for using blocked and event-related fMRI data to study “resting state” functional connectivity. NeuroImage, 35(1), 396–405. hdps://doi.org/10.1016/j.neuroimage.2006.11.051

      Fischl, B. (2012). FreeSurfer. NeuroImage, 62(2), 774–781. hdps://doi.org/10.1016/j.neuroimage.2012.01.021

      Fischl, B., Salat, D. H., Busa, E., Albert, M., Dieterich, M., Haselgrove, C., van der Kouwe, A., Killiany, R., Kennedy, D., Klaveness, S., Montillo, A., Makris, N., Rosen, B., & Dale, A. M. (2002). Whole Brain Segmentation. Neuron, 33(3), 341–355. hdps://doi.org/10.1016/S0896-6273(02)00569-X

      Glasser, M. F., Smith, S. M., Marcus, D. S., Andersson, J. L. R., Auerbach, E. J., Behrens, T. E. J., Coalson, T. S., Harms, M. P., Jenkinson, M., Moeller, S., Robinson, E. C., Sotiropoulos, S. N., Xu, J., Yacoub, E., Ugurbil, K., & Van Essen, D. C. (2016). The Human Connectome Project’s neuroimaging approach. Nature Neuroscience, 19(9), 1175– 1187. hdps://doi.org/10.1038/nn.4361

      Glasser, M. F., Sotiropoulos, S. N., Wilson, J. A., Coalson, T. S., Fischl, B., Andersson, J. L., Xu, J., Jbabdi, S., Webster, M., Polimeni, J. R., Van Essen, D. C., & Jenkinson, M. (2013). The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage, 80, 105–124. hdps://doi.org/10.1016/j.neuroimage.2013.04.127

      Gordon, E. M., Laumann, T. O., Adeyemo, B., Huckins, J. F., Kelley, W. M., & Petersen, S. E. (2016). Generation and Evaluation of a Cortical Area Parcellation from Resting-State Correlations. Cerebral Cortex, 26(1), 288–303. hdps://doi.org/10.1093/cercor/bhu239

      Gradon, C., Laumann, T. O., Nielsen, A. N., Greene, D. J., Gordon, E. M., Gilmore, A. W., Nelson, S. M., Coalson, R. S., Snyder, A. Z., Schlaggar, B. L., Dosenbach, N. U. F., & Petersen, S. E. (2018). Functional Brain Networks Are Dominated by Stable Group and Individual Factors, Not Cognitive or Daily Variation. Neuron, 98(2), 439-452.e5. hdps://doi.org/10.1016/j.neuron.2018.03.035

      Hahn, T., Fisch, L., Ernsting, J., Winter, N. R., Leenings, R., Sarink, K., Emden, D., Kircher, T., Berger, K., & Dannlowski, U. (2021). From ‘loose fi{ng’ to high-performance, uncertainty-aware brain-age modelling. Brain, 144(3), e31–e31. hdps://doi.org/10.1093/brain/awaa454

      Harms, M. P., Somerville, L. H., Ances, B. M., Andersson, J., Barch, D. M., Bastiani, M., Bookheimer, S. Y., Brown, T. B., Buckner, R. L., Burgess, G. C., Coalson, T. S., Chappell, M. A., Dapredo, M., Douaud, G., Fischl, B., Glasser, M. F., Greve, D. N., Hodge, C., Jamison, K. W., … Yacoub, E. (2018). Extending the Human Connectome Project across ages: Imaging protocols for the Lifespan Development and Aging projects. NeuroImage, 183, 972–984. hdps://doi.org/10.1016/j.neuroimage.2018.09.060

      Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., Sanislow, C., & Wang, P. (2010). Research Domain Criteria (RDoC): Toward a New Classification Framework for Research on Mental Disorders. American Journal of Psychiatry, 167(7), 748–751. hdps://doi.org/10.1176/appi.ajp.2010.09091379

      Jirsaraie, R. J., Gorelik, A. J., Gatavins, M. M., Engemann, D. A., Bogdan, R., Barch, D. M., & Sotiras, A. (2023). A systematic review of multimodal brain age studies: Uncovering a divergence between model accuracy and utility. PaUerns, 4(4), 100712. hdps://doi.org/10.1016/j.pader.2023.100712

      Jirsaraie, R. J., Kaufmann, T., Bashyam, V., Erus, G., Luby, J. L., Westlye, L. T., Davatzikos, C., Barch, D. M., & Sotiras, A. (2023). Benchmarking the generalizability of brain age models: Challenges posed by scanner variance and prediction bias. Human Brain Mapping, 44(3), 1118–1128. hdps://doi.org/10.1002/hbm.26144

      Marquand, A. F., Rezek, I., Buitelaar, J., & Beckmann, C. F. (2016). Understanding Heterogeneity in Clinical Cohorts Using Normative Models: Beyond Case-Control Studies. Biological Psychiatry, 80(7), 552–561. hdps://doi.org/10.1016/j.biopsych.2015.12.023

      Molnar, C. (2019). Interpretable Machine Learning. A Guide for Making Black Box Models Explainable. hdps://christophm.github.io/interpretable-ml-book/

      Nimon, K., Lewis, M., Kane, R., & Haynes, R. M. (2008). An R package to compute commonality coefficients in the multiple regression case: An introduction to the package and a practical example. Behavior Research Methods, 40(2), 457–466. hdps://doi.org/10.3758/BRM.40.2.457

      Pat, N., Wang, Y., Anney, R., Riglin, L., Thapar, A., & Stringaris, A. (2022). Longitudinally stable, brain-based predictive models mediate the relationships between childhood cognition and socio-demographic, psychological and genetic factors. Human Brain Mapping, hbm.26027. hdps://doi.org/10.1002/hbm.26027

      Pat, N., Wang, Y., Bartonicek, A., Candia, J., & Stringaris, A. (2022). Explainable machine learning approach to predict and explain the relationship between task-based fMRI and individual differences in cognition. Cerebral Cortex, bhac235. hdps://doi.org/10.1093/cercor/bhac235

      Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Predenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825–2830.

      Poldrack, R. A., Huckins, G., & Varoquaux, G. (2020). Establishment of Best Practices for Evidence for Prediction: A Review. JAMA Psychiatry, 77(5), 534–540. hdps://doi.org/10.1001/jamapsychiatry.2019.3671

      Rasero, J., Sentis, A. I., Yeh, F.-C., & Verstynen, T. (2021). Integrating across neuroimaging modalities boosts prediction accuracy of cognitive ability. PLOS Computational Biology, 17(3), e1008347. hdps://doi.org/10.1371/journal.pcbi.1008347

      Robinson, E. C., Garcia, K., Glasser, M. F., Chen, Z., Coalson, T. S., Makropoulos, A., Bozek, J., Wright, R., Schuh, A., Webster, M., Huder, J., Price, A., Cordero Grande, L., Hughes, E., Tusor, N., Bayly, P. V., Van Essen, D. C., Smith, S. M., Edwards, A. D., … Rueckert, D. (2018). Multimodal surface matching with higher-order smoothness constraints. NeuroImage, 167, 453–465. hdps://doi.org/10.1016/j.neuroimage.2017.10.037

      Rokicki, J., Wolfers, T., Nordhøy, W., Tesli, N., Quintana, D. S., Alnæs, D., Richard, G., de Lange, A.-M. G., Lund, M. J., Norbom, L., Agartz, I., Melle, I., Nærland, T., Selbæk, G., Persson, K., Nordvik, J. E., Schwarz, E., Andreassen, O. A., Kaufmann, T., & Westlye, L. T. (2021). Multimodal imaging improves brain age prediction and reveals distinct abnormalities in patients with psychiatric and neurological disorders. Human Brain Mapping, 42(6), 1714–1726. hdps://doi.org/10.1002/hbm.25323

      Somerville, L. H., Bookheimer, S. Y., Buckner, R. L., Burgess, G. C., Curtiss, S. W., Dapredo, M., Elam, J. S., Gaffrey, M. S., Harms, M. P., Hodge, C., Kandala, S., Kastman, E. K., Nichols, T. E., Schlaggar, B. L., Smith, S. M., Thomas, K. M., Yacoub, E., Van Essen, D. C., & Barch, D. M. (2018). The Lifespan Human Connectome Project in Development: A large-scale study of brain connectivity development in 5–21 year olds. NeuroImage, 183, 456–468. hdps://doi.org/10.1016/j.neuroimage.2018.08.050

      Sperling, R. A., Bates, J. F., Cocchiarella, A. J., Schacter, D. L., Rosen, B. R., & Albert, M. S. (2001). Encoding novel face-name associations: A functional MRI study. Human Brain Mapping, 14(3), 129–139. hdps://doi.org/10.1002/hbm.1047

      Sripada, C., Angstadt, M., Rutherford, S., Kessler, D., Kim, Y., Yee, M., & Levina, E. (2019). Basic Units of Inter-Individual Variation in Resting State Connectomes. Scientific Reports, 9(1), Article 1. hdps://doi.org/10.1038/s41598-018-38406-5

      Sripada, C., Angstadt, M., Rutherford, S., Taxali, A., & Shedden, K. (2020). Toward a “treadmill test” for cognition: Improved prediction of general cognitive ability from the task activated brain. Human Brain Mapping, 41(12), 3186–3197. hdps://doi.org/10.1002/hbm.25007

      Tetereva, A., Li, J., Deng, J. D., Stringaris, A., & Pat, N. (2022). Capturing brain-cognition relationship: Integrating task-based fMRI across tasks markedly boosts prediction and test-retest reliability. NeuroImage, 263, 119588. hdps://doi.org/10.1016/j.neuroimage.2022.119588

      Vieira, B. H., Pamplona, G. S. P., Fachinello, K., Silva, A. K., Foss, M. P., & Salmon, C. E. G. (2022). On the prediction of human intelligence from neuroimaging: A systematic review of methods and reporting. Intelligence, 93, 101654. hdps://doi.org/10.1016/j.intell.2022.101654

      Vos De Wael, R., Benkarim, O., Paquola, C., Lariviere, S., Royer, J., Tavakol, S., Xu, T., Hong, S.J., Langs, G., Valk, S., Misic, B., Milham, M., Margulies, D., Smallwood, J., & Bernhardt, B. C. (2020). BrainSpace: A toolbox for the analysis of macroscale gradients in neuroimaging and connectomics datasets. Communications Biology, 3(1), 103. hdps://doi.org/10.1038/s42003-020-0794-7

      Woolrich, M. W., Ripley, B. D., Brady, M., & Smith, S. M. (2001). Temporal Autocorrelation in Univariate Linear Modeling of FMRI Data. NeuroImage, 14(6), 1370–1386. hdps://doi.org/10.1006/nimg.2001.0931

      Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320. hdps://doi.org/10.1111/j.1467-9868.2005.00503.x

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Reviewer #2 (Public Review):

      Summary:

      In the revised manuscript, the authors aim to investigate brain-wide activation patterns following administration of the anesthetics ketamine and isoflurane, and conduct comparative analysis of these patterns to understand shared and distinct mechanisms of these two anesthetics. To this end, they perform Fos immunohistochemistry in perfused brain sections to label active nuclei, use a custom pipeline to register images to the ABA framework and quantify Fos+ nuclei, and perform multiple complementary analyses to compare activation patterns across groups.

      In the latest revision, the authors have made some changes in response to our previous comments on how to fix the analyses. However, the revised analyses were not changed correctly and remain flawed in several fundamental ways.

      Critical problems:

      (1) Before one can perform higher level analyses such as hiearchal cluster or network hub (or PC) analysis, it is fundamental to validate that you have significant differences of the raw Fos expression values in the first place. First of all, this means showing figures with the raw data (Fos expression levels) in some form in Figures 2 and 3 before showing the higher level analyses in Figures 4 and 5; this is currently switched around. Second and most importantly, when you have a large number of brain areas with large differences in mean values and variance, you need to account for this in a meaningful way. Changing to log values is a step in the right direction for mean values but does not account well for differences in variance. Indeed, considering the large variances in brain areas with high mean values and variance, it is a little difficult to believe that all brain regions, especially brain areas with low mean values, passed corrections for multiple comparisons test. We suggested Z-scores relative to control values for each brain region; this would have accounted for wide differences in mean values and variance, but this was not done. Overall, validation of anesthesia-induced differences in Fos expression levels is not yet shown.

      (a) Reordering the figures.

      Thank you for your suggestion. We have added Figure 2 (for 201 brain regions) and Figure 2—figure supplement 1 (for 53 brain regions) to demonstrate the statistical differences in raw Fos expression between KET and ISO compared to their respective control groups. These figures specifically present the raw c-Fos expression levels for both KET and ISO in the same brain areas, providing a fundamental basis for the subsequent analyses. Additionally, we have moved the original Figures 4 and 5 to Figures 3 and 4.

      (b) Z-score transformation and validation of anesthesia-induced differences in Fos expression.

      Thank you for your suggestion. Before multiple comparisons, we transformed the data into log c-Fos density and then performed Z-scores relative to control values for each brain region. Indeed, through Z-score transformation, we have identified a larger number of significantly activated brain regions in Figure 2. The number of brain regions showing significant activation increased by 100 for KET and by 39 for ISO. We have accordingly updated the results section to include these findings in Line 80-181. Besides, we have added the following content in the Statistical Analysis section in Line 489: "…In Figure 2 and Figure 2–figure supplement 1, c-Fos densities in both experimental and control groups were log-transformed. Z-scores were calculated for each brain region by normalizing these log-transformed values against the mean and standard deviation of its respective control group. This involved subtracting the control mean from the experimental value and dividing the result by the control standard deviation. For statistical analysis, Z-scores were compared to a null distribution with a zero mean, and adjustments were made for multiple comparisons using the Benjamini–Hochberg method with a 5% false discovery rate (Q)..…".

      Author response image 1.

      KET and ISO induced c-Fos expression relative to their respective control group across 201 distinct brain regions. Z-scores represent the normalized c-Fos expression in the KET and ISO groups, calculated against the mean and standard deviation from their respective control groups. Statistical analysis involved the comparison of Z-scores to a null distribution with a zero mean and adjustment for multiple comparisons using the Benjamini–Hochberg method at a 5% false discovery rate (p < 0.05, p < 0.01, **p < 0.001). n = 6, 6, 8, 6 for the home cage, ISO, saline, and KET, respectively. Missing values resulted from zero standard deviations in control groups. Brain regions are categorized into major anatomical subdivisions, as shown on the left side of the graph.

      Author response image 2.

      KET and ISO induced c-Fos expression relative to their respective control group across 53 distinct brain regions. Z-scores for c-Fos expression in the KET and ISO groups were normalized to the mean and standard deviation of their respective control groups. Statistical analysis involved the comparison of Z-scores to a null distribution with a zero mean and adjustment for multiple comparisons using the Benjamini–Hochberg method at a 5\% false discovery rate (p < 0.05, p < 0.01, **p < 0.001). Brain regions are organized into major anatomical subdivisions, as indicated on the left side of the graph.

      (2) Let's assume for a moment that the raw Fos expression analyses indicate significant differences. They used hierarchal cluster analyses as a rationale for examining 53 brain areas in all subsequent analyses of Fos expression following isoflurane versus home cage or ketamine versus saline. Instead, the authors changed to 201 brain areas with no validated rationale other than effectively saying 'we wanted to look at more brain areas'. And then later, when they examined raw Fos expression values in Figures 4 and 5, they assess 43 brain areas for ketamine and 20 brain areas for isoflurane, without any rationale for why choosing these numbers of brain areas. This is a particularly big problem when they are trying to compare effects of isoflurane versus ketamine on Fos expression in these brain areas - they did not compare the same brain areas.

      (a) Changing to 201 brain areas with validated rationale.

      Thank you for your question. We have revised the original text from “To enhance our analysis of c-Fos expression patterns induced by KET and ISO, we expanded our study to 201 subregions.” to Line 100: "…To enable a more detailed examination and facilitate clearer differentiation and comparison of the effects caused by KET and ISO, we subdivided the 53 brain regions into 201 distinct areas. This approach, guided by the standard mouse atlas available at http://atlas.brain-map.org/atlas, allowed for an in-depth analysis of the responses in various brain regions…". For hierarchal cluster analyses from 53 to 201 brain regions, Line 215: "…To achieve a more granular analysis and better discern the responses between KET and ISO, we expanded our study from the initial 53 brain regions to 201 distinct subregions…"

      (b) Compare the same brain areas for KET and ISO and the rationale for why choosing these numbers of brain areas in Figures 3 and 4.

      We apologize for the confusion and lack of clarity regarding the selection of brain regions for analysis. In Figure 2 and Figure 2—figure supplement 1, we display the c-Fos expression in the same brain regions affected by KET and ISO. In Figures 3 and 4, we applied a uniform standard to specifically report the brain areas most prominently activated by KET and ISO, respectively. As specified in Line 104: "…Compared to the saline group, KET activated 141 out of a total of 201 brain regions (Figure 2). To further identify the brain regions that are most significantly affected by KET, we calculated Cohen's d for each region to quantify the magnitude of activation and subsequently focused on those regions that had a corrected p-value below 0.05 and effect size in the top 40% (Figure 3, Figure 3—figure supplement 1)…" and Line 142: "…Using the same criteria applied to KET, which involved selecting regions with Cohen's d values in the top 40% of significantly activated areas from Figure 2, we identified 32 key brain regions impacted by ISO (Figure 4, Figure 4—figure supplement 1).…".

      Moreover, we illustrate the co-activated brain regions by KET and ISO in Figure 4C. As detailed in Lines 167-180:"…The co-activation of multiple brain regions by KET and ISO indicates that they have overlapping effects on brain functions. Examples of these effects include impacts on sensory processing, as evidenced by the activation of the PIR, ENT 1, and OT2, pointing to changes in sensory perception typical of anesthetics. Memory and cognitive functions are influenced, as indicated by the activation of the subiculum (SUB) 3, dentate gyrus (DG) 4, and RE 5. The reward and motivational systems are engaged, involving the ACB and ventral tegmental area (VTA), signaling the modulation of reward pathways 6. Autonomic and homeostatic control are also affected, as shown by areas like the lateral hypothalamic area (LHA) 7 and medial preoptic area (MPO) 8, emphasizing effects on functions such as feeding and thermoregulation. Stress and arousal responses are impacted through the activation of the paraventricular hypothalamic nucleus (PVH) 10,11 and LC 12. This broad activation pattern highlights the overlap in drug effects and the complexity of brain networks in anesthesia…". Below are the revised Figures 3 and 4.

      (1) Chapuis, J. et al. Lateral entorhinal modulation of piriform cortical activity and fine odor discrimination. J. Neurosci. 33, 13449-13459 (2013). https://doi.org:10.1523/jneurosci.1387-13.2013

      (2) Giessel, A. J. & Datta, S. R. Olfactory maps, circuits and computations. Curr. Opin. Neurobiol. 24, 120-132 (2014). https://doi.org:10.1016/j.conb.2013.09.010

      (3) Roy, D. S. et al. Distinct Neural Circuits for the Formation and Retrieval of Episodic Memories. Cell 170, 1000-1012.e1019 (2017). https://doi.org:10.1016/j.cell.2017.07.013

      (4) Sun, X. et al. Functionally Distinct Neuronal Ensembles within the Memory Engram. Cell 181, 410-423.e417 (2020). https://doi.org:10.1016/j.cell.2020.02.055

      (5) Huang, X. et al. A Visual Circuit Related to the Nucleus Reuniens for the Spatial-Memory-Promoting Effects of Light Treatment. Neuron (2021).

      (6) Al-Hasani, R. et al. Ventral tegmental area GABAergic inhibition of cholinergic interneurons in the ventral nucleus accumbens shell promotes reward reinforcement. Nat. Neurosci. 24, 1414-1428 (2021). https://doi.org:10.1038/s41593-021-00898-2

      (7) Mickelsen, L. E. et al. Single-cell transcriptomic analysis of the lateral hypothalamic area reveals molecularly distinct populations of inhibitory and excitatory neurons. Nat. Neurosci. 22, 642-656 (2019). https://doi.org:10.1038/s41593-019-0349-8

      (8) McGinty, D. & Szymusiak, R. Keeping cool: a hypothesis about the mechanisms and functions of slow-wave sleep. Trends Neurosci. 13, 480-487 (1990). https://doi.org:10.1016/0166-2236(90)90081-k

      (9) Mullican, S. E. et al. GFRAL is the receptor for GDF15 and the ligand promotes weight loss in mice and nonhuman primates. Nat. Med. 23, 1150-1157 (2017). https://doi.org:10.1038/nm.4392

      (10) Rasiah, N. P., Loewen, S. P. & Bains, J. S. Windows into stress: a glimpse at emerging roles for CRH(PVN) neurons. Physiol. Rev. 103, 1667-1691 (2023). https://doi.org:10.1152/physrev.00056.2021

      (11) Islam, M. T. et al. Vasopressin neurons in the paraventricular hypothalamus promote wakefulness via lateral hypothalamic orexin neurons. Curr. Biol. 32, 3871-3885.e3874 (2022). https://doi.org:10.1016/j.cub.2022.07.020

      (12) Ross, J. A. & Van Bockstaele, E. J. The Locus Coeruleus- Norepinephrine System in Stress and Arousal: Unraveling Historical, Current, and Future Perspectives. Front Psychiatry 11, 601519 (2020). https://doi.org:10.3389/fpsyt.2020.601519

      Author response image 3.

      Brain regions exhibiting significant activation by KET. (A) Fifty-five brain regions exhibited significant KET activation. These were chosen from the 201 regions analyzed in Figure 2, focusing on the top 40\% ranked by effect size among those with corrected p values less than 0.05. Data are presented as mean ± SEM, with p-values adjusted for multiple comparisons (p < 0.05, p < 0.01, **p < 0.001). (B) Representative immunohistochemical staining of brain regions identified in Figure 3A, with control group staining available in Figure 3—figure supplement 1. Scale bar: 200 µm.

      Author response image 4.

      Brain regions exhibiting significant activation by ISO. (A) Brain regions significantly activated by ISO were initially identified using a corrected p-value below 0.05. From these, the top 40% in effect size (Cohen’s d) were further selected, resulting in 32 key areas. p-values are adjusted for multiple comparisons (p < 0.01, *p < 0.001). (B) Representative immunohistochemical staining of brain regions identified in Figure 4A. Control group staining is available in Figure 4—figure supplement 1. Scale bar: 200 µm. Scale bar: 200 µm. (C) A Venn diagram displays 43 brain regions co-activated by KET and ISO, identified by the adjusted p-values (p < 0.05) for both KET and ISO. CTX: cerebral cortex; CNU: cerebral nuclei; TH: thalamus; HY: hypothalamus; MB: midbrain; HB: hindbrain.

      Less critical comments:

      (3) The explanation of hierarchical level's in lines 90-95 did not make sense.

      We have revised the section that initially stated in lines 90-95, "…Based on the standard mouse atlas available at http://atlas.brain-map.org/, the mouse brain was segmented into nine hierarchical levels, totaling 984 regions. The primary level consists of grey matter, the secondary of the cerebrum, brainstem, and cerebellum, and the tertiary includes regions like the cerebral cortex and cerebellar nuclei, among others, with some regions extending to the 8th and 9th levels. The fifth level comprises 53 subregions, with detailed expression levels and their respective abbreviations presented in Supplementary Figure 2…". Our revised description, now in line 91: "…Building upon the framework established in previous literature, our study categorizes the mouse brain into 53 distinct subregions1…"

      (1) Do JP, Xu M, Lee SH, Chang WC, Zhang S, Chung S, Yung TJ, Fan JL, Miyamichi K, Luo L et al: Cell type-specific long-range connections of basal forebrain circuit. Elife 2016, 5.

      (4) I am still perplexed by why the authors consider the prelimbic and infralimbic cortex 'neuroendocrine' brain areas in the abstract. In contrast, the prelimbic and infralimbic were described better in the introduction as "associated information processing" areas.

      Thank you for bringing this to our attention. We agree that classifying the prelimbic and infralimbic cortex as 'neuroendocrine' in the abstract was incorrect, which was an oversight on our part. In the revised version, as detailed in line 167, we observed an increased number of brain regions showing overlapping activation by both KET and ISO, which is depicted in Figure 4C. This extensive co-activation across various regions makes it challenging to narrowly define the functional classification of each area. Consequently, we have revised the abstract, updating this in line 21: "…KET and ISO both activate brain areas involved in sensory processing, memory and cognition, reward and motivation, as well as autonomic and homeostatic control, highlighting their shared effects on various neural pathways.…".

      (5) It looks like overall Fos levels in the control group Home (ISO) are a magnitude (~10-fold) lower than those in the control group Saline (KET) across all regions shown. This large difference seems unlikely to be due to a biologically driven effect and seems more likely to be due to a technical issue, such as differences in staining or imaging between experiments. The authors discuss this issue but did not answer whether the Homecage-ISO experiment or at least the Fos labeling and imaging performed at the same time as for the Saline-Ketamine experiment?

      Thank you for highlighting this important point. The c-Fos labeling and imaging for the Home (ISO) and Saline (KET) groups were carried out in separate sessions due to the extensive workload involved in these processes. This study processed a total of 26 brain samples. Sectioning the entire brain of each mouse required approximately 3 hours, yielding 5 slides, with each slide containing 12 to 16 brain sections. We were able to stain and image up to 20 slides simultaneously, typically comprising 2 experimental groups and 2 corresponding control groups. Imaging these 20 slides at 10x magnification took roughly 7 hours, while additional time was required for confocal imaging of specific areas of interest at 20x magnification. Given the complexity of these procedures, to ensure consistency across all experiments, they were conducted under uniform conditions. This included the use of consistent primary and secondary antibody concentrations, incubation times, and imaging parameters such as fixed light intensity and exposure time. Furthermore, in the saline and KET groups, intraperitoneal injections might have evoked pain and stress responses in mice despite four days of pre-experiment acclimation, which could have contributed to the increased c-Fos expression observed. This aspect, along with the fact that procedures were conducted in separate sessions, might have introduced some variations. Thus, we have included a note in our discussion section in Line 353: "…Despite four days of acclimation, including handling and injections, intraperitoneal injections in the saline and KET groups might still elicit pain and stress responses in mice. This point is corroborated by the subtle yet measurable variations in brain states between the home cage and saline groups, characterized by changes in normalized EEG delta/theta power (home cage: 0.05±0.09; saline: -0.03±0.11) and EMG power (home cage: -0.37±0.34; saline: 0.04±0.13), as shown in Figure 1–figure supplement 1. These changes suggest a relative increase in brain activity in the saline group compared to the home cage group, potentially contributing to the higher c-Fos expression. Additionally, despite the use of consistent parameters for c-Fos labeling and imaging across all experiments, the substantial differences observed between the saline and home cage groups might be partly attributed to the fact that the operations were conducted in separate sessions.…"

      Reviewer #3 (Public Review):

      The present study presents a comprehensive exploration of the distinct impacts of Isoflurane and Ketamine on c-Fos expression throughout the brain. To understand the varying responses across individual brain regions to each anesthetic, the researchers employ principal component analysis (PCA) and c-Fos-based functional network analysis. The methodology employed in this research is both methodical and expansive. Notably, the utilization of a custom software package to align and analyze brain images for c-Fos positive cells stands out as an impressive addition to their approach. This innovative technique enables effective quantification of neural activity and enhances our understanding of how anesthetic drugs influence brain networks as a whole.

      The primary novelty of this paper lies in the comparative analysis of two anesthetics, Ketamine and Isoflurane, and their respective impacts on brain-wide c-Fos expression. The study reveals the distinct pathways through which these anesthetics induce loss of consciousness. Ketamine primarily influences the cerebral cortex, while Isoflurane targets subcortical brain regions. This finding highlights the differing mechanisms of action employed by these two anesthetics-a top-down approach for Ketamine and a bottom-up mechanism for Isoflurane. Furthermore, this study uncovers commonly activated brain regions under both anesthetics, advancing our knowledge about the mechanisms underlying general anesthesia.

      We are thankful for your positive and insightful comments on our study. Your recognition of the study's methodology and its significance in advancing our understanding of anesthetic mechanisms is greatly valued. By comprehensively mapping c-Fos expression across a wide range of brain regions, our study reveals the distinct and overlapping impacts of these anesthetics on various brain functions, providing a valuable foundation for future research into the mechanisms of general anesthesia, potentially guiding the development of more targeted anesthetic agents and therapeutic strategies. Thus, we are confident that our work will captivate the interest of our readers.

    2. Author Response

      The following is the authors’ response to the previous reviews.

      We appreciate the reviewers for their insightful feedback, which has substantially improved our manuscript. Following the suggestions of the reviewers, we have undertaken the following major revisions:

      a. Concerning data transformation, we have adjusted the methodology in Figures 2 and 3. Instead of normalizing c-Fos density to the whole brain c-Fos density as initially described, we now normalize to the c-Fos density of the corresponding brain region in the control group. b. We have substituted the PCA approach with hierarchical clustering in Figures 2 and 3.

      c. In the discussion section, we added a subsection on study limitations, focusing on the variations in drug administration routes and anesthesia depth.

      Enclosed are our detailed responses to each of the reviewer's comments.

      Reviewer #1:

      1a. The addition of the EEG/EMG is useful, however, this information is not discussed. For instance, there are differences in EEG/EMG between the two groups (only Ket significantly increased delta/theta power, and only ISO decreased EMG power). These results should be discussed as well as the limitation of not having physiological measures of anesthesia to control for the anesthesia depth.

      1b. The possibility that the differences in fos observed may be due to the doses used should be discussed.

      1c. The possibility that the differences in fos observed may be due kinetic of anesthetic used should be discussed.

      Thank you for your suggestions. We have now discussed EEG/EMG result, limitation of not having physiological measures of anesthesia to control for the anesthesia depth, The possibility that the differences in fos observed may be due to the doses, The possibility that the differences in Fos observed may be due kinetic of anesthetic in the revised manuscript (Lines 308-331, also shown below).

      Lines 308-331: "...Our findings indicate that c-Fos expression in the KET group is significantly elevated compared to the ISO group, and the saline group exhibits notably higher c-Fos expression than the home cage group, as seen in Supplementary Figures 2 and 3. Intraperitoneal saline injections in the saline group, despite pre-experiment acclimation with handling and injections for four days, may still evoke pain and stress responses in mice. Subtle yet measurable variations in brain states between the home cage and saline groups were observed, characterized by changes in normalized EEG delta/theta power (home cage: 0.05±0.09; saline: -0.03±0.11) and EMG power (home cage: -0.37±0.34; saline: 0.04±0.13), as shown in Supplementary Figure 1. These changes suggest a relative increase in overall brain activity in the saline group compared to the home cage group, potentially contributing to the higher c-Fos expression. Although the difference in EEG power between the ISO group and the home cage control was not significant, the increase in EEG power observed in the ISO group was similar to that of KET (0.47 ± 0.07 vs 0.59 ± 0.10), suggesting that both agents may induce loss of consciousness in mice. Regarding EMG power, ISO showed a significant decrease in EMG power compared to its control group. In contrast, the KET group showed a lesser reduction in EMG power (ISO: -1.815± 0.10; KET: -0.96 ± 0.21), which may partly explain the higher overall c-Fos expression levels in the KET group. This is consistent with previous studies where ketamine doses up to 150 mg/kg increase delta power while eliciting a wakefulness-like pattern of c-Fos expression across the brain [1]. Furthermore, the observed differences in c-Fos expression may arise in part from the dosages, routes of administration, and their distinct pharmacokinetic profiles. This variation is compounded by the lack of detailed physiological monitoring, such as blood pressure, heart rate, and respiration, affecting our ability to precisely assess anesthesia depth. Future studies incorporating comprehensive physiological monitoring and controlled dosing regimens are essential to further elucidate these relationships and refine our understanding of the effects of anesthetics on brain activity"

      1. Lu J, Nelson LE, Franks N, Maze M, Chamberlin NL, Saper CB: Role of endogenous sleep-wake and analgesic systems in anesthesia. J Comp Neurol 2008, 508(4):648-662.

      2b. I am confused because Fig 2C seems to show significant decrease in %fos in the hypothalamus, midbrain and cerebellum after KET, while the author responded that " in our analysis, we did not detect regions with significant downregulation when comparing anesthetized mice with controls." Moreover the new figure in the rebuttal in response to reviewer 2 suggests that Ket increases Fos in almost every single region (green vs blue) which is not the conclusion of the paper.

      Your concern regarding the apparent discrepancy is well-founded. The inconsistency arose due to an inappropriate data transformation, which affected the interpretation. We have now rectified this by adjusting the data transformation in Figures 2 and 3. Specifically, we have recalculated the log relative c-Fos density values relative to the control group for each brain region. This revision has resolved the issue, confirming that our analysis did not detect any regions with significant downregulation in the anesthetized mice compared to controls. We have also updated the results, discussion, and methods sections of Figures 2 and 3 to accurately reflect these changes and ensure consistency with our findings.

      Author response image 1.

      Figure 2. Whole-brain distributions of c-Fos+ cells induced by ISO and KET. (A) Hierarchical clustering was performed on the log relative c-Fos density data for ISO and KET using the complete linkage method based on the Euclidean distance matrix, with clusters identified by a dendrogram cut-off ratio of 0.5. Numerical labels correspond to distinct clusters within the dendrogram. (B) Silhouette values plotted against the ratio of tree height for ISO and KET, indicating relatively higher Silhouette values at 0.5 (dashed line), which is associated with optimal clustering. (C) The number of clusters identified in each treatment condition at different ratios of the dendrogram tree height, with a cut-off level of 0.5 corresponding to 4 clusters for both ISO and KET (indicated by the dashed line). (D) The bar graph depicts Z scores for clusters in ISO and KET conditions, represented with mean values and standard errors. One-way ANOVA with Tukey's post hoc multiple comparisons. ns: no significance; ***P < 0.001. (E) Z-scored log relative density of c-Fos expression in the clustered brain regions. The order and abbreviations of the brain regions and the numerical labels correspond to those in Figure 2A. The red box denotes the cluster with the highest mean Z score in comparison to other clusters. CTX: cortex; TH: thalamus; HY: hypothalamus; MB: midbrain; HB: hindbrain.

      Author response image 2.

      Figure 3. Similarities and differences in ISO and KET activated c-Fos brain areas. (A) Hierarchical clustering was performed on the log-transformed relative c-Fos density data for ISO and KET using the complete linkage method based on the Euclidean distance matrix, with clusters identified by a dendrogram cut-off ratio of 0.5. (B) Silhouette values are plotted against the ratio of tree height from the hierarchical clustered dendrogram in Figure 3A. (C) The relationship between the number of clusters and the tree height ratio of the dendrogram for ISO and KET, with a cut-off ratio of 0.5 resulting in 3 clusters for ISO and 5 for KET (indicated by the dashed line). (D) The bar graph depicts Z scores for clusters in ISO and KET conditions, represented with mean values and standard errors. One-way ANOVA with Tukey's post hoc multiple comparisons. ns: no significance; ***P < 0.001. (E) Z-scored log relative density of c-Fos expression within the identified brain region clusters. The arrangement, abbreviations of the brain regions, and the numerical labels are in accordance with Figure 3A. The red boxes highlight brain regions that rank within the top 10 percent of Z score values. The white boxes denote brain regions with an Z score less than -2.

      1. There are still critical misinterpretations of the PCA analysis. For instance, it is mentioned that " KET is associated with the activation of cortical regions (as evidenced by positive PC1 coefficients in MOB, AON, MO, ACA, and ORB) and the inhibition of subcortical areas (indicated by negative coefficients) " as well as " KET displays cortical activation and subcortical inhibition, whereas ISO shows a contrasting preference, activating the cerebral nucleus (CNU) and the hypothalamus while inhibiting cortical areas. To reduce inter-individual variability." These interpretations are in complete contradiction with the answer 2b above that there was no region that had decreased Fos by either anesthetic.

      Thank you for bringing this to our attention. In response to your concerns, we have made significant revisions to our data analysis. We have updated our input data to incorporate log-transformed relative c-Fos density values, normalized against the control group for each brain region, as illustrated in Figures 2 and 3. Instead of PCA, we have applied this updated data to hierarchical clustering analysis. The results of these analyses are consistent with our original observation that neither anesthetic led to a decrease in Fos expression in any region.

      1. I still do not understand the rationale for the use of that metric. The use of a % of total Fos makes the data for each region dependent on the data of the other regions which wrongly leads to the conclusion that some regions are inhibited while they are not when looking at the raw data. Moreover, the interdependence of the variable (relative density) may affect the covariance structure which the PCA relies upon. Why not using the PCA on the logarithm of the raw data or on a relative density compared to the control group on a region-per-region basis instead of the whole brain?

      Thank you for your insightful suggestion. Following your advice, we have revised our approach and now utilize the logarithm of the relative density compared to the control group on a region-by-region basis. We attempted PCA analyses using the logarithm of the raw data, the logarithm of the Z-score, and the logarithm of the relative density compared to control, but none yielded distinct clusters.

      Author response image 3.

      As a result, we employed hierarchical cluster analysis. We then examined the Z-scores of the log-transformed relative c-Fos densities (Figures 2E and 3E) to assess expression levels across clusters. Our analysis revealed that neither ISO nor KET treatments led to a significant suppression of c-Fos expression in the 53 brain regions examined. In the ISO group alone, there were 10 regions that demonstrated relative suppression (Z-score < -2, indicated by white boxes) as shown in Figure 3.

      Fig. 2B: it's unclear to me why the regions are connected by a line. Such representation is normally used for time series/within-subject series. What is the rationale for the order of the regions and the use of the line? The line connecting randomly organized regions is meaningless and confusing.

      Thank you for your suggestion. We have discontinued the use of PCA calculations and have removed this figure.

      Fig 6A. The correlation matrices are difficult to interpret because of the low resolution and arbitrary order of brain regions. I recommend using hierarchical clustering and/or a combination of hierarchical clustering and anatomical organization (e.g. PMID: 31937658). While it is difficult to add the name of the regions on the graph I recommend providing supplementary figures with large high-resolution figures with the name of each brain region so the reader can actually identify the correlation between specific brain regions and the whole brain, Rationale for Metric Choice: Note that I do not dispute the choice of the log which is appropriate, it is the choice of using the relative density that I am questioning.

      Thank you for your constructive feedback. In line with your suggestion, we have implemented hierarchical clustering combined with anatomical organization as per the referenced literature. Additionally, we have updated the vector diagrams in Figure 6A to present them with greater clarity.

      Furthermore, we have revised our network modular division method based on cited literature recommendations. We used hierarchical clustering with correlation coefficients to segment the network into modules, illustrated in Figure 6—figure supplement 1. Due to the singular module structure of the KET network and the sparsity of intermodular connections in the home cage and saline networks, the assessment of network hub nodes did not employ within-module degree Z-score and participation coefficients, as these measures predominantly underscore the importance of connections within and between modules. Instead, we used degree, betweenness centrality, and eigenvector centrality to detect the hub nodes, as detailed in Figure 6—figure supplement 2. With this new approach, the hub node for the KET condition changed from SS to TeA. Corresponding updates have been made to the results section for Figure 6, as well as to the related discussions and the abstract of our paper.

      Author response image 4.

      Figure 6. Generation of anesthetics-induced networks and identification of hub regions. (A) Heatmaps display the correlations of log c-Fos densities within brain regions (CTX, CNU, TH, HY, MB, and HB) for various states (home cage, ISO, saline, KET). Correlations are color-coded according to Pearson's coefficients. The brain regions within each anatomical category are organized by hierarchical clustering of their correlation coefficients. (B) Network diagrams illustrate significant positive correlations (P < 0.05) between regions, with Pearson’s r exceeding 0.82. Edge thickness indicates correlation magnitude, and node size reflects the number of connections (degree). Node color denotes betweenness centrality, with a spectrum ranging from dark blue (lowest) to dark red (highest). The networks are organized into modules consistent with the clustering depicted in Supplementary Figure 8. Figure 6—figure supplement 1

      Author response image 5.

      Figure 6—figure supplement 1. Hierarchical clustering of brain regions under various conditions: home cage, ISO, saline, and KET. (A) Heatmaps show the relative distances among brain regions assessed in naive mice. Modules were identified by sectioning each dendrogram at a 0.7 threshold. (B) Silhouette scores plotted against the dendrogram tree height ratio for each condition, with optimal cluster definition indicated by a dashed line at a 0.7 ratio. (C) The number of clusters formed at different cutoff levels. At a ratio of 0.7, ISO and saline treatments result in three clusters, whereas home cage and KET conditions yield two clusters. (D) The mean Pearson's correlation coefficient (r) was computed from interregional correlations displayed in Figure 6A. Data were analyzed using one-way ANOVA with Tukey’s post hoc test, ***P < 0.001.

      Author response image 6.

      Figure 6—figure supplement 2. Hub region characterization across different conditions: home cage (A), ISO (B), saline (C), and KET (D) treatments. Brain regions are sorted by degree, betweenness centrality, and eigenvector centrality, with each metric presented in separate bar graphs. Bars to the left of the dashed line indicate the top 20% of regions by rank, highlighting the most central nodes within the network. Red bars signify regions that consistently appear within the top rankings for both degree and betweenness centrality across the metrics.

      1. I am still having difficulties understanding Fig. 3.

      Panel A: The lack of identification for the dots in panel A makes it impossible to understand which regions are relevant.

      Panel B: what is the metric that the up/down arrow summarizes? Fos density? Relative density? PC1/2?

      Panel C: it's unclear to me why the regions are connected by a line. Such representation is normally used for time series/within-subject series. What is the rationale for the order of the regions?

      Thank you for your patience and for reiterating your concerns regarding Figure 3.

      a. In Panel A, we have substituted the original content with a display of hierarchical clustering results, which now clearly marks each brain region. This change aids readers in identifying regions with similar expression patterns and facilitates a more intuitive understanding of the data.

      a. Acknowledging that our analysis did not reveal any significantly inhibited brain regions, we have decided to remove the previous version of Panel B from the figure.

      b. We have discontinued the use of PCA calculations and have removed this figure to avoid any confusion it may have caused. Our revised analysis focuses on hierarchical clustering, which are presented in the updated figures.

      Reviewer #2:

      1. Aside from issues with their data transformation (see below), (a) I think they have some interesting Fos counts data in Figures 4B and 5B that indicate shared and distinct activation patterns after KET vs. ISO based anesthesia. These data are far closer to the raw data than PC analyses and need to be described and analyzed in the first figures long before figures with the more abstracted PC analyses. In other words, you need to show the concrete raw data before describing the highly transformed and abstracted PC analyses. (b) This gets to the main point that when selecting brain areas for follow up analyses, these should be chosen based on the concrete Fos counts data, not the highly transformed and abstracted PC analyses.

      Thank you for your suggestions.

      a. We have added the original c-Fos cell density distribution maps for Figures 2, 3, 4, and 5 in Supplementary Figures 2 and 3 (also shown below). To maintain consistency across the document, we have updated both the y-axis label and the corresponding data in Figures 4B and 5B from 'c-Fos cell count' to 'c-Fos density'.

      b. The analyses in Figures 2 and 3 include all brain regions. Figures 4 and 5 present the brain regions with significant differences as shown in Figure 3—figure supplement 1.

      Author response image 7.

      Figure 2—figure supplement 1. The c-Fos density in 53 brain areas for different conditions. (home cage, n = 6; ISO, n = 6 mice; saline, n = 8; KET, n = 6). Each point represents the c-Fos density in a specific brain region, denoted on the y-axis with both abbreviations and full names. Data are shown as mean ± SEM. Brain regions are categorized into 12 brain structures, as indicated on the right side of the graph.

      Author response image 8.

      Figure 3—figure supplement 1. c-Fos density visualization across 201 distinct brain regions under various conditions. The graph depicts the c-Fos density levels for each condition, with data presented as mean and standard error. Brain regions with statistically significant differences are featured in Figures 4 and 5. Brain regions are organized into major anatomical subdivisions, as indicated on the left side of the graph.

      1. Now, the choice of data transformation for Fos counts is the most significant problem. First, the authors show in the response letter that not using this transformation (region density/brain density) leads to no clustering. However, they also showed the region-densities without transformation (which we appreciate) and it looks like overall Fos levels in the control group Home (ISO) are a magnitude (~10-fold) higher than those in the control group Saline (KET) across all regions shown. This large difference seems unlikely to be due to a biologically driven effect and seems more likely to be due to a technical issue, such as differences in staining or imaging between experiments. Was the Homecage-ISO experiment or at least the Fos labeling and imaging performed at the same time as for the Saline-Ketamine experiment? Please state the answer to this question in the Results section one way or the other.

      a. “Home (ISO) are a magnitude (~10-fold) higher than those in the control group saline (KET) across all regions shown.” We believe you might be indicating that compared to the home cage group (gray), the saline group (blue) shows a 10-fold higher expression (Supplementary Figure 2/3). Indeed, we observed that the total number of c-Fos cells in the home cage group is significantly lower than in the saline group. This difference may be due to reduced sleep during the light-on period (ZT 6- ZT 7.5) in the saline mice or the pain and stress response caused by intraperitoneal injection of saline. We have explained this discrepancy in the discussion section.Line 308-317(also see below)

      “…Our findings indicate that c-Fos expression in the KET group is significantly elevated compared to the ISO group, and the saline group exhibits notably higher c-Fos expression than the home cage group, as seen in Supplementary Figures 2 and 3. Intraperitoneal saline injections in the saline group, despite pre-experiment acclimation with handling and injections for four days, may still evoke pain and stress responses in mice. Subtle yet measurable variations in brain states between the home cage and saline groups were observed, characterized by changes in normalized EEG delta/theta power (home cage: 0.05±0.09; saline: -0.03±0.11) and EMG power (home cage: -0.37±0.34; saline: 0.04±0.13), as shown in Figure 1—figure supplement 1. These changes suggest a relative increase in overall brain activity in the saline group compared to the home cage group, potentially contributing to the higher c-Fos expression…”

      b. Drug administration and tissue collection for both Homecage-ISO and Saline-Ketamine groups were consistently scheduled at 13:00 and 14:30, respectively. Four mice were administered drugs and had tissues collected each day, with two from the experimental group and two from the control group, to ensure consistent sampling. The 4% PFA fixation time, sucrose dehydration time, primary and secondary antibody concentrations and incubation times, staining, and imaging parameters and equipment (exposure time for VS120 imaging was fixed at 100ms) were all conducted according to a unified protocol.

      We have included the following statement in the results section: Line 81-83, “Sample collection for all mice was uniformly conducted at 14:30 (ZT7.5), and the c-Fos labeling and imaging were performed using consistent parameters throughout all experiments. ”

      1. Second, they need to deal with this large difference in overall staining or imaging for these two (Home/ISO and Saline/KET) experiments more directly; their current normalization choice does not really account for the large overall differences in mean values and variability in Fos counts (e.g. due to labeling and imaging differences).

      3a. I think one option (not perfect but I think better than the current normalization choice) could be z-scoring each treatment to its respective control. They can analyze these z-scored data first, and then in later figures show PC analyses of these data and assess whether the two treatments separate on PC1/2. And if they don't separate, then they don't separate, and you have to go with these results.

      3b. Alternatively, they need to figure out the overall intensity distributions from the different runs (if that the main reason of markedly different counts) and adjust their thresholds for Fos-positive cell detection based on this. I would expect that the saline and HC groups should have similar levels of activation, so they could use these as the 'control' group to determine a Fos-positive intensity threshold that gets applied to the corresponding 'treatment' group.

      3c. If neither 3a nor 3b is an option then they need to show the outcomes of their analysis when using the untransformed data in the main figures (the untransformed data plots in their responses to reviewer are currently not in the main or supplementary figs) and discuss these as well.

      a. Thank you very much for your valuable suggestion. We conducted PCA analysis on the ISO and KET data after Z-scoring them with their respective control groups and did not find any significant separation.

      Author response image 9.

      As mentioned in our response to reviewer #1, we have reprocessed the raw data. Firstly, we divided the ISO and KET data by their respective control brain regions and then performed a logarithmic transformation to obtain the log relative c-Fos density. The purpose of this is to eliminate the impact of baseline differences and reduce variability. We then performed hierarchical clustering, and finally, we Z-scored the log relative c-Fos density data. The aim is to facilitate comparison of ISO and KET on the same data dimension (Figure 2 and 3).

      b. We appreciate your concerns regarding the detection thresholds for Fos-positive cells. The enclosed images, extracted from supplementary figures for Figures 4 and 5, demonstrate notable differences in c-Fos expression between saline and home cage groups in specific brain regions. These regions exhibit a discernible difference in staining intensity, with the saline group showing enhanced c-Fos expression in the PVH and PVT regions compared to the home cage group. An examination of supplementary figures for Figures 4 and 5 shows that c-Fos expression in the home cage group is consistently lower than in the saline group. This comparative analysis confirms that the discrepancies in c-Fos levels are not due to varying detection thresholds.

      Author response image 10.

      b. We have added the corresponding original data graphs to Supplementary Figures 2 and 3, and discussed the potential reasons for the significant differences between these groups in the discussion section (also shown below).

      Lines 308-317: "...Our findings indicate that c-Fos expression in the KET group is significantly elevated compared to the ISO group, and the saline group exhibits notably higher c-Fos expression than the home cage group, as seen in Supplementary Figures 2 and 3. Intraperitoneal saline injections in the saline group, despite pre-experiment acclimation with handling and injections for four days, may still evoke pain and stress responses in mice. Subtle yet measurable variations in brain states between the home cage and saline groups were observed, characterized by changes in normalized EEG delta/theta power (home cage: 0.05±0.09; saline: -0.03±0.11) and EMG power (home cage: -0.37±0.34; saline: 0.04±0.13), as shown in Figure 3—figure supplement 1. These changes suggest a relative increase in overall brain activity in the saline group compared to the home cage group, potentially contributing to the higher c-Fos expression.…”

    3. Author Response

      The following is the authors’ response to the original reviews.

      We sincerely thank the editor and reviewers for their constructive feedback on our manuscript. Based on their recommendations, we've conducted additional experiments, made revisions to the text and figures, and provide a point-by-point response below.

      Reviewer #1 (Recommendations for the authors):

      1) The lack of behavioral/physiological measures of the depth of anesthesia (ventilation, heart rate, blood pressure, temperature, O2, pain reflexes, etc...) combined with the lack of dose-response and the use of different routes of administration makes the data difficult to interpret. Sure, there is a clear difference in network activation between KET and ISO, but are those effects due to the depth of the anesthesia, the route of administration, and the dose used? The lack of behavioral/physiological measures prevents the identification of brain regions responsible for some of the physiological effects and different effects of anesthetics.

      We greatly appreciate the insightful feedback you have provided.

      In response to the concerns about anesthesia depth:

      a. We recorded EEG and EMG data both before and after drug administration. Supplementary Figure 1 showcases the changes in EEG and EMG power observed 30 minutes post-drug administration, normalized to a 5-minute baseline taken prior to the drug's administration. Notably, no significant differences were detected in the normalized EEG and EMG power between the ISO and KET groups. Given the marked statistical differences observed between the EEG power in the KET and saline groups, and the EMG power in the home cage and ISO groups, we infer that both anesthetics effectively induced a loss of consciousness.

      b. We used standard methods and doses for inducing c-Fos expression with anesthetics, as documented in prior studies (Hua, T, et al., Nat Neurosci, 2020; 23(7): 854-868; Jiang-Xie, L F, et al., Neuron, 2019; 102(5): 1053-1065.e4; Lu, J, et al., J Comp Neurol, 2008; 508(4): 648-62). In future research, it might be more optimal to adopt continuous intraperitoneal or intravenous administration of ketamine.

      c. Within the scope of our study, while disparities in anesthesia duration might potentially influence the direct statistical comparison of ISO and KET, such disparities wouldn't compromise the identification of brain regions activated by KET or ISO when assessed as distinct stimuli (ISO vs. home cage; KET vs. saline) or in relation to their individual functional network hub node results.

      We hope these additions and clarifications adequately address your concerns and enhance the comprehensibility of our data.

      2) Under anesthesia there should be an overall reduction of activity, is that the case? There is no mention of significantly downregulated regions. The authors use multiple transformations of the data to interpret the results (%, PC1 values, logarithm) without much explanation or showing the full raw data in Fig 1. It would be helpful to interpret the data to compare the average fos+ neurons in each region between treatment and control for each drug.

      Absence of Significantly Downregulated Regions Under Anesthesia: There are two primary reasons for this observation:

      a. Our study's sampling time for the home cage, ISO, saline, and KET groups was during Zeitgeber Time (ZT) 6-7.5. During this period, mice in both the home cage and saline groups typically showed reduced spontaneous activity or were in a sleep state. Our Supplementary Figure 1 EEG and EMG data corroborate this, revealing no significant statistical variations in EEG power between the home cage and ISO groups, nor in EMG power between the saline and KET groups.

      b. Our immunohistochemical data showed that the total number of c-Fos positive cells in the two control groups was notably lower than in the experimental groups (Saline group vs KET group: 11808±2386 versus 308705±106131, P = 0.006; Home cage vs ISO group: 3371±840 vs 12326±1879, P = 0.001). This is in line with previous studies, like the one by Cirelli C and team, which found minimal c-Fos expression throughout the mouse brain during physiological sleep (Cirelli, C, and G Tononi, Sleep, 2000; 23(4): 453-69). Thus, in our analysis, we did not detect regions with significant downregulation when comparing anesthetized mice with controls.

      Interpreting Raw Data from Figure 1: Regarding the average Fos+ neurons:

      In Figures 4 and 5, we utilized raw data (c-Fos cell count) to assess cell expression differences across 201 brain regions within each group. Only brain regions that had significant statistical differences after multiple comparison corrections are shown in the figures.

      3) I do not understand their interpretation of the PCA analyses. For instance, in Fig 2 they claim that KET is associated with PC1 while ISO is associated with PC2. Looking at the distribution of points it's clear that the KET animals are all grouped at around +2.5 on PC1 and -2.0 on PC2, this means that KET is associated with both PC1 and PC2 to a similar degree (2 to 2.5). Moreover, I'm confused about why they use PCA to represent the animals/group. PCA is a powerful technique to reduce dimensionality and identify groups of variables that may represent the same underlying construct; however, it is not the best way to identify clusters of individuals or groups.

      Clarification on PCA Analyses in Figure 2: Thank you for pointing out the ambiguities in our initial presentation of the PCA analyses. We are grateful for the opportunity to address these concerns.

      KET and ISO Associations with PC1 and PC2: You rightly observed that KET samples manifest both a positive value on PC1 (around +2.5) and a negative one on PC2 (around -2.0), suggesting that KET has a substantial influence on both principal components. In PCA, a positive score implies a positive association with that component, whereas a negative score suggests a negative association. Contrarily, ISO samples predominantly exhibit values around +2.5 on PC2, with nearly neutral values for PC1, underlining its stronger association with PC2 and lack of significant correlation with PC1. To ensure transparency and clarity, we've adjusted the corresponding descriptions in our manuscript, which can be found on Line 100.

      Rationale Behind Using PCA to Represent Animals/Groups: Our initial step was to conduct PCA clustering analysis on the 201 brain regions within both the ISO and KET groups. In the accompanying chart, varying colors denote different brain regions, while distinct shapes represent separate clusters. There wasn't a pronounced distribution pattern within the ISO and KET groups, which led us to adopt the current computational method presented in the paper. This approach was chosen to directly contrast the relative differential expressions between ISO and KET.

      We deeply value your feedback, which has steered us toward a clearer and more accurate presentation of our data. We genuinely appreciate your meticulous review.

      Author response image 1.

      4) The actual metric used for the first PCA is unclear, is it the FOS density in each of the regions (some of those regions are large and consist of many subregions, how does that affect the analysis) is it the %-fos, or normalized cells? The wording describing this is variable causing some confusion. How would looking at these different metrics influence the analysis?

      Thank you for raising concerns about the metrics used in our PCA analysis. We recognize the need for clearer exposition and appreciate the opportunity to clarify.

      PCA Metrics: The metric for our PCA is calculated by obtaining the ratio of the Fos density within a specific brain region to the global Fos density across the brain. Briefly, this entails dividing the number of Fos-positive cells in a given region by its volume, and then comparing this to the Fos density of the whole brain. The logarithm of this ratio provides our PCA metric. We've elaborated on this in the Materials and Methods section (Lines 401) and enhanced clarity in our revised manuscript, particularly at Line 96.

      In Figure 2A, we employed 53 larger, mutually exclusive brain regions based on the reference from the study by Do et al. (eLife, 2016;5:e13214). However, in Figure 3A, we used a more detailed segmentation, incorporating 201 distinct brain areas that are more granular than those in Figure 2A. Notably, the PCA results from both representations were consistent. The rationale behind selecting either the 53 or 201 brain regions can be found in our response to Question 10.

      Rationale for Metric Choice: The log ratio of regional c-Fos densities relative to the global brain density was chosen due to:

      a. Notable disparities in c-Fos cell expression across the groups.

      b. A significant non-normal distribution of density values across animals within the group. Employing the log ratio effectively mitigates the impact of extreme values and outliers, achieving a more standardized data distribution.

      We've added PCA plots based on c-Fos densities, depicted in Author response image 2. However, the data dispersion has resulted in a significantly spread-out horizontal scale for these visuals.

      Author response image 2.

      5) Based on Fig 3 the authors concludes that ISO activates the hypothalamic regions and inhibits the cortex, however, Fig 1 shows neither an activation of the hypothalamus in the ISO nor an inhibition of the cortex when compared to home cage control. If anything it suggests the opposite.

      Thank you for your insightful observations regarding the discrepancies between Figures 2 and 3. We believe that when you refer to Figure 1, you are actually referencing Figure 2C.

      ISO activation in Hypothalamus: In Figure 2C, we regret the oversight where we inadvertently interchanged the positions of ISO and Saline. When accurately represented, Figure 2C indeed shows that ISO notably activates the periventricular zone (PVZ) and the lateral zone (LZ) of the hypothalamus compared to the home cage group. Moreover, there's a discernible difference in the hypothalamic response between ISO and KET.

      ISO's Effect on the Cortex: The main aim of Figure 3 was to highlight the differing responses between ISO and KET in the cortex. Notably, KET demonstrates a positive correlation with PC1 (+7 on PC1), whereas ISO shows a negative association (-3 on PC1). Given that the coefficient of PC1 for the cortical region is positive, it suggests that the cortical areas activated by KET are inhibited by ISO (with KET's distribution around 0 on PC2). However, the divergence between ISO and the home cage is most apparent in PC2, with ISO clusters at +4 and the home cage approximately at -2, suggesting that ISO activates a different set of cortical nuclei. In alignment with this, Figure 2C also illustrates that ISO activates specific cortical areas, such as ILA and PIR, in contrast to the home cage.

      Thus, Figure 3 primarily employs PCA to delineate the contrasts between ISO and KET, whereas Figure 2C emphasizes the comparison of each against their respective controls.

      6) Control for isoflurane should be air in the induction chamber rather than home cage. It is possible that Fos activation reflects handling/stress pre-anesthesia in the animals, which would increase Fos expression in the stress-related regions such as the BST, striatum (CeA), hypothalamus (PVH) and potentially the LC.

      Thank you for emphasizing the importance of an appropriate control for Isoflurane.

      In our efforts to minimize the potential impact of stress-induced c-Fos expression, we implemented several precautionary measures. Prior to the experiment, both groups of mice were subjected to handling and acclimatization within the induction chamber over four days. By the day of the experiment, for the mice in the experimental group, we ensured they were comfortable and exhibited no signs of distress or fear—such as cowering or evading. With care, we slowly relocated them to the nearby anesthesia induction chamber. Using 5% ISO, anesthesia was induced promptly, following a meticulously devised protocol to reduce stress impacts on c-Fos expression.

      Moreover, existing studies have shown Isoflurane's activation of BST/CeA (Hua, T, et al., Nat Neurosci, 2020, 23: 854-868), PVH (Xu, Z, et al., British Journal of Anaesthesia, 2023, 130: 446-458), and LC (Lu, J, et al., J Comp Neurol, 2008, 508: 648-62), even when using oxygen controls. Such literature supports our findings, indicating that the activation we observed was indeed due to Isoflurane and not purely stress-related.

      7) In the Ket network there are a few anticorrelated regions, most of which are amongst the list of the most activated regions, does this mean that the strong correlation results from an overall decreased activation? And if so, is it possible that the ketamine anesthesia was stronger than the isoflurane, causing a more general reduction in activity?

      The pronounced correlations observed within the ketamine (KET) network do not signify a generalized decrease in activation. Instead, these correlations reflect significantly enhanced activity in specific regions under KET anesthesia. This amplified correlation is an indication of a more widespread increase in activity, rather than a decrease. These findings are consistent with previous research, which showed that anesthetic doses of ketamine produce patterns of Fos expression in the CNS similar to wakefulness (Lu, J, et al., J Comp Neurol, 2008; 508(4): 648-62).

      Regarding the comparative strength of KET versus ISO anesthesia, our electroencephalographic evidence confirms that both agents induce a loss of consciousness. No significant differences were observed in EEG and EMG readings within the first 30 minutes post-administration. In future research, a continuous intravenous or intraperitoneal administration of KET might be a preferable method.

      8) Since they have established networks it would be easy and useful to look at how the different regions identified (sleep, pain, neuroendocrine, motor-related, ...) work together to maintain analgesia, are they within the same module? Do they become functionally connected and is this core network of functional connections similar for KET and ISO?

      Thank you for your suggestion. In response to your inquiry, we undertook analysis of the core functional networks for KET and ISO, using a set threshold at r>0.82 and P<0.05. For evaluating the modularity of each network, we utilized Newman's spectral community detection algorithm.

      (A) The ISO’s core functional network (56 nodes, 372 edges) predominantly divides into two modules with a modularity quotient of 0.345. ISO-active regions include arousal-associated regions (PL, ILA, PVT), analgesia-related (CeA, LC, PB), neuroendocrine function nuclei (TU, PVi, ARH, PVH, SON) as detailed in Figure 5. Notably, ARH and SON weren't incorporated into the core network. Analgesia-associated regions, such as CeA, LC, and PB, reside within module 1, while neuroendocrine nuclei are spread between modules 1 and 2.

      (B) In contrast, KET's core functional network (61 nodes, 1820 edges) splits into three distinct modules, but its low modularity quotient (0.06) indicates a lack of clear functional modularization, suggesting denser interconnections among brain regions. Furthermore, functionally-related regions such as arousal (PL, ILA, PVT, DR), analgesia-related (ACA, APN, PAG, LC), and neuroendocrine regulation (PVH, SON),etc., as seen in Figure 4, are distributed across different modules. This distribution may implies that functions like analgesia and neuroendocrine regulation are not governed by simple, linear processes, but arise from complex, overlapping pathways spanning various modules and functional zones.

      In summary, the core functional networks of ISO and KET differ, with functionally-related regions spanning multiple modules, reflecting their diverse roles in varied physiological regulations.

      Author response image 3.

      9) The naming of the function of some of the regions is very much debatable. For instance, PL/ILA are named "sleep-wakefulness regulation" regions in the paper. I can think of many more important functions of the PL/IL including executive functions, behavioral flexibility, and emotional control. It is unclear how the functions of all the regions were attributed. I am not sure that this biased labeling of structure-function is useful to the reports, it may instead suggest wrong conclusions.

      Thank you for your thoughtful feedback regarding our classification of the functions of the PL/ILA regions in our manuscript.

      We recognize the challenge in accurately defining the functions of brain regions. While there is evidence highlighting the role of PL/ILA in arousal pathways, we also acknowledge their documented roles in executive functions, behavioral flexibility, and emotional control. In response to your comments, we have refined our description, changing "sleep-wakefulness regulation" to "wake-promoting pathways" (see Line: 159, 164).

      It's worth noting that many brain regions, including the PL/ILA, have multiple functions. We agree that a single label might not capture the entirety of their roles. To provide a broader perspective, we will add a section in our manuscript that sheds light on the varied functions of these regions (Line: 181).

      10) A point of concern and confusion is the number of brain regions analyzed. In the introduction, it is mentioned that 987 brain regions are considered, but this is reduced to 53 selected brain regions in Figure 2, then 201 brain regions in Figure 3, and reduced again to 63 for the network analysis. The rationale for selecting different brain regions is not clear.

      For the 987 brain regions: Using the standard mouse atlas available at http://atlas.brain-map.org/, the mouse brain is organized into nine levels. The broadest category is the grey matter, which then progresses to more specific subdivisions, totaling 987 unique regions.

      For the 53 brain regions: To effectively understand the activation patterns of ISO and KET, we started with a broad approach, looking at larger brain areas like the thalamus and hypothalamus. This broad view, presented in Figure 2, focuses on the 5th-level brain regions, encompassing 53 primary areas. This methodology is also employed in the study by Do et al. (Elife, 2016; 5: e13214). We have added the rationale for selecting these brain regions in the main text (Line: 92).

      Regarding the 201 brain regions in Figures 3, 4, and 5: We delved deeper, examining the 6th-level brain regions, a common granularity in neuroscience research. This detailed view allowed us to highlight specific areas, like the CeA and PVH (Line:129).

      Finally, for Figures 6 and 7, we selected 63 regions that were activated by both ISO and KET, as well as regions previously reported to be related to the mechanism of general anesthesia(Leung, L, et al., Progress in neurobiology, 2014; 122: 24-44) (Line: 220). Using these regions, we analyzed the correlation of c-Fos expression, aiming to construct a functional brain network with strong positive connections.

      We hope this clarifies our approach and the rationale behind our region selection at each stage of the study. Thank you for your attention to this detail.

      11) The statistical analysis does not seem appropriate considering the high number of comparisons. They use simple t-tests without correction for multiple comparisons.

      Thank you for pointing out the concern regarding our statistical analysis. In the revised manuscript, we addressed the issue of multiple comparisons correction in our t-tests. We adopted the statistical methods detailed in the papers by Renier, N, et al., Cell, 2016; and Benjamini, Y, and Y Hochberg, 1995. P-values were adjusted for multiple comparisons using the two-stage linear step-up procedure of Benjamini, Krieger, and Yekutieli, with a false discovery rate (FDR) threshold (Q) of 0.05. This approach is now explained in the Materials and Methods section (Line: 434). After this adjustment, the brain regions we initially identified remained statistically significant. Furthermore, we revisited the original immunohistochemical images to confirm the differences in c-Fos cell expression between the experimental and control groups, reinforcing our conclusions.

      12) There is no statistical analysis in Fig 2C。

      Thank you for bringing to our attention the lack of statistical analysis in Fig 2C. We have now added the relevant statistical data in Supplementary Table 1 and provided annotations in Fig 2C to reflect this.

      Reviewer #2

      1) The authors report 987 brain regions in the introduction, but I cannot find any analysis that incorporates these or even which regions they are. Very little rationale is provided for the regions included in any of the analyses and numbers range from 53 in Figure 1, to 201 in Figure 3, to 63 in Figure 6. It would help if the authors could first survey Fos+ counts across all regions to identify a subset that is of interest (significantly changed by either condition compared to control) for follow up analysis.

      Thank you for your insightful comments on the number of brain regions analyzed in our study.

      987 Brain Regions: The reference to 987 brain regions from the standard mouse atlas (http://atlas.brain-map.org/) represents the entire categorization of the mouse brain across nine levels. We recognize that a comprehensive analysis of all these regions would be valuable, but to ensure clarity and depth, we took a focused approach.

      Region Selection Rationale:

      Figure 2: Concentrated on 5th-level brain regions (53 areas), inspired by methods from Do et al. (eLife, 2016;5:e13214). This provided a broad overview of c-Fos expression differences. Figures 4 and 5: Delved into 6th-level brain regions (201 areas), a common practice in neuroscience for more detailed study. Figure 6: We focused on 63 regions, which encompass not only the regions activated by both ISO and KET but also those previously reported to be associated with the mechanisms of general anesthesia. Methodological Approach: Our region selection was rooted in identifying areas with significant changes under anesthetic conditions compared to controls. This staged approach allowed a targeted analysis of the most affected regions, ensuring robust conclusions.

      Enhancements: We've incorporated comparative analyses of activated brain regions at different hierarchical levels in Figures 4 and 5. For clearer comprehension, we’ve added clarifications in the manuscript at Lines: 92, 130, and 220.

      2) Different data transformations are used for each analysis. One that is especially confusing is the 'normalization' of brain regions by % of total brain activation for each animal prior to PCA analysis in Figures 2 and 3. This would obscure any global differences in activation and make it unlikely to observe decreases in activation (which I think is likely here) that could be identified using the Fos+ counts after normalizing for region size (ie. Fos+ count / mm3) which is standard practice in such Fos-based activity mapping studies. While PCA can be powerful approach to identify global patterns, the purpose of the analysis in its current form is unclear. It would be more meaningful to show that regional activation patterns (measured as counts/mm3) are on separate PCs by group.

      Thank you for your thoughtful comments. We regret any confusion caused by our initial presentation. For the PCA analysis in Figures 2A and 3A, we calculated the ratio of cell density in each brain region to the overall brain density, and then applied a logarithmic transformation to this ratio. Our approach in Figure 2C was to use the proportion of c-Fos cell counts in individual brain regions to the total cell counts throughout the brain. This methodology considers variations in overall c-Fos cell counts across animals, effectively mitigating potential biases due to differential global activation levels across subjects.

      Furthermore, our direct comparison of differences in c-Fos cell counts between ISO, KET, and their respective control groups in Figures 4 and 5 addresses your concerns about potential decreases in activation. Notably, we did not identify any brain regions with significant suppression in these figures, which is consistent with the trends observed post-normalization in Figure 2C.

      Given your feedback, we conducted another PCA using cell densities for each region (counts/mm3). However, we found significant variability and non-normal distribution of c-Fos density across the groups, leading to extensive data dispersion. Consequently, normalizing the cell counts across regions and then applying a logarithmic transformation before PCA might be more appropriate.

      Author response image 4.

      Additionally, our exploration of regional activation patterns using PCA analysis for ISO and KET separately, based on the logarithm ratio of the c-Fos density, revealed that there was no distinct clustering feature among the different brain regions (as illustrated in Author response image 5: colors represented distinct brain regions, while the shapes were indicative of different clusters). This observation further suggests that our original statistical approach might be more suitable.

      Author response image 5.

      3) Critical problem: The authors include a control group for each anesthetic (ketamine vs. saline, isofluorane vs. homecage) but most analyses do not make use of the control groups or directly compare Fos+ counts across the groups. Strictly speaking, they should have compared relative levels of induction by ketamine versus induction by isoflurane using ANOVAs. Instead, each type of induction was separate from the other. This does not account for increased variability in the ketamine versus isoflurane groups. There is no mention in the Statistics section or in Results section that any multiple comparison corrections were used. It appears that the authors only used Students t-test for each region and did not perform any corrections.

      We appreciate the reviewer's insights and have addressed your concerns:

      Given the pronounced difference in c-Fos cell count expression between the KET and ISO groups, a direct comparison of Fos+ counts may not effectively capture their inherent disparities. To better highlight these distinctions, we used the logarithm ratio of c-Fos density in our PCA analysis (Figure 3), mitigating potential disparities in overall cell counts between samples and emphasizing relative variations. However, in response to your feedback, we've included additional analyses. Author response image 6 depicts the c-Fos density (cells/mm^3) across different brain regions for the home cage, ISO, saline, and KET groups, with regions like the cerebral cortex, cerebral nuclei, thalamus, and others differentiated by shaded backgrounds. Data are represented as mean ± SEM. We performed a one-way ANOVA followed by Tukey’s post hoc test, marking significant differences between ISO and KET with asterisks: P < 0.001, P < 0.01, P < 0.05.

      Regarding multiple comparison corrections, we've conducted thorough analyses on the data in Figure 2C and Figures 4, 5, and 6, implementing multiple comparison corrections. The detailed methodology is provided in the “Statistical analysis” section.

      Author response image 6.

      4) Figures 4 and 5 show brain regions 'significantly activated' following KET or ISO respectively, but again a subset of regions are shown and the stats seem to be t-tests with no multiple comparisons correction. It would help to show these two figures side by side, include the same regions, and keep the y axis ranges similar so the reader can easily compare the 'activation patterns' across the two treatments. Indeed, it looks like KET/Saline induced activation is an order or magnitude or two higher than ISO/Homecage. I would also recommend that this be the first data figure before any other analyses and maybe further analysis could be restricted to regions that are significantly changed in following KET or ISO here.

      Thank you for your constructive feedback regarding Figures 4 and 5.

      Comparison and Presentation of Figures 4 and 5: We acknowledge your suggestion to present these figures side by side for easier comparison. In the supplementary figure provided in the previous question, we've placed Figures 4 and 5 adjacent to each other, with consistent y-axis ranges, ensuring that readers can make direct comparisons between the activation patterns elicited by KET and ISO.

      Statistical Concerns and Region Selection: As mentioned in our previous response, we have conducted multiple comparison corrections on the data presented in Figures 4 and 5. Detailed procedures are elaborated in the “Statistical analysis” section. We believe this approach addresses your concerns regarding the use of t-tests without corrections for multiple comparisons.

      Difference in Activation Levels: We observed that the c-Fos activation due to KET is significantly higher than that from ISO. When presented side-by-side using the same scale, ISO activations appear less prominent, potentially mask subtle differences in the activation patterns of ISO, particularly if both KET and ISO showed changes in the same direction in certain brain regions but differed in magnitude. To address this, we used the proportion of c-Fos cell counts in Figure 2C, the logarithm ratio of c-Fos density in Figure 2A and Figure 3. This method emphasizes the relative changes, rather than absolute values, giving a more balanced view of the effects of each treatment.

      5) Analyses in Figure 6 and 7 are interesting but again the choice of regions to include is unclear and makes interpreting the results impossible. For example, in Figure 7 it is unclear why the list of regions in bar graphs showing Degree and Betweenness Centrality are not the same even within a single row?

      Thank you for your pertinent observation. The choice of brain regions in Figures 6 and 7 was carefully determined based on two main criteria: regions that were significantly activated by ISO or KET within the scope of our study, and those previously reported to be associated with anesthesia mechanisms and sleep-wake regulation.

      Regarding your second concern on Figure 7, the discrepancies observed in the x-axes of the bar graphs arise from our methodological approach. We prioritized presenting the top 20% of regions based on their Degree or Betweenness Centrality values. By separately ranking these regions from highest to lowest, the regions presented for each metric inherently differ. This approach was taken to elucidate nodes that consistently emerge as significant across both metrics, thereby highlighting core nodes in the functional network. Were we to use a consistent x-axis without this ranking, it would not only necessitate a more extensive presentation but might also dilute the emphasis on key information. To clarify this methodology and its rationale for our readers, we have expanded upon this in the manuscript at Line 243.

      We hope these clarifications address your concerns and facilitate a clearer understanding of our findings.

      Reviewer #1 (Recommendations For The Authors):

      Minor points

      1) In Table 1: the separation of which substructures belong to which brain structure is not clear

      2) Line 132 on page 3 seems to repeat the sentence earlier in the paragraph "KET predominantly affects brain regions within the cerebral cortex (CTX), while significantly inhibiting the hypothalamus, midbrain, and hindbrain."

      3) Typos

      a) Line 99/100 and 130 Central nucleus (CNU) should be cerebral nucleus

      b) Comma on line 166

      c) Fig. 4D: KET instead of Keta

      d) Line 263 "ep"

      e) Line 332: 35" "ml (add space)

      4) Will data and code be made available?

      Thank you for your detailed feedback.

      1. We have revised Table 1 to clarify which substructures belong to which brain structures.

      2. We acknowledge the redundancy and have now edited line 139 on page 3 to remove the repeated sentence regarding the effects of KET on brain regions.

      3. We have addressed the typos you pointed out:

      a. The terms "Central nucleus (CNU)" have been corrected to "cerebral nucleus."

      b. The comma issue on line 166 has been rectified.

      c. In Fig. 4D, we have corrected "Keta" to "KET."

      d. We have corrected the typo "ep" on line 263.

      e. A space has been added between "35" and "ml" on line 332 as you indicated.

      1. Regarding the availability of data and code, we are currently conducting additional analyses related to this study. Once these analyses are completed, we will be more than happy to make the data and code available.

      Thank you for assisting us in improving our manuscript.

      Reviewer #2 (Recommendations For The Authors):

      Minor comments:

      6) The term 'whole-brain mapping' in the title suggests that the mapping was performed on 'intact brains' where in fact serial sections were used here. Maybe the authors could change to 'brain-wide mapping' to align better with the study.

      Thank you for your insightful comments.

      We have revised the title as suggested, changing "whole-brain mapping" to "brain-wide mapping".

      7) It is unclear if the mice were kept under anesthesia for the 90-min duration and how the authors monitored the level of sedation. Additionally, if the KET mice were already sedated why were they further sedated with ISO before perfusions and tissue extraction? The methods should be clarified and any potential confounds discussed.

      To maintain consistency in the experimental protocol and to reduce stress reactions in the mice, ISO was used before perfusion in all cases. However, this does not affect c-Fos expression as the expression of c-Fos protein starts 20-30 minutes after stimulation (Lara Aparicio, S Y, et al., NeuroSci, 2022; 3(4): 687-702).

      We appreciate your guidance in enhancing the clarity of our manuscript.

      Reviewer #3 (Recommendations For The Authors):

      Recommendation: Minor corrections.

      1) The authors should delve deeper into the molecular mechanisms underlying the observed effects, particularly the changes associated with NMDA and GABA receptors. Exploring these mechanisms would provide a more comprehensive understanding of how Ketamine and Isoflurane modulate neural activity and induce anesthesia.

      2) The clinical relevance of these findings has not been sufficiently addressed. It would be valuable to elaborate on how the current research outcomes could potentially lead to changes in current anesthesia practices. For instance, identifying the distinct pathways of action for Ketamine and Isoflurane could aid anesthesiologists in selecting the most appropriate anesthetic based on the specific needs of individual patients or surgical procedures.

      3) Both Ketamine and Isoflurane have been associated with neurotoxicity. It is important to discuss how the c-Fos activation induced by these anesthetics could contribute, at least partially, to anesthesia-related neurotoxicity. Examining the potential neurotoxic effects would provide a more comprehensive understanding of the risks associated with these anesthetics and aid in the development of safer anesthesia protocols.

      Thank you for your valuable suggestions.

      Regarding the three points (1, 2, and 3) you've raised, we fully recognize their significance. In the current study, our primary focus was on the differential impacts of Isoflurane and Ketamine on widespread c-Fos expression in the brain. However, we indeed acknowledge the importance of delving deeper into these mechanisms and their clinical relevance. Therefore, we intend to explore these critical issues in greater detail in our future research endeavors.

      We appreciate your feedback, which provides constructive guidance for our subsequent research directions.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Responses to Reviewer’s Comments:  

      To Reviewer #2:

      (1) The use of two m<sup>5</sup>C reader proteins is likely a reason for the high number of edits introduced by the DRAM-Seq method. Both ALYREF and YBX1 are ubiquitous proteins with multiple roles in RNA metabolism including splicing and mRNA export. It is reasonable to assume that both ALYREF and YBX1 bind to many mRNAs that do not contain m<sup>5</sup>C. 

      To substantiate the author's claim that ALYREF or YBX1 binds m<sup>5</sup>C-modified RNAs to an extent that would allow distinguishing its binding to non-modified RNAs from binding to m<sup>5</sup>Cmodified RNAs, it would be recommended to provide data on the affinity of these, supposedly proven, m<sup>5</sup>C readers to non-modified versus m<sup>5</sup>C-modified RNAs. To do so, this reviewer suggests performing experiments as described in Slama et al., 2020 (doi: 10.1016/j.ymeth.2018.10.020). However, using dot blots like in so many published studies to show modification of a specific antibody or protein binding, is insufficient as an argument because no antibody, nor protein, encounters nanograms to micrograms of a specific RNA identity in a cell. This issue remains a major caveat in all studies using so-called RNA modification reader proteins as bait for detecting RNA modifications in epitranscriptomics research. It becomes a pertinent problem if used as a platform for base editing similar to the work presented in this manuscript.

      The authors have tried to address the point made by this reviewer. However, rather than performing an experiment with recombinant ALYREF-fusions and m<sup>5</sup>C-modified to unmodified RNA oligos for testing the enrichment factor of ALYREF in vitro, the authors resorted to citing two manuscripts. One manuscript is cited by everybody when it comes to ALYREF as m<sup>5</sup>C reader, however none of the experiments have been repeated by another laboratory. The other manuscript is reporting on YBX1 binding to m<sup>5</sup>C-containing RNA and mentions PARCLiP experiments with ALYREF, the details of which are nowhere to be found in doi: 10.1038/s41556-019-0361-y.

      Furthermore, the authors have added RNA pull-down assays that should substitute for the requested experiments. Interestingly, Figure S1E shows that ALYREF binds equally well to unmodified and m<sup>5</sup>C-modified RNA oligos, which contradicts doi:10.1038/cr.2017.55, and supports the conclusion that wild-type ALYREF is not specific m<sup>5</sup>C binder. The necessity of including always an overexpression of ALYREF-mut in parallel DRAM experiments, makes the developed method better controlled but not easy to handle (expression differences of the plasmid-driven proteins etc.) 

      Thank you for pointing this out. First, we would like to correct our previous response: the binding ability of ALYREF to m<sup>5</sup>C-modified RNA was initially reported in doi: 10.1038/cr.2017.55, (and not in doi: 10.1038/s41556-019-0361-y), where it was observed through PAR-CLIP analysis that the K171 mutation weakens its binding affinity to m<sup>5</sup>C -modified RNA.

      Our previous experimental approach was not optimal: the protein concentration in the INPUT group was too high, leading to overexposure in the experimental group. Additionally, we did not conduct a quantitative analysis of the results at that time. In response to your suggestion, we performed RNA pull-down experiments with YBX1 and ALYREF, rather than with the pan-DRAM protein, to better validate and reproduce the previously reported findings. Our quantitative analysis revealed that both ALYREF and YBX1 exhibit a stronger affinity for m<sup>5</sup>C -modified RNAs. Furthermore, mutating the key amino acids involved in m<sup>5</sup>C recognition significantly reduced the binding affinity of both readers. These results align with previous studies (doi: 10.1038/cr.2017.55 and doi: 10.1038/s41556-019-0361-y), confirming that ALYREF and YBX1 are specific readers of m<sup>5</sup>C -modified RNAs. However, our detection system has certain limitations. Despite mutating the critical amino acids, both readers retained a weak binding affinity for m<sup>5</sup>C, suggesting that while the mutation helps reduce false positives, it is still challenging to precisely map the distribution of m<sup>5</sup>C modifications. To address this, we plan to further investigate the protein structure and function to obtain a more accurate m<sup>5</sup>C sequencing of the transcriptome in future studies. Accordingly, we have updated our results and conclusions in lines 294-299 and discuss these limitations in lines 109114.

      In addition, while the m<sup>5</sup>C assay can be performed using only the DRAM system alone, comparing it with the DRAM<sup>mut</sup> control enhances the accuracy of m<sup>5</sup>C region detection. To minimize the variations in transfection efficiency across experimental groups, it is recommended to use the same batch of transfections. This approach not only ensures more consistent results but also improve the standardization of the DRAM assay, as discussed in the section added on line 308-312.

      (2) Using sodium arsenite treatment of cells as a means to change the m<sup>5</sup>C status of transcripts through the downregulation of the two major m<sup>5</sup>C writer proteins NSUN2 and NSUN6 is problematic and the conclusions from these experiments are not warranted. Sodium arsenite is a chemical that poisons every protein containing thiol groups. Not only do NSUN proteins contain cysteines but also the base editor fusion proteins. Arsenite will inactivate these proteins, hence the editing frequency will drop, as observed in the experiments shown in Figure 5, which the authors explain with fewer m<sup>5</sup>C sites to be detected by the fusion proteins.

      The authors have not addressed the point made by this reviewer. Instead the authors state that they have not addressed that possibility. They claim that they have revised the results section, but this reviewer can only see the point raised in the conclusions. An experiment would have been to purify base editors via the HA tag and then perform some kind of binding/editing assay in vitro before and after arsenite treatment of cells.

      We appreciate the reviewer’s insightful comment. We fully agree with the concern raised. In the original manuscript, our intention was to use sodium arsenite treatment to downregulate NSUN mediated m<sup>5</sup>C levels and subsequently decrease DRAM editing efficiency, with the aim of monitoring m<sup>5</sup>C dynamics through the DRAM system. However, as the reviewer pointed out, sodium arsenite may inactivate both NSUN proteins and the base editor fusion proteins, and any such inactivation would likely result in a reduced DRAM editing.

      This confounds the interpretation of our experimental data.

      As demonstrated in Author response image 1A, western blot analysis confirmed that sodium arsenite indeed decreased the expression of fusion proteins. In addition, we attempted in vitro fusion protein purificationusing multiple fusion tags (HIS, GST, HA, MBP) for DRAM fusion protein expression, but unfortunately, we were unable to obtain purified proteins. However, using the Promega TNT T7 Rapid Coupled In Vitro Transcription/Translation Kit, we successfully purified the DRAM protein (Author response image 1B). Despite this success, subsequent in vitro deamination experiments did not yield the expected mutation results (Author response image 1C), indicating that further optimization is required. This issue is further discussed in line 314-315.

      Taken together, the above evidence supports that the experiment of sodium arsenite treatment was confusing and we determined to remove the corresponding results from the main text of the revised manuscript.

      Author response image 1.

      (3) The authors should move high-confidence editing site data contained in Supplementary Tables 2 and 3 into one of the main Figures to substantiate what is discussed in Figure 4A. However, the data needs to be visualized in another way then excel format. Furthermore, Supplementary Table 2 does not contain a description of the columns, while Supplementary Table 3 contains a single row with letters and numbers.

      The authors have not addressed the point made by this reviewer. Figure 3F shows the screening process for DRAM-seq assays and principles for screening highconfidence genes rather than the data contained in Supplementary Tables 2 and 3 of the former version of this manuscript.

      Thank you for your valuable suggestion. We have visualized the data from Supplementary Tables 2 and 3 in Figure 4A as a circlize diagram (described in lines 213-216), illustrating the distribution of mutation sites detected by the DRAM system across each chromosome. Additionally, to improve the presentation and clarity of the data, we have revised Supplementary Tables 2 and 3 by adding column descriptions, merging the DRAM-ABE and DRAM-CBE sites, and including overlapping m<sup>5</sup>C genes from previous datasets.

      Responses to Reviewer’s Comments:  

      To Reviewer #3:

      The authors have again tried to address the former concern by this reviewer who questioned the specificity of both m<sup>5</sup>C reader proteins towards modified RNA rather than unmodified RNA. The authors chose to do RNA pull down experiments which serve as a proxy for proving the specificity of ALYREF and YBX1 for m<sup>5</sup>C modified RNAs. Even though this reviewer asked for determining the enrichment factor of the reader-base editor fusion proteins (as wildtype or mutant for the identified m<sup>5</sup>C specificity motif) when presented with m<sup>5</sup>C-modified RNAs, the authors chose to use both reader proteins alone (without the fusion to an editor) as wildtype and as respective m<sup>5</sup>C-binding mutant in RNA in vitro pull-down experiments along with unmodified and m<sup>5</sup>C-modified RNA oligomers as binding substrates. The quantification of these pull-down experiments (n=2) have now been added, and are revealing that (according to SFigure 1 E and G) YBX1 enriches an RNA containing a single m<sup>5</sup>C by a factor of 1.3 over its unmodified counterpart, while ALYREF enriches by a factor of 4x. This is an acceptable approach for educated readers to question the specificity of the reader proteins, even though the quantification should be performed differently (see below).

      Given that there is no specific sequence motif embedding those cytosines identified in the vicinity of the DRAM-edits (Figure 3J and K), even though it has been accepted by now that most of the m<sup>5</sup>C sites in mRNA are mediated by NSUN2 and NSUN6 proteins, which target tRNA like substrate structures with a particular sequence enrichment, one can conclude that DRAM-Seq is uncovering a huge number of false positives. This must be so not only because of the RNA bisulfite seq data that have been extensively studied by others, but also by the following calculations: Given that the m<sup>5</sup>C/C ratio in human mRNA is 0.02-0.09% (measured by mass spec) and assuming that 1/4 of the nucleotides in an average mRNA are cytosines, an mRNA of 1.000 nucleotides would contain 250 Cs. 0.02- 0.09% m<sup>5</sup>C/C would then translate into 0.05-0.225 methylated cytosines per 250 Cs in a 1000 nt mRNA. YBX1 would bind every C in such an mRNA since there is no m<sup>5</sup>C to be expected, which it could bind with 1.3 higher affinity. Even if the mRNAs would be 10.000 nt long, YBX1 would bind to half a methylated cytosine or 2.25 methylated cytosines with 1.3x higher affinity than to all the remaining cytosines (2499.5 to 2497.75 of 2.500 cytosines in 10.000 nt, respectively). These numbers indicate a 4999x to 1110x excess of cytosine over m<sup>5</sup>C in any substrate RNA, which the "reader" can bind as shown in the RNA pull-downs on unmodified RNAs. This reviewer spares the reader of this review the calculations for ALYREF specificity, which is slightly higher than YBX1. Hence, it is up to the capable reader of these calculations to follow the claim that this minor affinity difference allows the unambiguous detection of the few m<sup>5</sup>C sites in mRNA be it in the endogenous scenario of a cell or as fusion-protein with a base editor attached? 

      We sincerely appreciate the reviewer’s rigorous analysis. We would like to clarify that in our RNA pulldown assays, we indeed utilized the full DRAM system (reader protein fused to the base editor) to reflect the specificity of m<sup>5</sup>C recognition. As previously suggested by the reviewer, to independently validate the m<sup>5</sup>C-binding specificity of ALYREF and YBX1, we performed separate pulldown experiments with wild-type and mutant reader proteins (without the base editor fusion) using both unmodified and m<sup>5</sup>C-modified RNA substrates. This approach aligns with established methodologies in the field (doi:10.1038/cr.2017.55 and doi: 10.1038/s41556-019-0361-y). We have revised the Methods section (line 230) to explicitly describe this experimental design.

      Although the m<sup>5</sup>C/C ratios in LC/MS-assayed mRNA are relatively low (ranging from 0.02% to 0.09%), as noted by the reviewer, both our data and previous studies have demonstrated that ALYREF and YBX1 preferentially bind to m<sup>5</sup>C-modified RNAs over unmodified RNAs, exhibiting 4-fold and 1.3-fold enrichment, respectively (Supplementary Figure 1E–1G). Importantly, this specificity is further enhanced in the DRAM system through two key mechanisms: first, the fusion of reader proteins to the deaminase restricts editing to regions near m<sup>5</sup>C sites, thereby minimizing off-target effects; second, background editing observed in reader-mutant or deaminase controls (e.g., DRAM<sup>mut</sup>-CBE in Figure 2D) is systematically corrected for during data analysis.

      We agree that the theoretical challenge posed by the vast excess of unmodified cytosines. However, our approach includes stringent controls to alleviate this issue. Specifically, sites identified in NSUN2/NSUN6 knockout cells or reader-mutant controls are excluded (Figure 3F), which significantly reduces the number of false-positive detections. Additionally, we have observed deamination changes near high-confidence m<sup>5</sup>C methylation sites detected by RNA bisulfite sequencing, both in first-generation and high-throughput sequencing data. This observation further substantiates the validity of DRAM-Seq in accurately identifying m<sup>5</sup>C sites.

      We fully acknowledge that residual false positives may persist due to the inherent limitations of reader protein specificity, as discussed in line 299-301 of our manuscript. To address this, we plan to optimize reader domains with enhanced m<sup>5</sup>C binding (e.g., through structure-guided engineering), which is also previously implemented in the discussion of the manuscript.

      The reviewer supports the attempt to visualize the data. However, the usefulness of this Figure addition as a readable presentation of the data included in the supplement is up to debate.

      Thank you for your kind suggestion. We understand the reviewer's concern regarding data visualization. However, due to the large volume of DRAM-seq data, it is challenging to present each mutation site and its characteristics clearly in a single figure. Therefore, we chose to categorize the data by chromosome, which not only allows for a more organized presentation of the DRAM-seq data but also facilitates comparison with other database entries. Additionally, we have updated Supplementary Tables 2 and 3 to provide comprehensive information on the mutation sites. We hope that both the reviewer and editors will understand this approach. We will, of course, continue to carefully consider the reviewer's suggestions and explore better ways to present these results in the future.

      (3) A set of private Recommendations for the Authors that outline how you think the science and its presentation could be strengthened

      NEW COMMENTS to TEXT:

      Abstract:

      "5-Methylcytosine (m<sup>5</sup>C) is one of the major post-transcriptional modifications in mRNA and is highly involved in the pathogenesis of various diseases."

      In light of the increasing use of AI-based writing, and the proof that neither DeepSeek nor ChatGPT write truthfully statements if they collect metadata from scientific abstracts, this sentence is utterly misleading.

      m<sup>5</sup>C is not one of the major post-transcriptional modifications in mRNA as it is only present with a m<sup>5</sup>C/C ratio of 0.02- 0.09% as measured by mass-spec. Also, if m<sup>5</sup>C is involved in the pathogenesis of various diseases, it is not through mRNA but tRNA. No single published work has shown that a single m<sup>5</sup>C on an mRNA has anything to do with disease. Every conclusion that is perpetuated by copying the false statements given in the many reviews on the subject is based on knock-out phenotypes of the involved writer proteins. This reviewer wishes that the authors would abstain from the common practice that is currently flooding any scientific field through relentless repetitions in the increasing volume of literature which perpetuate alternative facts.

      We sincerely appreciate the reviewer’s insightful comments. While we acknowledge that m<sup>5</sup>C is not the most abundant post-transcriptional modification in mRNA, we believe that research into m<sup>5</sup>C modification holds considerable value. Numerous studies have highlighted its role in regulating gene expression and its potential contribution to disease progression. For example, recent publications have demonstrated that m<sup>5</sup>C modifications in mRNA can influence cancer progression, lipid metabolism, and other pathological processes (e.g., PMID: 37845385; 39013911; 39924557; 38042059; 37870216).

      We fully agree with the reviewer on the importance of maintaining scientific rigor in academic writing. While m<sup>5</sup>C is not the most abundant RNA modification, we cannot simply draw a conclusion that the level of modification should be the sole criterion for assessing its biological significance. However, to avoid potential confusion, we have removed the word “major”.

      COMMENTS ON FIGURE PRESENTATION:

      Figure 2D:

      The main text states: "DRAM-CBE induced C to U editing in the vicinity of the m<sup>5</sup>C site in AP5Z1 mRNA, with 13.6% C-to-U editing, while this effect was significantly reduced with APOBEC1 or DRAM<sup>mut</sup>-CBE (Fig.2D)." The Figure does not fit this statement. The seq trace shows a U signal of about 1/3 of that of C (about 30%), while the quantification shows 20+ percent

      Thank you for your kind suggestion. Upon visual evaluation, the sequencing trace in the figure appears to suggest a mutation rate closer to 30% rather than 22%. However, relying solely on the visual interpretation of sequencing peaks is not a rigorous approach. The trace on the left represents the visualization of Sanger sequencing results using SnapGene, while the quantification on the right is derived from EditR 1.0.10 software analysis of three independent biological replicates. The C-to-U mutation rates calculated were 22.91667%, 23.23232%, and 21.05263%, respectively. To further validate this, we have included the original EditR analysis of the Sanger sequencing results for the DRAM-CBE group used in the left panel of Figure 2D (see Author response image 2). This analysis confirms an m<sup>5</sup>C fraction (%) of 22/(22+74) = 22.91667, and the sequencing trace aligns well with the mutation rate we reported in Figure 2D. In conclusion, the data and conclusions presented in Figure 2D are consistent and supported by the quantitative analysis.

      Author response image 2.

      Figure 4B: shows now different numbers in Venn-diagrams than in the same depiction, formerly Figure 4A

      We sincerely thank the reviewer for pointing out this issue, and we apologize for not clearly indicating the changes in the previous version of the manuscript. In response to the initial round of reviewer comments, we implemented a more stringent data filtering process (as described in Figure 3F and method section) : "For high-confidence filtering, we further adjusted the parameters of Find_edit_site.pl to include an edit ratio of 10%–60%, a requirement that the edit ratio in control samples be at least 2-fold higher than in NSUN2 or NSUN6knockout samples, and at least 4 editing events at a given site." As a result, we made minor adjustments to the Venn diagram data in Figure 4A, reducing the total number of DRAM-edited mRNAs from 11,977 to 10,835. These changes were consistently applied throughout the manuscript, and the modifications have been highlighted for clarity. Importantly, these adjustments do not affect any of the conclusions presented in the manuscript.

      Figure 4B and D: while the overlap of the DRAM-Seq data with RNA bisulfite data might be 80% or 92%, it is obvious that the remaining data DRAM seq suggests a detection of additional sites of around 97% or 81.83%. It would be advised to mention this large number of additional sites as potential false positives, unless these data were normalized to the sites that can be allocated to NSUN2 and NSUN6 activity (NSUN mutant data sets could be substracted).

      Thank you for pointing this out. The Venn diagrams presented in Figure 4B and D already reflect the exclusion of potential false-positive sites identified in methyltransferasedeficient datasets, as described in our experimental filtering process, and they represent the remaining sites after this stringent filtering. However, we acknowledge that YBX1 and ALYREF, while preferentially binding to m<sup>5</sup>C-modified RNA, also exhibit some affinity for unmodified RNA. Although we employed rigorous controls, including DRAM<sup>mut</sup> and deaminase groups, to minimize false positives, the possibility of residual false positives cannot be entirely ruled out. Addressing this limitation would require even more stringent filtering methods, as discussed in lines 299–301 of the manuscript. We are committed to further optimizing the DRAM system to enhance the accuracy of transcriptome-wide m<sup>5</sup>C analysis in future studies.

      SFigure 1: It is clear that the wild type version of both reader proteins are robustly binding to RNA that does not contain m<sup>5</sup>C. As for the calculations of x-fold affinity loss of RNA binding using both ALYREF -mut or YBX1 -mut, this reviewer asks the authors to determine how much less the mutated versions of the proteins bind to a m<sup>5</sup>C-modified RNAs. Hence, a comparison of YBX1 versus YBX1 -mut (ALYREF versus ALYREF -mut) on the same substrate RNA with the same m<sup>5</sup>C-modified position would allow determining the contribution of the so-called modification binding pocket in the respective proteins to their RNA binding. The way the authors chose to show the data presently is misleading because what is compared is the binding of either the wild type or the mutant protein to different RNAs.

      We appreciate the reviewer’s valuable feedback and apologize for any confusion caused by the presentation of our data. We would like to clarify the rationale behind our approach. The decision to present the wild-type and mutant reader proteins in separate panels, rather than together, was made in response to comments from Reviewer 2. Below, we provide a detailed explanation of our experimental design and its justification.

      First, we confirmed that YBX1 and ALYREF exhibit stronger binding affinity to m<sup>5</sup>Cmodified RNA compared to unmodified RNA, establishing their role as m<sup>5</sup>C reader proteins. Next, to validate the functional significance of the DRAM<sup>mut</sup> group, we demonstrated that mutating key amino acids in the m<sup>5</sup>C-binding pocket significantly reduces the binding affinity of YBX1<sup>mut</sup> and ALYREF<sup>mut</sup> to m<sup>5</sup>C-modified RNA. This confirms that the DRAM<sup>mut</sup> group effectively minimizes false-positive results by disrupting specific m<sup>5</sup>C interactions.

      Crucially, in our pull-down experiments, both the wild-type and mutant proteins (YBX1/YBX1<sup>mut</sup> and ALYREF/ALYREF<sup>mut</sup>) were incubated with the same RNA sequences. To avoid any ambiguity, we have included the specific RNA sequence information in the Methods section (lines 463–468). This ensures a assessment of the reduced binding affinity of the mutant versions relative to the wild-type proteins, even though they are presented in separate panels.

      We hope this explanation clarifies our approach and demonstrates the robustness of our findings. We sincerely appreciate the reviewer’s understanding and hope this addresses their concerns.

      SFigure 2C: first two panels are duplicates of the same image.

      Thank you for pointing this out. We sincerely apologize for incorrectly duplicating the images. We have now updated Supplementary Figure 2C with the correct panels and have provided the original flow cytometry data for the first two images. It is important to note that, as demonstrated by the original data analysis, the EGFP-positive quantification values (59.78% and 59.74%) remain accurate. Therefore, this correction does not affect the conclusions of our study. Thank you again for bringing this to our attention.

      Author response image 3.

      SFigure 4B: how would the PCR product for NSUN6 be indicative of a mutation? The used primers seem to amplify the wildtype sequence.

      Thank you for your kind suggestion. In our NSUN6<sup>-/-</sup> cell line, the NSUN6 gene is only missing a single base pair (1bp) compared to the wildtype, which results in frame shift mutation and reduction in NSUN6 protein expression. We fully agree with the reviewer that the current PCR gel electrophoresis does not provide a clear distinction of this 1bp mutation. To better illustrate our experimental design, we have included a schematic representation of the knockout sequence in SFigure 4B. Additionally, we have provided the original sequencing data, and the corresponding details have been added to lines 151-153 of the manuscript for further clarification.

      Author response image 4.

      SFigure 4C: the Figure legend is insufficient to understand the subfigure.

      Thank you for your valuable suggestion. To improve clarity, we have revised the figure legend for SFigure 4C, as well as the corresponding text in lines 178-179. We have additionally updated the title of SFigure 4 for better clarity. The updated SFigure 4C now demonstrates that the DRAM-edited mRNAs exhibit a high degree of overlap across the three biological replicates.

      SFigure 4D: the Figure legend is insufficient to understand the subfigure.

      Thank you for your kind suggestion. We have revised the figure legend to provide a clearer explanation of the subfigure. Specifically, this figure illustrates the motif analysis derived from sequences spanning 10 nucleotides upstream and downstream of DRAMedited sites mediated by loci associated with NSUN2 or NSUN6. To enhance clarity, we have also rephrased the relevant results section (lines 169-175) and the corresponding discussion (lines 304-307).

      SFigure 7: There is something off with all 6 panels. This reviewer can find data points in each panel that do not show up on the other two panels even though this is a pairwise comparison of three data sets (file was sent to the Editor) Available at https://elife-rp.msubmit.net/elife-rp_files/2025/01/22/00130809/02/130809_2_attach_27_15153.pdf

      Response: We thank the reviewer for pointing this out. We would like to clarify the methodology behind this analysis. In this study, we conducted pairwise comparisons of the number of DRAM-edited sites per gene across three biological replicates of DRAM-ABE or DRAM-CBE, visualized as scatterplots. Each data point in the plots corresponds to a gene, and while the same gene is represented in all three panels, its position may vary vertically or horizontally across the panels. This variation arises because the number of mutation sites typically differs between replicates, making it unlikely for a data point to occupy the exact same position in all panels. A similar analytical approach has been used in previous studies on m6A (PMID: 31548708). To address the reviewer’s concern, we have annotated the corresponding positions of the questioned data points with arrows in Author response image 5.

      Author response image 5.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      By identifying a loss of function mutant of IQCH in infertile patient, Ruan et al. shows that IQCH is essential for spermiogenesis by generating a knockout mouse model of IQCH. Similar to infertile patient with mutant of IQCH, Iqch knockout mice are characterized by a cracked flagellar axoneme and abnormal mitochondrial structure. Mechanistically, IQCH regulates the expression of RNA-binding proteins (especially HNRPAB), which are indispensable for spermatogenesis.

      Although this manuscript contains a potentially interesting piece of work that delineates a mechanism of IQCH that associates with spermatogenesis, this reviewer feels that a number of issues require clarification and re-evaluation for a better understanding of the role of IQCH in spermatogenesis.

      Line 251 - 253, "To elucidate the molecular mechanism by which IQCH regulates male fertility, we performed liquid chromatography tandem mass spectrometry (LC‒MS/MS) analysis using mouse sperm lysates and detected 288 interactors of IQCH (Figure 5-source data 1)."

      The reviewer had already raised significant concerns regarding the text above, noting that "LC‒MS/MS analysis using mouse sperm lysates" would not identify interactors of IQCH. However, this issue was not addressed in the revised manuscript. In the Methods section detailing LC-MS/MS, the authors stated that it was conducted on "eluates obtained from IP". However, there was no explanation provided on how IP for LC-MS/MS was performed. Additionally, it was unclear whether LC-MS or LC-MS/MS was utilized. The primary concern is that if LC‒MS/MS was conducted for the IP of IQCH, IQCH itself should have been detected in the results; however, as indicated by Figure 5-source data 1, IQCH was not listed.

      Thanks to reviewer’s comments. Additional details regarding the IP protocol for LC-MS/MS analysis have been included in the methods section in the revised manuscript. Furthermore, we apologize for the previous inconsistencies in the terminology used for LC-MS/MS and have now ensured its consistent usage throughout the document. Regarding the primary concern about the absence of IQCH in Figure 5-source data 1, our study only showed identifying proteins that interact with IQCH, not IQCH itself. Additionally, we conducted co-IP experiments to validate the interactions identified by LC-MS/MS analysis. Actually, we identified the IQCH itself by LC-MS/MS analysis (Author response table 1).

      Author response table 1.

      Results of the LC-MS/MS analysis.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The authors should know what experiments have been done for the studies.

      We apologize for our oversights. The method for RNA-binding protein immunoprecipitation (RIP) has been detailed in the revised manuscript.

      Typos still remain in the text, e.g., line 253, "Fiugre".

      We are sorry for the spelling errors. We have engaged professional editing services to refine our manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Review:

      This manuscript by Yue et al. aims to understand the molecular mechanisms underlying the better reproductive outcomes of Tibetans at high altitude by characterizing the transcriptome and histology of full-term placenta of Tibetans and compare them to those Han Chinese at high elevations.

      The approach is innovative, and the data collected are valuable for testing hypotheses regarding the contribution of the placenta to better reproductive success of populations that adapted to hypoxia. The authors identified hundreds of differentially expressed genes (DEGs) between Tibetans and Han, including the EPAS1 gene that harbors the strongest signals of genetic adaptation. The authors also found that such differential expression is more prevalent and pronounced in the placentas of male fetuses than those of female fetuses, which is particularly interesting, as it echoes with the more severe reduction in birth weight of male neonates at high elevation observed by the same group of researchers (He et al., 2022).

      This revised manuscript addressed several concerns raised by reviewers in last round. However, we still find the evidence for natural selection on the identified DEGs--as a group--to be very weak, despite more convincing evidence on a few individual genes, such as EPAS1 and EGLN1.

      The authors first examined the overlap between DEGs and genes showing signals of positive selection in Tibetans and evaluated the significance of a larger overlap than expected with a permutation analysis. A minor issue related to this analysis is that the p-value is inflated, as the authors are counting permutation replicates with MORE genes in overlap than observed, yet the more appropriate way is counting replicates with EQUAL or MORE overlapping genes. Using the latter method of p-value calculation, the "sex-combined" and "female-only" DEGs will become non-significantly enriched in genes with evidence of selection, and the signal appears to solely come from male-specific DEGs. A thornier issue with this type of enrichment analysis is whether the condition on placental expression is sufficient, as other genomic or transcriptomic features (e.g., expression level, local sequence divergence level) may also confound the analysis.

      According to the suggested methods, we counted the replicates with equal or more overlapping genes than observed (≥4 for the “combined” set; ≥9 for the “male-only” set; ≥0 for the “female-only” set). We found that the overlaps between DEGs and TSNGs were significantly enriched only in the “male-only” set (p-value < 1e-4, counting 0 time from 10,000 permutations), but not in the “female-only” set (p-value = 1, counting 10,000 time from 10,000 permutations), or “combined” set (p-value = 0.0603, counting 603 time from 10,000 permutations) (see Table R1 below).

      We updated this information in the revised manuscript, including Results, Methods, and Figure S9.

      Author response table 1.

      Permutation analysis of the overlapped genes between DEGs and TSNGs.

      The authors next aimed to detect polygenic signals of adaptation of gene expression by applying the PolyGraph method to eQTLs of genes expressed in the placenta (Racimo et al 2018). This approach is ambitious but problematic, as the method is designed for testing evidence of selection on single polygenic traits. The expression levels of different genes should be considered as "different traits" with differential impacts on downstream phenotypic traits (such as birth weight). As a result, the eQTLs of different genes cannot be naively aggregated in the calculation of the polygenic score, unless the authors have a specific, oversimplified hypothesis that the expression increase of all genes with identified eQTL will improve pregnancy outcome and that they are equally important to downstream phenotypes. In general, PolyGraph method is inapplicable to eQTL data, especially those of different genes (but see Colbran et al 2023 Genetics for an example where the polygenic score is used for testing selection on the expression of individual genes).

      We would recommend removal of these analyses and focus on the discussion of individual genes with more compelling evidence of selection (e.g., EPAS1, EGLN1).

      According to the suggestion, we removed these analyses in the revised manuscript.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this paper, the authors developed an image analysis pipeline to automacally idenfy individual neurons within a populaon of fluorescently tagged neurons. This applicaon is opmized to deal with mul-cell analysis and builds on a previous soware version, developed by the same team, to resolve individual neurons from whole-brain imaging stacks. Using advanced stascal approaches and several heuriscs tailored for C. elegans anatomy, the method successfully idenfies individual neurons with a fairly high accuracy. Thus, while specific to C. elegans, this method can become instrumental for a variety of research direcons such as in-vivo single-cell gene expression analysis and calcium-based neural acvity studies.

      Thank you.

      Reviewer #2 (Public Review):

      The authors succeed in generalizing the pre-alignment procedure for their cell idenficaon method to allow it to work effecvely on data with only small subsets of cells labeled. They convincingly show that their extension accurately idenfies head angle, based on finding auto florescent ssue and looking for a symmetric l/r axis. They demonstrate method works to allow the idenficaon of a parcular subset of neurons. Their approach should be a useful one for researchers wishing to idenfy subsets of head neurons in C. elegans, and the ideas might be useful elsewhere.

      The authors also assess the relave usefulness of several atlases for making identy predicons. They atempt to give some addional general insights on what makes a good atlas, but here insights seem less clear as available data does not allow for experiments that cleanly decouple: 1. the number of examples in the atlas 2. the completeness of the atlas. and 3. the match in strain and imaging modality discussed. In the presented experiments the custom atlas, besides the strain and imaging modality mismatches discussed is also the only complete atlas with more than one example. The neuroPAL atlas, is an imperfect stand in, since a significant fracon of cells could not be idenfied in these data sets, making it a 60/40 mix of Openworm and a hypothecal perfect neuroPAL comparison. This waters down general insights since it is unclear if the performance is driven by strain/imaging modality or these difficules creang a complete neuroPal atlas. The experiments do usefully explore the volume of data needed. Though generalizaon remains to be shown the insight is useful for future atlas building that for the specific (small) set of cells labeled in the experiments 5-10 examples is sufficient to build a accurate atlas.

      The reviewer brings up an interesting point. As the reviewer noted, given the imperfection of the datasets (ours and others’), it is possible that artifacts from incomplete atlases can interfere with the assessment of the performances of different atlases. To address this, as the reviewer suggested, we have searched the literature and found two sets of data that give specific coordinates of identified neurons (both using NeuroPAL). We compared the performance of the atlases derived from these datasets to the strain-specific atlases, and the original conclusion stands. Details are now included in the revised manuscript (Figure 3- figure supplement 2).

      Recommendaons for the authors:

      Reviewer #1 (Recommendaons For The Authors):

      I appreciate the new mosaic analysis (Fig. 3 -figure suppl 2). Please fix the y-axis ck label that I believe should be 0.8 (instead of 0.9).

      We thank the reviewer for spotting the typo. We have fixed the error.

      **Reviewer #2 (Recommendaons For The Authors):

      Though I'm not familiar with the exact quality of GT labels in available neuroPAL data I know increasing volumes of published data is available. Comparison with a complete neuroPAL atlas, and a similar assessment on atlas size as made with the custom atlas would to my mind qualitavely increase the general insights on atlas construcon.

      We thank the reviewer for the insightful suggestion. We have newly constructed several other NeuroPAL atlases by incorporating neuron positional data from two other published data: [Yemini E. et al. NeuroPAL: A Multicolor Atlas for Whole-Brain Neuronal Identification in C. elegans. Cell. 2021 Jan 7;184(1):272-288.e11] and [Skuhersky, M. et al. Toward a more accurate 3D atlas of C. elegans neurons. BMC Bioinformatics 23, 195 (2022)].

      Interestingly, we found that the two new atlases (NP-Yemini and NP-Skuhersky) have significantly different values of PA, LR, DV, and angle relationships for certain cells compared to the OpenWorm and glr-1 atlases. For example, in both the NP atlases, SMDD is labeled as being anterior to AIB, which is the opposite of the SMDD-AIB relationship in the glr-1 atlas.

      Because this relationship (and other similar cases) were missing in our original NeuroPAL atlas (NP-Chaudhary), the addition of these two NeuroPAL datasets to our NeuroPAL atlas dramatically changed the atlas. As a result, incorporating the published data sets into the NeuroPAL atlas (NP-all) actually decreased the average prediction accuracy to 44%, while the average accuracy of original NeuroPAL atlas (NP-Chaudhary) was 57%. The atlas based on the Yemini et al. data alone (NP-Yemini) had 43% accuracy, and the atlas based on the Skuhersky et al. data alone (NP-Skuhersky) had 38% accuracy.

      For the rest of our analysis, we focused on comparing the NeuroPAL atlas that resulted in the highest accuracy against other atlases in figure 3 (NP-Chaudhary). Therefore, we have added Figure 3- figure supplement 2 and the following sentence in the discussion. “Several other NeuroPAL atlases from different data sources were considered, and the atlas that resulted in the highest neuron ID correspondence was selected (Figure 3- figure supplement 2).”

      Author response image 1.

      Figure3- figure supplement 2. Comparison of neuron ID correspondences resulng from addional atlases- atlases driven from NeuroPAL neuron posional data from mulple sources (Chaudhary et al., Yemini et al., and Skuhersky et al.) in red compared to other atlases in Figure 3. Two sample t-tests were performed for stascal analysis. The asterisk symbol denotes a significance level of p<0.05, and n.s. denotes no significance. OW: atlas driven by data from OpenWorm project, NP-source: NeuroPAL atlas driven by data from the source. NP-Chaudhary atlas corresponds to NeuroPAL atlas in Figure 3.

      80% agreement among manual idenficaons seems low to me for a relavely small, (mostly) known set of cells, which seems to cast into doubt ground truth idenes based on a best 2 out of 3 vote. The authors menon 3% of cell idenes had total disagreement and were excluded, what were the fracon unanimous and 2/3? Are there any further insights about what limited human performance in the context of this parcular idenficaon task?

      We closely looked into the manual annotation data. The fraction of cells in unanimous, two thirds, and no agreement are approximately 74%, 20%, and 6%, respectively. We made the corresponding change in the manuscript from 3% to 6%. Indeed, we identified certain patterns in labels that were more likely to be disagreed upon. First, cells in close proximity to each other, such as AVE and RMD, were often switched from annotator to annotator. Second, cells in the posterior part of the cluster, such as RIM, AVD, AVB, were more variable in positions, so their identities were not clear at times. Third, annotators were more likely to disagree on cells whose expressions are rare and low, and these include AIB, AVJ, and M1. These observations agree with our results in figure 4c.

    2. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This research advance arctile describes a valuable image analysis method to identify individual cells (neurons) within a population of fluorescently labeled cells in the nematode C. elegans. The findings are solid and the method succeeds to identify cells with high precision. The method will be valuable to the C. elegans research community.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this paper, the authors developed an image analysis pipeline to automatically identify individual neurons within a population of fluorescently tagged neurons. This application is optimized to deal with multi-cell analysis and builds on a previous software version, developed by the same team, to resolve individual neurons from whole-brain imaging stacks. Using advanced statistical approaches and several heuristics tailored for C. elegans anatomy, the method successfully identifies individual neurons with a fairly high accuracy. Thus, while specific to C. elegans, this method can become instrumental for a variety of research directions such as in-vivo single-cell gene expression analysis and calcium-based neural activity studies.

      The analysis procedure depends on the availability of an accurate atlas that serves as a reference map for neural positions. Thus, when imaging a new reporter line without fair prior knowledge of the tagged cells, such an atlas may be very difficult to construct. Moreover, usage of available reference atlases, constructed based on other databases, is not very helpful (as shown by the authors in Fig 3), so for each new reporter line a de-novo atlas needs to be constructed.

      We thank the reviewer for pointing out a place where we can use some clarification. While in principle that every new reporter line would need fair prior knowledge, atlases are either already available or not difficult to construct. If one can make the assumption that the anatomy of a particular line is similar to existing atlases (Yemini 2021,Nejatbakhsh 2023,Toyoshima 2020), the cell ID can be immediately performed. Even in the case that one suspects the anatomy may have changes from existing atlases (e.g. in the case of examining mutants), existing atlases can serve as a starting point to provide a draft ID, which facilitates manual annotation. Once manual annotations on ~5 animals are available as we have shown in this work (which is a manageable number in practice), this new dataset can be used to build an updated atlas that can be used for future inferences. We have added this discussion in the manuscript: “If one determines that the anatomy of a particular animal strain is substantially different from existing atlases, new atlases can be easily constructed using existing atlases as starting points.” (page 18).

      I have a few comments that may help to better understand the potential of the tool to become handy.

      1. I wonder the degree by which strain mosaicism affects the analysis (Figs 1-4) as it was performed on a non-integrated reporter strain. As stated, for constructing the reference atlas, the authors used worms in which they could identify the complete set of tagged neurons. But how senstiive is the analysis when assaying worms with different levels of mosaicism? Are the results shown in the paper stem from animals with a full neural set expression? Could the authors add results for which the assayed worms show partial expression where only 80%, 70%, 50% of the cells population are observed, and how this will affect idenfication accuracy? This may be important as many non-integrated reporter lines show high mosaic patterns and may therefore not be suitable for using this analytic method. In that sense, could the authors describe the mosaic degree of their line used for validating the method.

      We appreciate the reviewer for this comment. We want to clarify that most of the worms used in the construction of the atlas are indeed affected by mosaicism and thus do not express the full set of candidate neurons. We have added such a plot as requested (Figure 3 – figure supplement 2, copied below). Our data show that there is no correlation between the fraction of cells expressed in a worm and neuron ID correspondence. We agree with the reviewer this additional insight may be helpful; we have modified the text to include this discussion: “Note that we observed no correlation between the degree of mosaicism and neuron ID correspondence (Figure 3- figure supplement 2).” (page 10).

      Author response image 1.

      No correlation between the degree of mosaicism (fraction of cells expressed in the worm) and neuron ID correspondence.

      1. For the gene expression analysis (Fig 5), where was the intensity of the GFP extracted from? As it has no nuclear tag, the protein should be cytoplasmic (as seen in Fig 5a), but in Fig 5c it is shown as if the region of interest to extract fluorescence was nuclear. If fluorescence was indeed extracted from the cytoplasm, then it will be helpful to include in the software and in the results description how this was done, as a huge hurdle in dissecting such multi-cell images is avoiding crossreads between adjacent/intersecting neurons.

      For this work, we used nuclear-localized RFP co-expressed in the animal, and the GFP intensities were extracted from the same region RFP intensities were extracted. If cytosolic reporters are used, one would imagine a membrane label would be necessary to discern the border of the cells. We clarified our reagents and approach in the text: “The segmentation was done on the nuclear-localized mCherry signals, and GFP intensities were extracted from the same region.” (page21).

      1. In the same mater: In the methods, it is specified that the strain expressing GCAMP was also used in the gene expression analysis shown in Figure 5. But the calcium indicator may show transient intensities depending on spontaneous neural activity during the imaging. This will introduce a significant variability that may affect the expression correlation analysis as depicted in Figure 5.

      We apologize for the error in text. The strain used in the gene expression analysis did not express GCaMP. We did not analyze GCaMP expression in figure 5. We have corrected the error in the methods.

      Reviewer #2 (Public Review):

      The authors succeed in generalizing the pre-alignment procedure for their cell idenfication method to allow it to work effectively on data with only small subsets of cells labeled. They convincingly show that their extension accurately identifies head angle, based on finding auto fluorescent tissue and looking for a symmetric l/r axis. They demonstrate that the method works to identify known subsets of neurons with varying accuracy depending on the nature of underlying atlas data. Their approach should be a useful one for researchers wishing to identify subsets of head neurons in C. elegans, for example in whole brain recording, and the ideas might be useful elsewhere.

      The authors also strive to give some general insights on what makes a good atlas. It is interesting and valuable to see (at least for this specific set of neurons) that 5-10 ideal examples are sufficient. However, some critical details would help in understanding how far their insights generalize. I believe the set of neurons in each atlas version are matched to the known set of cells in the sparse neuronal marker, however this critical detail isn't explicitly stated anywhere I can see.

      This is an important point. We have made text modifications to make it clear to the readers that for all atlases, the number of entities (candidate list) was kept consistent as listed in the methods. In the results section under “CRF_ID 2.0 for automatic cell annotation in multi-cell images,” we added the following sentence: “Note that a truncated candidate list can be used for subse-tspecific cell ID if the neuronal expression is known” (page 3). In the methods section, we added the following sentence: “For multi-cell neuron predictions on the glr-1 strain, a truncated atlas containing only the above 37 neurons was used to exclude neuron candidates that are irrelevant for prediction” (Page 20).

      In addition, it is stated that some neuron positions are missing in the neuropal data and replaced with the (single) position available from the open worm atlas. It should be stated how many neurons are missing and replaced in this way (providing weaker information).

      We modified the text in the result section as follows: “Eight out of 37 candidate neurons are missing in the neuroPAL atlas, which means 40% of the pairwise relationships of neurons expressing the glr-1p::NLS-mcherry transgene were not augmented with the NeuroPAL data but were assigned the default values from the OpenWorm atlas” (page 10).

      It also is not explicitly stated that the putative identities for the uncertain cells (designated with Greek letters) are used to sample the neuropal data. Large numbers of openworm single positions or if uncertain cells are misidentified forcing alignment against the positions of nearby but different cells would both handicap the neuropal atlas relative to the matched florescence atlas. This is an important question since sufficient performance from an ideal neuropal atlas (subsampled) would avoid the need for building custom atlases per strain.

      The putative identities are not used to sample the NeuroPAL data. They were used in the glr-1 multi-cell case to indicate low confidence in manual identification/annotation. For all steps of manual annotation and CRF_ID predictions, we used real neuron labels, and the Greek labels were used for reporting purposes only. It is true that the OpenWorm values (40% of the atlas) would be a handicap for the neuroPAL atlas. This is mainly due to the difficulty of obtaining NeuroPAL data as it requires 3-color fluorescence microscopy and significant time and labor to annotate the large set of neurons. This is one reason to take a complementary approach as we do in this paper.

      Reviewer #1 (Recommendations For The Authors):

      1. Figure 3, there is a confusion in the legend relating to panels c-e (e.g. panel c is neuron ID accuracy but it is described per panel e in the legend.

      We made the necessary changes.

      1. Figure 3, were statistical tests performed for panels d-e? if so, and the outcome was not significant, then it might be good to indicate this in the legend.

      We have added results of statistical tests in the legend as the following sentence: “All distributions in panel d and e had a p-value of less than 0.0001 for one sample t-test against zero.” One sample t-tests were performed because what is plotted already represents each atlas’ differences to the glr-1 25 dataset atlas, we didn’t think the statistical analyses between the other atlases would add significant value.

      1. Figure 4, no asterisks are shown in the figure so it is possible to remove the sentence in the legend describing what the asterisk stands for.

      Thank you. We made the necessary changes.

      Reviewer #2 (Recommendations For The Authors):

      Comparison with deep learning approaches could be more nuanced and structured, the authors (prior) approach extended here combines a specific set of comparative relationship measurements with a general optimization approach for matching based on comparative expectations. Other measurements could be used whether explicit (like neighbor expectations) or learned differences in embeddings. These alternate measurements would both need to be extensively re-calibrated for different sets of cells but might provide significant performance gains. In addition deep learning approaches don't solve the optimization part of the matching problem, so the authors approach seems to bring something strong to the table even if one is committed to learned methods (necessary I suspect for human level performance in denser cell sets than the relatively small number here). A more complete discussion of these themes might better frame the impact of the work and help readers think about the advantages and disadvantages or different methods for their own data.

      We thank the reviewer for bringing up this point. We apologize perhaps not making the point clearer in the original submission. This extension of the original work (Chaudhary et al) is not changing the CRF-based framework, but only augmenting the approach with a better defined set of axes (solely because in multicell and not whole-brain datasets, the sparsity of neurons degrades the axis definition and consequently the neuron ID predictions). We are not fundamentally changing the framework, and therefore all the advantages (over registration-based approaches for example) also apply here. The other purpose of this paper is to demonstrate a couple of use-cases for gene expression analysis, which is common in studies in C. elegans (and other organisms). We hope that by showing a use-case others can see how this approach is useful for their own applications.

      We have clarified these points in the paper (page 18). “The fundamental framework has not been changed from CRF_ID 1.0, and therefore the advantages of CRF_ID outlined in the original work apply for CRF_ID 2.0 as well.”

      The atribution of anatomical differences to strain is interesting, but seems purely speculative, and somewhat unlikely. I would suspect the fundamentally more difficult nature of aligning N items to M>>N items in an atlas accounts for the differences in using the neuroPAL vs custom atlas here. If this is what is meant, it could be stated more clearly.

      It is important to note that the same neuron candidate list (listed in methods) was used for all atlases, so there is no difference among the atlases in terms of the number of cells in the query vs. candidate list. In other words, the same values for M and for N are used regardless of the reference atlas used.

      We have preliminary data indicating differences between the NeuroPAL and custom atlas. For instance, the NeuroPAL atlas scales smaller than the custom glr-1 atlas. Since direct comparisons of the different atlases are beyond the scope of this paper, we will leave the exact comparisons for future work. We suspect that the differences are from a combination of differences in anatomy and imaging conditions. While NeuroPAL atlas may not be exactly fitting for the custom dataset, it can serve as a good starting point for guesses when no custom atlases are available, as we have discussed earlier (response to Public Comments from Reviewer 1 Point 1). As explained earlier, we have added these discussions in the paper (see page 18).

      I was also left wondering if the random removal of landmarks had to be adjusted in this work given it is (potentially) helping cope with not just occasional weak cells but the systematic loss of most of the cells in the atlas. If the parameters of this part of the algorithm don't influence the success for N to M>>N alignment (here when the neuroPAL or OpenWorm atlas is used) this seems interesting in itself and worth discussing. Conversely, if these parameters were opitmized for the matched atlas and used for the others, this would seem to bias performance results.

      We may have failed to make this clear in the main text. As we have stated in our responses in the public review section, we do systematically limit the neuron labels in the candidate list to neurons that are known to be expressed by the promotor. The candidate list, which is kept consistent for all atlases, has more neurons than cells in the query, so it is always an N-to-M matching where M>N. We did not use landmarks, but such usage is possible and will only improve the matching.

      We have attempted to clarify these points in the manuscript. In the results section under “CRF_ID 2.0 for automatic cell annotation in multi-cell images,” we added the following sentence: “Note that a truncated candidate list can be used for subset-specific cell ID if the neuronal expression is known” (page 3). In the methods section, we added the following sentence: “For multi-cell neuron predictions on the glr-1 strain, a truncated atlas containing only the above 37 neurons was used to exclude neuron candidates that are irrelevant for prediction” (Page 20).

    1. Author Response

      The following is the authors’ response to the previous reviews.

      We thank the reviewers for collectively highlighting our study as “interesting and timely” and as making significant advances regarding the functional role of Orai in the activity of central dopaminergic neurons underlying the development of Drosophila flight behaviour. We hope that based on the revisions detailed below the data supporting our findings will be considered complete.

      Reviewer 1:

      • In this revision, the authors have addressed most points using text changes but there is still one important issue that continues to be inadequately addressed. This relates to point 1.

      If Set2 is acting downstream of SOCE, it is not clear to me how STIM1 over expression rescues Set2-dependent downstream responses in flies that do not have Set2. It seems that if STIM1 over-expression, which would presumably enhance SOCE, largely rescues Set2-dependent effector responses in the Set2RNAi flies, then the proposed pathway cannot be true (because if Set2 is downstream of SOCE, it shouldn't matter whether SOCE is boosted in flies that lack Set2). This discrepancy is not explained. Does STIM1 over-expression somehow restore Set2 expression in the Set2RNAi flies?

      Ans: Based on the requirement of Orai-mediated Ca2+ entry for Set2 expression (THD’>OraiE180A neurons, Figure 2C) we had indeed proposed that rescue of flight in Set2RNAi flies by STIMOE is because Set2 expression in Set2RNAi flies is restored by STIMOE. However, we agree that this has not been tested experimentally. Since these data are supportive but not essential to our findings here, we have removed data demonstrating flight rescue of Set2RNAi by STIMOE from Figure 2 – supplement 5 and associated text from the revised manuscript. We plan to investigate the effect of STIMOE on Set2 in the context of Drosophila dopaminergic neurons in the future.

      Reviewer 2:

      The manuscript analyses the functional role of Orai in the excitability of central dopaminergic neurons in Drosophila. The authors answer the previous concerns, but several important issues have not been experimentally tested. Especially, the lack of characterization of SOCE or calcium release from the intracellular calcium stores limits considerably the impact of the study. They comment on a number of technical problems but, taking into account the nature of the study, based on Orai and SOCE, the lack of these experimental data reduces the relevance of the study. Below are some specific comments:

      1. The response to question 1 is unconvincing. The authors do not demonstrate experimentally that STIM over-expression enhances SOCE or how excess SOCE might overcome the loss of SET2.

      Ans: The reason we have not performed experiments in this manuscript to investigate SOCE in STIM overexpression condition is two-fold. Firstly, extensive characterisation of SOCE by STIM overexpression in Drosophila pupal neurons forms part of an earlier publication (Chakraborty and Hasan, Front. Mol. Neurosci, 2017). A graph from Chakraborty and Hasan, 2017 where SOCE was measured in primary cultures of pupal neurons from an IP3R mutant (S224F/G1891S) of Drosophila. Reduced SOCE in IP3R mutant neurons (red trace) was restored by overexpression of STIM (black trace). The green trace is of wild-type neurons with STIM overexpression and the grey trace with STIMRNAi. Similar experiments were performed with Orai+STIM overexpression and the rescue in SOCE was compared with STIM overexpression in pupal neurons of wild type and IP3R mutant S224F/G1891S. See Chakraborty and Hasan, 2017 (Front. Mol. Neurosci. 10:111. doi: 10.3389/fnmol.2017.00111)

      2) Secondly, rescue by STIMOE is supportive but not essential to the findings of this manuscript which relate primarily to the analysis of an Orai-dependent transcriptional feed-back mechanism acting via Trl and Set2 in flight promoting dopaminergic neurons (See Fig 2C where we demonstrate that OraiE180A expression in THD’ neurons brings down Set2 expression).

      We agree that we have not demonstrated how loss of Set2 can be compensated by STIM overexpression. Therefore, we have now removed the supplementary data relating to STIM rescue of Set2RNAi (THD’>Set2RNAi; STIMOE) flight phenotypes since as mentioned above it was supportive but not essential to the main theme of the manuscript. Consistent with this, we have also removed rescue of flight in TrlRNAi by STIMOE (Figure 4C).

      1. The authors do not present a characterization of SOCE in the cells investigated expressing native Orai or the dominant negative OraiE180A mutant yet. They comment on some technical problems for in situ determination or using culture cells but, apparently, in previous studies they have reported some results.

      Ans: We respectfully submit that characterisation of SOCE in cells expressing native Orai and OraiE180A from primary cultures of Drosophila pupal dopaminergic neurons, form part of an earlier publication (Pathak, T., et al., (2015). The Journal of Neuroscience, 35, 13784–13799. https://doi.org/10.1523/jneurosci.1680-15.2015). As mentioned in lines 80-84 the dopaminergic neurons studied here (THD’) are a subset of the dopaminergic neurons studied in the Pathak et al., 2015 publication (TH). As evident in Figure 2 panels B-D expression of OraiE180A in dopaminergic neurons abrogates SOCE.

      In this study we have focused on identifying the molecular mechanism by which OraiE180A expression and concomitant loss of cellular Ca2+ signals (Figure 3B, 3C) affects dopaminergic neuron function. In lines 270-274 (page 10) we have stated the technical reason why Ca2+ measurements made in this study from ex-vivo brain preps measure a composite of ER-Ca2+ release and SOCE. Our observation that the measured Ca2+ response is significantly attenuated in cells expressing OraiE180A leads us to the conclusion that we are indeed measuring an SOCE component in the ex-vivo brain preps. This is also explained in ‘Limitations of the study’.

      1. Concerning the question about the STIM:Orai stoichiometry the authors answer that "We agree that STIM-Orai stoichiometry is essential for SOCE, and propose that the rescue backgrounds possess sufficient WT Orai, which is recruited by the excess STIM to mediate the rescue"; however, again, this is not experimentally tested.

      Ans: To address this point we have now measured relative stoichiometries of STIM and Orai mRNA by qPCR under WT conditions in Drosophila THD’ neurons at 72 hr APF. The observed stoichiometry as per these measurements is STIM:Orai =1.6:1 (~8:5). These data are in relative agreement with the normalised read counts of STIM and Orai in THD’ neurons in the RNAseq performed and described in Fig 1F. The qPCR (A) and RNAseq (B) measures of STIM and Orai are appended below.

      Author response image 1.

      In comparison to the numerous studies investigating structural, biophysical and cellular characterisation of Orai channels in heterologous systems, there are fewer studies which have traced systemic implications of Orai function through multiple tiers of investigation including organismal behaviour. Leveraging the wealth of genetic resources available in Drosophila, we have attempted this here. While we respectfully agree that questions pertaining to the stoichiometries of STIM/Orai proteins are indeed relevant to cellular regulation of SOCE, we submit they may be better suited for investigation in heterologous systems involving cell culture, or with in-vitro systems with purified recombinant proteins, or indeed using computational and modelling approaches. None of these methods fall within the scope of our current investigation which is to understand how by Orai mediated Ca2+ entry regulates developmental maturation of Drosophila flight promoting dopaminergic neurons.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews: 

      Reviewer #1 (Public review): 

      The authors investigated the role of the C. elegans Flower protein, FLWR-1, in synaptic transmission, vesicle recycling, and neuronal excitability. They confirmed that FLWR-1 localizes to synaptic vesicles and the plasma membrane and facilitates synaptic vesicle recycling at neuromuscular junctions. They observed that hyperstimulation results in endosome accumulation in flwr-1 mutant synapses, suggesting that FLWR-1 facilitates the breakdown of endocytic endosomes. Using tissue-specific rescue experiments, the authors showed that expressing FLWR-1 in GABAergic neurons restored the aldicarb-resistant phenotype of flwr-1 mutants to wild-type levels. By contrast, cholinergic neuron expression did not rescue aldicarb sensitivity at all. They also showed that FLWR-1 removal leads to increased Ca<sup>2+</sup> signaling in motor neurons upon photo-stimulation. From these findings, the authors conclude that FLWR-1 helps maintain the balance between excitation and inhibition (E/I) by preferentially regulating GABAergic neuronal excitability in a cell-autonomous manner. 

      Overall, the work presents solid data and interesting findings, however the proposed cell-autonomous model of GABAergic FLWR-1 function may be overly simplified in my opinion. 

      Most of my previous comments have been addressed; however, two issues remain. 

      (1) I appreciate the authors' efforts conducting additional aldicarb sensitivity assays that combine muscle-specific rescue with either cholinergic or GABergic neuron-specific expression of FLWR-1. In the revised manuscript, they conclude, "This did not show any additive effects to the pure neuronal rescues, thus FLWR-1 effects on muscle cell responses to cholinergic agonists must be cellautonomous." However, I find this interpretation confusing for the reasons outlined below. 

      Figure 1 - Figure Supplement 3B shows that muscle-specific FLWR-1 expression in flwr-1 mutants significantly restores aldicarb sensitivity. However, when FLWR-1 is co-expressed in both cholinergic neurons and muscle, the worms behave like flwr-1 mutants and no rescue is observed. Similarly, cholinergic FLWR-1 alone fails to restore aldicarb sensitivity (shown in the previous manuscript).

      This data is still shown in the manuscript, Fig. 3D. We interpreted our finding in the muscle/cholinergic co-rescue experiment as meaning, that FLWR-1 in cholinergic neurons over-compensates, so worms should be resistant, and the rescuing effect of muscle FLWR-1 is therefore cancelled. But it is true, if this were the case, why does the pure cholinergic rescue not show over-compensation? We added a sentence to acknowledge this inconsistency and we added a sentence in the discussion (see also below, comment 1) of reviewer #2).

      These observations indicate a non-cell-autonomous interaction between cholinergic neurons and muscle, rather than a strictly muscle cell-autonomous mechanism. In other words, FLWR-1 expressed in cholinergic neurons appears to negate or block the rescue effect of muscle-expressed FLWR-1. Therefore, FLWR-1 could play a more complex role in coordinating physiology across different tissues. This complexity may affect interpretations of Ca<sup>2+</sup> dynamics and/or functional data, particularly in relation to E/I balance, and thus warrants careful discussion or further investigation. 

      For the Ca<sup>2+</sup> dynamics, we think the effects of flwr-1 are likely very immediate, as the imaging assay relies on a sensor expressed directly in the neurons or muscles under study, and not on indirect phenotypes as muscle contraction and behavior, that depend on an interplay of several cell types influencing each other.

      (2) The revised manuscript includes new GCaMP analyses restricted to synaptic puncta. The authors mention that "we compared Ca<sup>2+</sup> signals in synaptic puncta versus axon shafts, and did not find any differences," concluding that "FLWR-1's impact is local, in synaptic boutons." This is puzzling: the similarity of Ca<sup>2+</sup> signals in synaptic regions and axon shafts seems to indicate a more global effect on Ca<sup>2+</sup> dynamics or may simply reflect limited temporal resolution in distinguishing local from global signals due to rapid Ca<sup>2+</sup> diffusion. The authors should clarify how they reached the conclusion that FLWR-1 has a localized impact at synaptic boutons, given that synaptic and axonal signals appear similar. Based on the presented data, the evidence supporting a local effect of FLWR-1 on Ca<sup>2+</sup> dynamics appears limited.

      We apologize, here we simply overlooked this misleading wording in our rebuttal letter. The data we mentioned, showing no obvious difference in axon vs. bouton, are shown below, including time constants for the onset and the offset of the stimulus (data is peak normalized for better visualization):

      Author response image 1.

      One can see that axonal Ca<sup>2+</sup> signals may rise a bit slower than synaptic Ca<sup>2+</sup> signals, as expected for Ca<sup>2+</sup> entering the boutons, and then diffusing out into the axon. The loss of FLWR1 does not affect this. However, the temporal resolution of the used GCaMP6f sensor is ca. 200 ms to reach peak, and the decay time (to t1/2) is ca. 400 ms (PMID: 23868258). Thus, it would be difficult to see effects based on Ca<sup>2+</sup> diffusion using this assay. For the decay, this is similar for both axon and synapse, while flwr-1 mutants do not reduce Ca<sup>2+</sup> as much as wt. In the axon, there is a seemingly slightly slower reduction in flwr-1 mutants, however, given the kinetics of the sensor, this is likely not a meaningful difference. Therefore, we wrote we did not find differences. The interpretation should not have been that the impact of FLWR-1 is local. It may be true if one could image this at faster time scales, i.e. if there is more FLWR-1 localized in boutons (as indicated by our data showing FLWR-1 enrichment in boutons; Fig. 3), and when considering its possible effect on MCA-3 localization (and assuming that MCA-3 is the active player in Ca<sup>2+</sup> removal), i.e. FLWR-1 recruiting MCA-3 to boutons (Fig. 9C, D).  

      Reviewer #2 (Public review): 

      Summary: 

      The Flower protein is expressed in various cell types, including neurons. Previous studies in flies have proposed that Flower plays a role in neuronal endocytosis by functioning as a Ca<sup>2+</sup> channel. However, its precise physiological roles and molecular mechanisms in neurons remain largely unclear. This study employs C. elegans as a model to explore the function and mechanism of FLWR-1, the C. elegans homolog of Flower. This study offers intriguing observations that could potentially challenge or expand our current understanding of the Flower protein. Nevertheless, further clarification or additional experiments are required to substantiate the study's conclusions. 

      Strengths: 

      A range of approaches was employed, including the use of a flwr-1 knockout strain, assessment of cholinergic synaptic activity via analyzing aldicarb (a cholinesterase inhibitor) sensitivity, imaging Ca<sup>2+</sup> dynamics with GCaMP3, analyzing pHluorin fluorescence, examination of presynaptic ultrastructure by EM, and recording postsynaptic currents at the neuromuscular junction. The findings include notable observations on the effects of flwr-1 knockout, such as increased Ca<sup>2+</sup> levels in motor neurons, changes in endosome numbers in motor neurons, altered aldicarb sensitivity, and potential involvement of a Ca<sup>2+</sup>-ATPase and PIP2 binding in FLWR-1's function. 

      The authors have adequately addressed most of my previous concerns, however, I recommend minor revisions to further strengthen the study's rigor and interpretation: 

      Major suggestions 

      (1) This study relies heavily on aldicarb assays to support its conclusions. While these assays are valuable, their results may not fully align with direct assessment of neurotransmitter release from motor neurons. For instance, prior work has shown that two presynaptic modulators identified through aldicarb sensitivity assays exhibited no corresponding electrophysiological defects at the neuromuscular junction (Liu et al., J Neurosci 27: 10404-10413, 2007). Similarly, at least one study from the Kaplan lab has noted discrepancies between aldicarb assays and electrophysiological analyses. The authors should consider adding a few sentences in the Discussion to acknowledge this limitation and the potential caveats of using aldicarb assays, especially since some of the aldicarb assay results in this study are not easily interpretable. 

      Aldicarb assays have been used very successfully in identifying mutants with defects in chemical synaptic transmission, and entire genetic screens have been conducted this way. The reviewer is right, one needs to realize that it is the balance of excitation and inhibition at the NMJ of C. elegans, which underlies the effects on the rate of aldicarb-induced paralysis, not just cholinergic transmission. I.e. if a given mutant affects cholinergic and GABAergic transmission differently, things become difficult to interpret, particularly if also muscle physiology is affected. Therefore, we combined mutant analyses with cell-type specific rescue. We acknowledge that results are nonetheless difficult to interpret. We thus added a sentence in the first paragraph of the discussion.

      (2) The manuscript states, "Elevated Ca<sup>2+</sup> levels were not further enhanced in a flwr-1;mca-3 double mutant." (lines 549-550). However, Figure 7C does not include statistical comparisons between the single and double mutants of flwr-1 and mca-3. Please add the necessary statistical analysis to support this statement. 

      Because we only marked significant differences in that figure, and n.s. was not shown. This was stated in the figure legend.

      (3) The term "Ca<sup>2+</sup> influx" should be avoided, as this study does not provide direct evidence (e.g. voltage-clamp recordings of Ca<sup>2+</sup> inward currents in motor neurons) for an effect of the flwr-1 mutation of Ca<sup>2+</sup> influx. The observed increase in neuronal GCaMP signals in response to optogenetic activation of ChR2 may result from, or be influenced by, Ca<sup>2+</sup> mobilization from of intracellular stores. For example, optogenetic stimulation could trigger ryanodine receptor-mediated Ca<sup>2+</sup> release from the ER via calcium-induced calcium release (CICR) or depolarization-induced calcium release (DICR). It would be more appropriate to describe the observed increase in Ca<sup>2+</sup> signal as "Ca<sup>2+</sup> elevation" rather than increased "Ca<sup>2+</sup> influx". 

      Ok, yes, we can do this, we referred by ‘influx’ to cytosolic Ca<sup>2+</sup>, that fluxes into the cytosol, be it from the internal stores or the extracellular. Extracellular influx, more or less, inevitably will trigger further influx from internal stores, to our understanding. We changed this to “elevated Ca<sup>2+</sup> levels” or “Ca<sup>2+</sup> level rise” or “Ca<sup>2+</sup> level increase”.

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors):

      A thorough discussion on the impact of cell-autonomous versus non-cell-autonomous effects is necessary. 

      Revise and clarify the distinction between local and global Ca²⁺ changes. 

      see above.

      Reviewer #2 (Recommendations for the authors): 

      Minor suggestions 

      (1) In "Few-Ubi was shown to facilitate recovery of neurons following intense synaptic activity (Yao et al.,....." (lines 283-284), please specify which aspects of neuronal recovery are influenced by the Flower protein. 

      We added “refilling of SV pools”.

      (2) The abbreviation "Few-Ubi" is used for the Drosophila Flower protein (e.g., line 283, Figure 1A, and Figure 8A). Please clarify what "Ubi" stands for and verify whether its inclusion in the protein name is appropriate.

      This is inconsistent across the literature, sometimes Fwe-Ubi is also referred to as FweA. We now added this term. Ubi refers to ubiquitous (“Therefore, we named this isoform fweubi because it is expressed ubiquitously in imaginal discs“) (Rhiner 2010)

      (3) The manuscript uses "pflwr-1" (line 303 and elsewhere) to denote the flwr-1 promoter. This notation could be misleading, as it may be interpreted as a gene name. Please consider using either "flwr-1p" or "Pflwr-1" instead. Additionally, ensure proper italicization of gene names throughout the manuscript. 

      We changed this throughout. We will change to italicized at proof stage, it would be too timeconsuming to spot these incidents now.

      (4) The authors tagged the C-terminus of FLWR-1 by GFP (lines 321). The fusion protein is referred to as "GFP::FLWR-1" throughout the manuscript. Please verify whether "FLWR-1::GFP" would be the more appropriate designation.

      Thank you, yes, we changed this in the text, GFP is indeed N-terminal.

      (5) In "This did not show any additive effects...." (line 363), please clarify what "This" refers to. 

      Altered to “The combined rescues did not show any additive effects…”

      (6) In "..., supporting our previous finding of increased neurotransmitter release in GABAergic neurons" (lines 412-413), please provide a citation for the referenced previous study.

      This refers to our aldicarb data within this paper, just further up in the text. We removed “previous”.

      (7) Figure 4C, D examines the effect of flwr-1 mutation on body length in the genetic background of the unc-29 mutation, which selectively disrupts the levamisole-sensitive acetylcholine receptor. Please comment on the rationale for implicating only the levamisole receptor rather than the nicotinic acetylcholine receptor in muscle cells. 

      This was because we used a behavioral assay. Despite the fact that the homopentameric ACR16/N-AChR mediate about 2/3 of the peak currents in response to acute ACh application to the NMJ (e.g. Almedom et al., EMBO J, 2009), the acr-16 mutant has virtually no behavioral / locomotion phenotype. Likely, this is because the heteropentameric, UNC-29 containing LAChR, while only contributing 1/3 of the peak current, desensitizes much more slowly and thus unc-29 mutants show a severe behavioral phenotype (uncoordinated locomotion, etc.). We thus did not expect a major effect when performing the behavoral assay in acr-16 mutants and thus chose the unc-29 mutant background.

      (8) In "we found no evidence ....insertion into the PM (Yao et al., 2009)", It appears that the cited paper was not authored by any of the current manuscript. Please confirm whether this citation is correctly attributed. 

      This sentence was arranged in a misleading way, we did not mean that we authored this paper. It was change in the text: “While a facilitating role of Flower in endocytosis appears to be conserved in C. elegans, in contrast to previous findings from Drosophila (Yao et al., 2009), we found no evidence that FLWR-1 conducts Ca<sup>2+</sup> upon insertion into the PM.”

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      This study examines to what extent this phenomenon varies based on the visibility of the saccade target. Visibility is defined as the contrast level of the target with respect to the noise background, and it is related to the signal-to-noise ratio of the target. A more visible target facilitates the oculomotor behavior planning and execution, however, as speculated by the authors, it can also benefit foveal prediction even if the foveal stimulus visibility is maintained constant. Remarkably, the authors show that presenting a highly visible saccade target is beneficial for foveal vision as detection of stimuli with an orientation similar to that of the saccade target is improved, the lower is the saccade target visibility, the less prominent is this effect.

      Strengths:

      The results are convincing and the research methodology is technically sound.

      Weaknesses:

      It is still unclear why the pre-saccadic enhancement would oscillate for targets with higher opacity levels, and what would be the benefit of this oscillatory pattern. The authors do not speculate too much on this and loosely relate it to feedback processes, which are characterized by neural oscillations in a similar range.

      We thank the reviewer for their assessment. We intentionally decided to describe the oscillatory pattern without claiming to be able to pinpoint its origin. The finding was incidental and, based on psychophysical data alone, we would not feel comfortable doing anything but loosely relating it to potential mechanisms on an explicitly speculative basis. In the potential explanation we provide in the manuscript, the oscillatory pattern would likely not serve a benefit–rather, it would constitute an innate consequence and, thus, a coincidental perceptual signature of potential feedback processes.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors ran a dual task. Subjects monitored a peripheral location for a target onset (to generate a saccade to), and they also monitored a foveal location for a foveal probe. The foveal probe could be congruent or incongruent with the orientation of the peripheral target. In this study, the authors manipulated the conspicuity of the peripheral target, and they saw changes in performance in the foveal task. However, the changes were somewhat counterintuitive.

      We regret that our findings remain counterintuitive to the reviewer even after our extensive explanations in the previous revision round and the corresponding changes in the manuscript. We repeat that both the decrease in foveal Hit Rates and the increase in foveal enhancement with increasing target contrast were expected and preregistered prior to data collection.

      Strengths:

      The authors use solid analysis methods and careful experimental design.

      Comments on revisions:

      The authors have addressed my previous comments.

      One minor thing is that I am confused by their assertion that there was no smoothing in the manuscript (other than the newly added time course analysis). Figure 3A and Figure 6 seem to have smoothing to me.

      When the reviewer suggested that the “data appear too excessively smoothed” in the first revision, we assumed that they were referring to pre-saccadic foveal Hit and False Alarm rates, not to fitted distributions. As we state in the legend of Figure 3A (as well as in Figures 6 and S1), the “smoothed” curves constitute the probability density distributions of our raw data. Concerning the energy maps resulting from reverse correlation analyses, we described our proceeding in detail in our initial article (Kroell & Rolfs, 2022): 

      “Using this method, we obtained filter responses for 260 SF*ori combinations per noise image (Figure 6 in Materials and methods, ‘Stimulus analysis’). SFs ranged from 0.33 to 1.39 cpd (in 20 equal increments). Orientations ranged from –90–90° (in 13 equal increments). To normalize the resulting energy maps, we z-transformed filter responses using the mean and standard deviation of filter responses from the set of images presented in a certain session. To obtain more fine-grained maps, we applied 2D linear interpolations by iteratively halving the interval between adjacent values 4 times in each dimension. To facilitate interpretability, we flipped the energy maps of trials in which the target was oriented to the left. In all analyses and plots,+45° thus corresponds to the target’s orientation while –45° corresponds to the other potential probe orientation. Filter responses for all response types are provided at https://osf.io/v9gsq/.”

      We have added a pointer to this explanation to the current manuscript (see line 836).

      Another minor comment is related to the comment of Reviewer 1 about oscillations. Another possible reason for what looks like oscillations is saccadic inhibition. when the foveal probe appears, it can reset the saccade generation process. when aligned to saccade onset, this appears like a characteristic change in different parameters that is time-locked to saccade onset (about a 100 ms earlier). So, maybe the apparent oscillation is a manifestation of such resetting and it's not really an oscillation. so, I agree with Reviewer 1 about removing the oscillation sentence from the abstract.

      While we understand that a visible probe will result in saccadic inhibition (White & Rolfs, 2016), we are unsure how a resetting of the saccade generation process should manifest in increased perceptual enhancement of a specific, peripheral target orientation in the presaccadic fovea. Moreover, as we describe in our initial article (Kroell & Rolfs, 2022), we updated the background noise image every 50 ms and embedded our probe stimulus into the surrounding noise using smooth orientation filters and raised cosine masks to avoid a disruptive influence of probe appearance on movement planning and execution (Hanning, Deubel, & Szinte, 2019). And indeed, we demonstrated that the appearance of the foveal probe did not disrupt saccade preparation, that is, did not increase saccade latencies compared to ‘probe absent’ trials in which no foveal probe was presented (see Kroell & Rolfs, 2022; sections “Parameters of included saccades in Experiment 1” and “Parameters of included saccades in Experiment 2”). In the current submission, saccade latencies in ‘probe present’ trials exceeded saccade latencies in ‘probe absent’ trials by a mere 4.7±2.3 ms. Additionally, to inspect the variation of saccade execution frequency directly, we aligned the number of saccade generation instances to the onset of the foveal probe stimulus (see Author response image 1). In line with what we described in a previous paradigm employing flickering bandpass filtered noise patches (Kroell & Rolfs, 2021; 10.1016/j.cortex.2021.02.021), we observed a regular variation in saccade execution frequency that reflected the duration of an individual background noise image (50 ms in this investigation). In other words, the repeated dips in saccadic frequency are likely caused by the flickering background noise and not the onset of the foveal probe which would produce a single dip ~100 ms after probe onset. Given these results, we do not see a straight-forward explanation for how the variation of saccade execution frequency in 20 Hz intervals would boost peripheral-to-foveal feature prediction before the saccade in ~10 Hz intervals. Nonetheless, we removed the sentence referencing oscillations from the Abstract.

      Author response image 1.

       

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Overall, The authors did a good job in addressing the points I raised. Two new sections were added to the manuscript, one to address how the mechanisms of foveal predictions would play out in natural viewing conditions, and another one examining more in depth the potential neural mechanisms implicated in foveal predictions. I found these two sections to be quite speculative, and at points, a bit convoluted but could help the reader get the bigger picture. I still do not have a clear sense of why the pre-saccadic enhancement would oscillate for targets with higher opacity levels, and what would be the benefit of this oscillatory pattern. The authors do not speculate too much on this and loosely relate it to feedback processes, which are characterized by neural oscillations in a similar range.  

      Please see our response to ‘Weaknesses’.

      I still find this a loose connection and would suggest removing the following phrase from the abstract "Interestingly, the temporal frequency of these oscillations corresponded to the frequency range typically associated with neural feedback signaling". 

      We have removed this phrase.

      Finally, the authors should specify how much of this oscillation is due to oscillations in HR of cong vs. oscillations in HR of incongruent trials or both.

      We fitted separate polynomials to congruent and incongruent Hit Rates instead of their difference. Peaks in enhancement relied on both, oscillatory increases in congruent Hit Rates and simultaneous decreases in incongruent Hit Rates. In other words, enhancement peaks appear to reflect a foveal enhancement of target-congruent feature information along with a concurrent suppression of target-incongruent features. We added this paragraph and Figure 4 to the Results section.

      Additional changes:

      Two figures had accidentally been labeled as Figure 5 in our first revision. We corrected the figure legends and all corresponding figure references in the text.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      As to the exceptionally minor issue, namely, correction for multiple statistical tests (minor because the data and the error are presented in the text). We have now conducted one-way ANOVA to back the data displayed in Fig 4A., and Supp. Figs 19 and 21. In each case ANOVA revealed a highly significant difference among means: Dunnett’s post hoc test was then used to test each result against SBW25, with the multiple comparisons corrected for in the analysis.

      This resulted in changes to the description of the statistical analysis in the following captions:

      To Figure 4.

      Where we previously referred to paired t-tests we now state:  ANOVA revealed a highly significant difference among means [F<sub>7,16</sub> = 8.19, p < 0.001] with Dunnett’s post-hoc test adjusted for multiple comparisons showing that five genotypes (*) differ significantly (p < 0.05) from SBW25.

      To Supplementary Figure 19.

      Where we previously referred to paired t-tests we now state: ANOVA revealed a highly significant difference among means [F<sub>7,16</sub> = 16.74, p < 0.001] with Dunnett’s post-hoc test adjusted for multiple comparisons showing that three genotypes (*) differ significantly (p < 0.05) from SBW25.

      To Supplementary Figure 21.

      Where we previously referred to paired t-tests we now state:  ANOVA revealed a highly significant difference among means [F<sub>7,89</sub> = 9.97, p < 0.0001] with Dunnett’s post-hoc test adjusted for multiple comparisons showing that SBW25 ∆mreB and SBW25 ∆PFLU4921-4925 are significantly different (*) from SBW25 (p < 0.05).


      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      The authors performed experimental evolution of MreB mutants that have a slow-growing round phenotype and studied the subsequent evolutionary trajectory using analysis tools from molecular biology. It was remarkable and interesting that they found that the original phenotype was not restored (most common in these studies) but that the round phenotype was maintained. 

      Strengths: 

      The finding that the round phenotype was maintained during evolution rather than that the original phenotype, rod-shaped cells, was recovered is interesting. The paper extensively investigates what happens during adaptation with various different techniques. Also, the extensive discussion of the findings at the end of the paper is well thought through and insighXul. 

      Weaknesses: 

      I find there are three general weaknesses: 

      (1) Although the paper states in the abstract that it emphasizes "new knowledge to be gained" it remains unclear what this concretely is. On page 4 they state 3 three research questions, these could be more extensively discussed in the abstract. Also, these questions read more like genetics questions while the paper is a lot about cell biological findings. 

      Thank you for drawing attention to the unnecessary and gratuitous nature of the last sentence of the Abstract. We are in agreement. It has been modified, and we have taken  advantage of additional word space to draw attention to the importance of the two competing (testable) hypotheses laid out in the Discussion. 

      As to new knowledge, please see the Results and particularly the Discussion. But beyond this, and as recognised by others, there is real value for cell biology in seeing how (and whether) selection can compensate for effects that are deleterious to fitness. The results will very often depart from those delivered from, for example, suppressor analyses, or bottom up engineering. 

      In the work recounted in our paper, we chose to focus – by way of proof-of principle – on the most commonly observed mutations, namely, those within pbp1A.  But beyond this gene, we detected mutations  in other components of the cell shape / division machinery whose connections are not yet understood and which are the focus of on-going investigation.  

      As to the three questions posed at the end of the Introduction, the first concerns whether selection can compensate for deleterious effects of deleting mreB (a question that pertains to evolutionary aspects); the second seeks understanding of genetic factors; the third aims to shed light on the genotype-to-phenotype map (which is where the cell biology comes into play).  Given space restrictions, we cannot see how we could usefully expand, let alone discuss, the three questions raised at the end of the Introduction in restrictive space available in the Abstract.   

      (2) It is not clear to me from the text what we already know about the restoration of MreB loss from suppressors studies (in the literature). Are there suppressor screens in the literature and which part of the findings is consistent with suppressor screens and which parts are new knowledge?  

      As stated in the Introduction, a previous study with B. subtilis (which harbours three MreB isoforms and where the isoform named “MreB” is essential for growth under normal conditions), suppressors of MreB lethality were found to occur in ponA, a class A penicillin binding protein (Kawai et al., 2009). This led to recognition that MreB plays a role in recruiting Pbp1A to the lateral cell wall. On the other hand, Patel et al. (2020) have shown that deletion of classA PBPs leads to an up-regulation of rod complex activity. Although there is a connection between rod complex and class A PBPs, a further study has shown that the two systems work semi-autonomously (Cho et al., 2016). 

      Our work confirms a connection between MreB and Pbp1A, and has shed new light on how this interaction is established by means of natural selection, which targets the integrity of cell wall. Indeed, the Rod complex and class A PBPs have complementary activities in the building of the cell wall with each of the two systems able to compensate for the other in order to maintain cell wall integrity. Please see the major part of the Discussion. In terms of specifics, the connection between mreB and pbp1A (shown by Kawai et al (2009)) is indirect because it is based on extragenic transposon insertions. In our study, the genetic connection is mechanistically demonstrated.  In addition, we capture that the evolutionary dynamics is rapid and we finally enriched understanding of the genotype-to-phenotype map.

      (3) The clarity of the figures, captions, and data quantification need to be improved.  

      Modifications have been implemented. Please see responses to specific queries listed below.

      Reviewer #2 (Public Review): 

      Yulo et al. show that deletion of MreB causes reduced fitness in P. fluorescens SBW25 and that this reduction in fitness may be primarily caused by alterations in cell volume. To understand the effect of cell volume on proliferation, they performed an evolution experiment through which they predominantly obtained mutations in pbp1A that decreased cell volume and increased viability. Furthermore, they provide evidence to propose that the pbp1A mutants may have decreased PG cross-linking which might have helped in restoring the fitness by rectifying the disorganised PG synthesis caused by the absence of MreB. Overall this is an interesting study. 

      Queries: 

      Do the small cells of mreB null background indeed have no DNA? It is not apparent from the DAPI images presented in Supplementary Figure 17. A more detailed analysis will help to support this claim. 

      It is entirely possible that small cells have no DNA, because if cell division is aberrant then division can occur prior to DNA segregation resulting in cells with no DNA. It is clear from microscopic observation that both small and large cells do not divide. It is, however, true, that we are unable to state – given our measures of DNA content – that small cells have no DNA. We have made this clear on page 13, paragraph 2.

      What happens to viability and cell morphology when pbp1A is removed in the mreB null background? If it is actually a decrease in pbp1A activity that leads to the rescue, then pbp1A- mreB- cells should have better viability, reduced cell volume and organised PG synthesis. Especially as the PG cross-linking is almost at the same level as the T362 or D484 mutant.  

      Please see fitness data in Supp. Fig. 13. Fitness of ∆mreBpbp1A is no different to that caused by a point mutation. Cells remain round.  

      What is the status of PG cross-linking in ΔmreB Δpflu4921-4925 (Line 7)? 

      This was not analysed as the focus of this experiment was PBPs. A priori, there is no obvious reason to suspect that ∆4921-25 (which lacks oprD) would be affected in PBP activity.

      What is the morphology of the cells in Line 2 and Line 5? It may be interesting to see if PG cross-linking and cell wall synthesis is also altered in the cells from these lines. 

      The focus of investigation was restricted to L1, L4 and L7. Indeed, it would be interesting to look at the mutants harbouring mutations in :sZ, but this is beyond scope of the present investigation (but is on-going). The morphology of L2 and L5 are shown in Supp. Fig. 9.

      The data presented in 4B should be quantified with appropriate input controls. 

      Band intensity has now been quantified (see new Supp. Fig .20). The controls are SBW25, SBW25∆pbp1A, SBW25 ∆mreB and SBW25 ∆mreBpbp1A as explained in the paper.

      What are the statistical analyses used in 4A and what is the significance value? 

      Our oversight. These were reported in Supp. Fig. 19, but should also have been presented in Fig. 4A. Data are means of three biological replicates. The statistical tests are comparisons between each mutant and SBW25, and assessed by paired t-tests.  

      A more rigorous statistical analysis indicating the number of replicates should be done throughout. 

      We have checked and made additions where necessary and where previously lacking. In particular, details are provided in Fig. 1E, Fig. 4A and Fig. 4B. For Fig. 4C we have produced quantitative measures of heterogeneity in new cell wall insertion. These are reported in Supp. Fig. 21 (and referred to in the text and figure caption) and show that patterns of cell wall insertion in ∆mreB are highly heterogeneous.

      Reviewer #3 (Public Review): 

      This paper addresses an understudied problem in microbiology: the evolution of bacterial cell shape. Bacterial cells can take a range of forms, among the most common being rods and spheres. The consensus view is that rods are the ancestral form and spheres the derived form. The molecular machinery governing these different shapes is fairly well understood but the evolutionary drivers responsible for the transition between rods and spheres are not. Enter Yulo et al.'s work. The authors start by noting that deletion of a highly conserved gene called MreB in the Gram-negative bacterium Pseudomonas fluorescens reduces fitness but does not kill the cell (as happens in other species like E. coli and B. subtilis) and causes cells to become spherical rather than their normal rod shape. They then ask whether evolution for 1000 generations restores the rod shape of these cells when propagated in a rich, benign medium. 

      The answer is no. The evolved lineages recovered fitness by the end of the experiment, growing just as well as the unevolved rod-shaped ancestor, but remained spherical. The authors provide an impressively detailed investigation of the genetic and molecular changes that evolved. Their leading results are: 

      (1) The loss of fitness associated with MreB deletion causes high variation in cell volume among sibling cells after cell division. 

      (2) Fitness recovery is largely driven by a single, loss-of-function point mutation that evolves within the first ~250 generations that reduces the variability in cell volume among siblings. 

      (3) The main route to restoring fitness and reducing variability involves loss of function mutations causing a reduction of TPase and peptidoglycan cross-linking, leading to a disorganized cell wall architecture characteristic of spherical cells. 

      The inferences made in this paper are on the whole well supported by the data. The authors provide a uniquely comprehensive account of how a key genetic change leads to gains in fitness and the spectrum of phenotypes that are impacted and provide insight into the molecular mechanisms underlying models of cell shape. 

      Suggested improvements and clarifications include: 

      (1) A schematic of the molecular interactions governing cell wall formation could be useful in the introduction to help orient readers less familiar with the current state of knowledge and key molecular players. 

      We understand that this would be desirable, but there are numerous recent reviews with detailed schematics that we think the interested reader would be better consulting. These are referenced in the text.

      (2) More detail on the bioinformatics approaches to assembling genomes and identifying the key compensatory mutations are needed, particularly in the methods section. This whole subject remains something of an art, with many different tools used. Specifying these tools, and the parameter settings used, will improve transparency and reproducibility, should it be needed. 

      We overlooked providing this detail, which has now been corrected by provision of more information in the Materials and Methods. In short we used Breseq, the clonal option, with default parameters. Additional analyses were conducted using Genieous. The BreSeq output files are provided https://doi.org/10.17617/3.CU5SX1 (which include all read data).

      (3) Corrections for multiple comparisons should be used and reported whenever more than one construct or strain is compared to the common ancestor, as in Supplementary Figure 19A (relative PG density of different constructs versus the SBW25 ancestor). 

      The data presented in Supp Fig 19A (and Fig 4A) do not involve multiple comparisons. In each instance the comparison is between SBW25 and each of the different mutants. A paired t-test is thus appropriate.

      (4) The authors refrain from making strong claims about the nature of selection on cell shape, perhaps because their main interest is the molecular mechanisms responsible. However, I think more can be said on the evolutionary side, along two lines. First, they have good evidence that cell volume is a trait under strong stabilizing selection, with cells of intermediate volume having the highest fitness. This is notable because there are rather few examples of stabilizing selection where the underlying mechanisms responsible are so well characterized. Second, this paper succeeds in providing an explanation for how spherical cells can readily evolve from a rod-shaped ancestor but leaves open how rods evolved in the first place. Can the authors speculate as to how the complex, coordinated system leading to rods first evolved? Or why not all cells have lost rod shape and become spherical, if it is so easy to achieve? These are important evolutionary questions that remain unaddressed. The manuscript could be improved by at least flagging these as unanswered questions deserving of further attention. 

      These are interesting points, but our capacity to comment is entirely speculative. Nonetheless, we have added an additional paragraph to the Discussion that expresses an opinion that has yet to receive attention:

      “Given the complexity of the cell wall synthesis machinery that defines rod-shape in bacteria, it is hard to imagine how rods could have evolved prior to cocci. However, the cylindrical shape offers a number of advantages. For a given biomass (or cell volume), shape determines surface area of the cell envelope, which is the smallest surface area associated with the spherical shape. As shape sets the surface/volume ratio, it also determines the ratio between supply (proportional to the surface) and demand (proportional to cell volume). From this point of view, it is more efficient to be cylindrical (Young 2006). This also holds for surface attachment and biofilm formation (Young 2006). But above all, for growing cells, the ratio between supply and demand is constant in rod shaped bacteria, whereas it decreases for cocci. This requires that spherical cells evolve complex regulatory networks capable of maintaining the correct concentration of cellular proteins despite changes in surface/volume ratio. From this point of view, rod-shaped bacteria offer opportunities to develop unsophisticated regulatory networks.”

      why not all cells have lost rod shape and become spherical.

      Please see Kevin Young’s 2006 review on the adaptive significance of cell shape

      The value of this paper stems both from the insight it provides on the underlying molecular model for cell shape and from what it reveals about some key features of the evolutionary process. The paper, as it currently stands, provides more on which to chew for the molecular side than the evolutionary side. It provides valuable insights into the molecular architecture of how cells grow and what governs their shape. The evolutionary phenomena emphasized by the authors - the importance of loss-of-function mutations in driving rapid compensatory fitness gains and that multiple genetic and molecular routes to high fitness are often available, even in the relatively short time frame of a few hundred generations - are well understood phenomena and so arguably of less broad interest. The more compelling evolutionary questions concern the nature and cause of stabilizing selection (in this case cell volume) and the evolution of complexity. The paper misses an opportunity to highlight the former and, while claiming to shed light on the latter, provides rather little useful insight. 

      Thank you for these thoughts and comments. However, we disagree that the experimental results are an overlooked opportunity to discuss stabilising selection. Stabilising selection occurs when selection favours a particular phenotype causing a reduction in underpinning population-level genetic diversity. This is not happening when selection acts on SBW25 ∆mreB leading to a restoration of fitness. Driving the response are biophysical factors, primarily the critical need to balance elongation rate with rate of septation. This occurs without any change in underlying genetic diversity.  

      Recommendations for the authors:  

      Reviewer 1 (Recommendations for the Authors): 

      Hereby my suggestion for improvement of the quantification of the data, the figures, and the text. 

      -  p 14, what is the unit of elongation rate?  

      At first mention we have made clear that the unit is given in minutes^-1

      -  p 14, please give an error bar for both p=0.85 and f=0.77, to be able to conclude they are different 

      Error on the probability p is estimated at the 95% confidence interval by the formula:1.96 , where N is the total number of cells. This has been added in the paragraph p »probability » of the Image Analysis section in the Material and Methods. 

      We also added errors on p measurement in the main text.

      -  p 14, all the % differences need an errorbar 

      The error bars and means are given in Fig 3C and 3D.

      -  Figure 1B adds units to compactness, and what does it represent? Is the cell size the estimated volume (that is mentioned in the caption)? Shouldn't the datapoints have error bars? 

      Compactness is defined in the “Image Analysis” section of the Material and Methods. It is a dimensionless parameter. The distribution of individual cell shapes / sizes are depicted in Fig 1B. Error does arise from segmentation, but the degree of variance (few pixels) is much smaller than the representations of individual cells shown.

      -  Figure 1C caption, are the 50.000 cells? 

      Correct. Figure caption has been altered.

      -  Figure 1D, first the elongation rate is described as a volume per minute, but now, looking at the units it is a rate, how is it normalized? 

      Elongation rate is explained in the Materials and Methods (see the image analysis section) and is not volume per minute. It is dV/dt = r*V (the unit of r is min^-1). Page 9 includes specific mention of the unit of r.

      -  Figure 1E, how many cells (n) per replicate? 

      Our apologies. We have corrected the figure caption that now reads:

      “Proportion of live cells in ancestral SBW25 (black bar) and ΔmreB (grey bar) based on LIVE/DEAD BacLight Bacterial Viability Kit protocol. Cells were pelleted at 2,000 x g for 2 minutes to preserve ΔmreB cell integrity. Error bars are means and standard deviation of three biological replicates (n>100).”

      -  Figure 1G, how does this compare to the wildtype 

      The volume for wild type SBW25 is 3.27µm^3 (within the “white zone”). This is mentioned in the text.

      -  Figure 2B, is this really volume, not size? And can you add microscopy images? 

      The x-axis is volume (see Materials and Methods, subsection image analysis). Images are available in Supp. Fig. 9.

      -  Figure 3A what does L1, L4 and L7 refer too? Is it correct that these same lines are picked for WT and delta_mreB 

      Thank you for pointing this out. This was an earlier nomenclature. It was shorthand for the mutants that are specified everywhere else by genotype and has now been corrected. 

      -  Figure 3c: either way write out p, so which probability, or you need a simple cartoon that is plotted. 

      The value p is the probability to proceed to the next generation and is explained in Materials and Methods  subsection image analysis.  We feel this is intuitive and does not require a cartoon. We nonetheless added a sentence to the Materials and Methods to aid clarity.

      -  Figure 4B can you add a ladder to the gel? 

      No ladder was included, but the controls provide all the necessary information. The band corresponding to PBP1A is defined by presence in SBW25, but absence in SBW25 ∆pbp1A.

      -  Figure 4c, can you improve the quantification of these images? How were these selected and how well do they represent the community? 

      We apologise for the lack of quantitative description for data presented in Fig 4C. This has now been corrected. In brief, we measured the intensity of fluorescent signal from between 10 and 14 cells and computed the mean and standard deviation of pixel intensity for each cell. To rule out possible artifacts associated with variation of the mean intensity, we calculated the ratio of the standard deviation divided by the square root of the mean. These data reveal heterogeneity in cell wall synthesis and provide strong statistical support for the claim that cell wall synthesis in ∆mreB is significantly more heterogeneous than the control. The data are provided in new Supp. Fig. 21. 

      Minor comments: 

      -  It would be interesting if the findings of this experimental evolution study could be related to comparative studies (if these have ever been executed).  

      Little is possible, but Hendrickson and Yulo published a portion of the originally posted preprint separately. We include a citation to that paper. 

      -  p 13, halfway through the page, the second paragraph lacks a conclusion, why do we care about DNA content? 

      It is a minor observation that was included by way of providing a complete description of cell phenotype.  

      -  p 17, "suggesting that ... loss-of-function", I do no not understand what this is based upon. 

      We show that the fitness of a pbp1A deletion is indistinguishable from the fitness of one of the pbp1A point mutants. This fact establishes that the point mutation had the same effects as a gene deletion thus supporting the claim that the point mutations identified during the course of the selection experiment decrease (or destroy) PBP1A function.

      -  p 25, at the top of the page: do you have a reference for the statement that a disorganized cell wall architecture is suited to the topology of spherical cells? 

      The statement is a conclusion that comes from our reasoning. It stems from the fact that it is impossible to entirely map the surface of a sphere with parallel strands.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      To the Senior Editor and the Reviewing Editor:

      We sincerely appreciate the valuable comments provided by the reviewers, the reviewing editor, and the senior editor. After carefully reviewing and considering the comments, we have addressed the key concerns raised by the reviewers and made appropriate modifications to the article in the revised manuscript.

      The main revisions made to the manuscript are as follows:

      1) We have added comparison experiments with TNDM (see Fig. 2 and Fig. S2).

      2) We conducted new synthetic experiments to demonstrate that our conclusions are not a by-product of d-VAE (see Fig. S2 and Fig. S11).

      3) We have provided a detailed explanation of how our proposed criteria, especially the second criterion, can effectively exclude the selection of unsuitable signals.

      4) We have included a semantic overview figure of d-VAE (Fig. S1) and a visualization plot of latent variables (Fig. S13).

      5) We have elaborated on the model details of d-VAE, as well as the hyperparameter selection and experimental settings of other comparison models.

      We believe these revisions have significantly improved the clarity and comprehensibility of the manuscript. Thank you for the opportunity to address these important points.

      Reviewer #1

      Q1: “First, the model in the paper is almost identical to an existing VAE model (TNDM) that makes use of weak supervision with behaviour in the same way [1]. This paper should at least be referenced. If the authors wish they could compare their model to TNDM, which combines a state space model with smoothing similar to LFADS. Given that TNDM achieves very good behaviour reconstructions, it may be on par with this model without the need for a Kalman filter (and hence may achieve better separation of behaviour-related and unrelated dynamics).”

      Our model significantly differs from TNDM in several aspects. While TNDM also constrains latent variables to decode behavioral information, it does not impose constraints to maximize behavioral information in the generated relevant signals. The trade-off between the decoding and reconstruction capabilities of generated relevant signals is the most significant contribution of our approach, which is not reflected in TNDM. In addition, the backbone network of signal extraction and the prior distribution of the two models are also different.

      It's worth noting that our method does not require a Kalman filter. Kalman filter is used for post hoc assessment of the linear decoding ability of the generated signals. Please note that extracting and evaluating relevant signals are two distinct stages.

      Heeding your suggestion, we have incorporated comparison experiments involving TNDM into the revised manuscript. Detailed information on model hyperparameters and training settings can be found in the Methods section in the revised manuscripts.

      Thank you for your valuable feedback.

      Q2: “Second, in my opinion, the claims regarding identifiability are overstated - this matters as the results depend on this to some extent. Recent work shows that VAEs generally suffer from identifiability problems due to the Gaussian latent space [2]. This paper also hints that weak supervision may help to resolve such issues, so this model as well as TNDM and CEBRA may indeed benefit from this. In addition however, it appears that the relative weight of the KL Divergence in the VAE objective is chosen very small compared to the likelihood (0.1%), so the influence of the prior is weak and the model may essentially learn the average neural trajectories while underestimating the noise in the latent variables. This, in turn, could mean that the model will not autoencode neural activity as well as it should, note that an average R2 in this case will still be high (I could not see how this is actually computed). At the same time, the behaviour R2 will be large simply because the different movement trajectories are very distinct. Since the paper makes claims about the roles of different neurons, it would be important to understand how well their single trial activities are reconstructed, which can perhaps best be investigated by comparing the Poisson likelihood (LFADS is a good baseline model). Taken together, while it certainly makes sense that well-tuned neurons contribute more to behaviour decoding, I worry that the very interesting claim that neurons with weak tuning contain behavioural signals is not well supported.”

      We don’t think our distilled signals are average neural trajectories without variability. The quality of reconstructing single trial activities can be observed in Figure 3i and Figure S4. Neural trajectories in Fig. 3i and Fig. S4 show that our distilled signals are not average neural trajectories. Furthermore, if each trial activity closely matched the average neural trajectory, the Fano Factor (FF) should theoretically approach 0. However, our distilled signals exhibit a notable departure from this expectation, as evident in Figure 3c, d, g, and f. Regarding the diminished influence of the KL Divergence: Given that the ground truth of latent variable distribution is unknown, even a learned prior distribution might not accurately reflect the true distribution. We found the pronounced impact of the KL divergence would prove detrimental to the decoding and reconstruction performance. As a result, we opt to reduce the weight of the KL divergence term. Even so, KL divergence can still effectively align the distribution of latent variables with the distribution of prior latent variables, as illustrated in Fig. S13. Notably, our goal is extracting behaviorally-relevant signals from given raw signals rather than generating diverse samples from the prior distribution. When aim to separating relevant signals, we recommend reducing the influence of KL divergence. Regarding comparing the Poisson likelihood: We compared Poisson log-likelihood among different methods (except PSID since their obtained signals have negative values), and the results show that d-VAE outperforms other methods.

      Author response image 1.

      Regarding how R2 is computed: , where and denote ith sample of raw signals, ith sample of distilled relevant signals, and the mean of raw signals. If the distilled signals exactly match the raw signals, the sum of squared error is zero, thus R2=1. If the distilled signals always are equal to R2=0. If the distilled signals are worse than the mean estimation, R2 is negative, negative R2 is set to zero.

      Thank you for your valuable feedback.

      Q3: “Third, and relating to this issue, I could not entirely follow the reasoning in the section arguing that behavioural information can be inferred from neurons with weak selectivity, but that it is not linearly decodable. It is right to test if weak supervision signals bleed into the irrelevant subspace, but I could not follow the explanations. Why, for instance, is the ANN decoder on raw data (I assume this is a decoder trained fully supervised) not equal in performance to the revenant distilled signals? Should a well-trained non-linear decoder not simply yield a performance ceiling? Next, if I understand correctly, distilled signals were obtained from the full model. How does a model perform trained only on the weakly tuned neurons? Is it possible that the subspaces obtained with the model are just not optimally aligned for decoding? This could be a result of limited identifiability or model specifics that bias reconstruction to averages (a well-known problem of VAEs). I, therefore, think this analysis should be complemented with tests that do not depend on the model.”

      Regarding “Why, for instance, is the ANN decoder on raw data (I assume this is a decoder trained fully supervised) not equal in performance to the relevant distilled signals? Should a well-trained non-linear decoder not simply yield a performance ceiling?”: In fact, the decoding performance of raw signals with ANN is quite close to the ceiling. However, due to the presence of significant irrelevant signals in raw signals, decoding models like deep neural networks are more prone to overfitting when trained on noisy raw signals compared to behaviorally-relevant signals. Consequently, we anticipate that the distilled signals will demonstrate superior decoding generalization. This phenomenon is evident in Fig. 2 and Fig. S1, where the decoding performance of the distilled signals surpasses that of the raw signals, albeit not by a substantial margin.

      Regarding “Next, if I understand correctly, distilled signals were obtained from the full model. How does a model perform trained only on the weakly tuned neurons? Is it possible that the subspaces obtained with the model are just not optimally aligned for decoding?”:Distilled signals (involving all neurons) are obtained by d-VAE. Subsequently, we use ANN to evaluate the performance of smaller and larger R2 neurons. Please note that separating and evaluating relevant signals are two distinct stages.

      Regarding the reasoning in the section arguing that smaller R2 neurons encode rich information, we would like to provide a detailed explanation:

      1) After extracting relevant signals through d-VAE, we specifically selected neurons characterized by smaller R2 values (Here, R2 signifies the proportion of neuronal activity variance explained by the linear encoding model, calculated using raw signals). Subsequently, we employed both KF and ANN to assess the decoding performance of these neurons. Remarkably, our findings revealed that smaller R2 neurons, previously believed to carry limited behavioral information, indeed encode rich information.

      2) In a subsequent step, we employed d-VAE to exclusively distill the raw signals of these smaller R2 neurons (distinct from the earlier experiment where d-VAE processed signals from all neurons). We then employed KF and ANN to evaluate the distilled smaller R2 neurons. Interestingly, we observed that we could not attain the same richness of information solely through the use of these smaller R2 neurons.

      3) Consequently, we put forth and tested two hypotheses: First, that larger R2 neurons introduce additional signals into the smaller R2 neurons that do not exist in the real smaller R2 neurons. Second, that larger R2 neurons aid in restoring the original appearance of impaired smaller R2 neurons. Our proposed criteria and synthetic experiments substantiate the latter scenario.

      Thank you for your valuable feedback.

      Q4: “Finally, a more technical issue to note is related to the choice to learn a non-parametric prior instead of using a conventional Gaussian prior. How is this implemented? Is just a single sample taken during a forward pass? I worry this may be insufficient as this would not sample the prior well, and some other strategy such as importance sampling may be required (unless the prior is not relevant as it weakly contributed to the ELBO, in which case this choice seems not very relevant). Generally, it would be useful to see visualisations of the latent variables to see how information about behaviour is represented by the model.”

      Regarding "how to implement the prior?": Please refer to Equation 7 in the revised manuscript; we have added detailed descriptions in the revised manuscript.

      Regarding "Generally, it would be useful to see visualizations of the latent variables to see how information about behavior is represented by the model.": Note that our focus is not on latent variables but on distilled relevant signals. Nonetheless, at your request, we have added the visualization of latent variables in the revised manuscript. Please see Fig. S13 for details.

      Thank you for your valuable feedback.

      Recommendations: “A minor point: the word 'distill' in the name of the model may be a little misleading - in machine learning the term refers to the construction of smaller models with the same capabilities.

      It should be useful to add a schematic picture of the model to ease comparison with related approaches.”

      In the context of our model's functions, it operates as a distillation process, eliminating irrelevant signals and retaining the relevant ones. Although the name of our model may be a little misleading, it faithfully reflects what our model does.

      I have added a schematic picture of d-VAE in the revised manuscript. Please see Fig. S1 for details.

      Thank you for your valuable feedback.

      Reviewer #2

      Q1: “Is the apparently increased complexity of encoding vs decoding so unexpected given the entropy, sparseness, and high dimensionality of neural signals (the "encoding") compared to the smoothness and low dimensionality of typical behavioural signals (the "decoding") recorded in neuroscience experiments? This is the title of the paper so it seems to be the main result on which the authors expect readers to focus. ”

      We use the term "unexpected" due to the disparity between our findings and the prior understanding concerning neural encoding and decoding. For neural encoding, as we said in the Introduction, in previous studies, weakly-tuned neurons are considered useless, and smaller variance PCs are considered noise, but we found they encode rich behavioral information. For neural decoding, the nonlinear decoding performance of raw signals is significantly superior to linear decoding. However, after eliminating the interference of irrelevant signals, we found the linear decoding performance is comparable to nonlinear decoding. Rooted in these findings, which counter previous thought, we employ the term "unexpected" to characterize our observations.

      Thank you for your valuable feedback.

      Q2: “I take issue with the premise that signals in the brain are "irrelevant" simply because they do not correlate with a fixed temporal lag with a particular behavioural feature hand-chosen by the experimenter. As an example, the presence of a reward signal in motor cortex [1] after the movement is likely to be of little use from the perspective of predicting kinematics from time-bin to time-bin using a fixed model across trials (the apparent definition of "relevant" for behaviour here), but an entire sub-field of neuroscience is dedicated to understanding the impact of these reward-related signals on future behaviour. Is there method sophisticated enough to see the behavioural "relevance" of this brief, transient, post-movement signal? This may just be an issue of semantics, and perhaps I read too much into the choice of words here. Perhaps the authors truly treat "irrelevant" and "without a fixed temporal correlation" as synonymous phrases and the issue is easily resolved with a clarifying parenthetical the first time the word "irrelevant" is used. But I remain troubled by some claims in the paper which lead me to believe that they read more deeply into the "irrelevancy" of these components.”

      In this paper, we employ terms like ‘behaviorally-relevant’ and ‘behaviorally-irrelevant’ only regarding behavioral variables of interest measured within a given task, such as arm kinematics during a motor control task. A similar definition can be found in the PSID[1].

      Thank you for your valuable feedback.

      [1] Sani, Omid G., et al. "Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification." Nature Neuroscience 24.1 (2021): 140-149.

      Q3: “The authors claim the "irrelevant" responses underpin an unprecedented neuronal redundancy and reveal that movement behaviors are distributed in a higher-dimensional neural space than previously thought." Perhaps I just missed the logic, but I fail to see the evidence for this. The neural space is a fixed dimensionality based on the number of neurons. A more sparse and nonlinear distribution across this set of neurons may mean that linear methods such as PCA are not effective ways to approximate the dimensionality. But ultimately the behaviourally relevant signals seem quite low-dimensional in this paper even if they show some nonlinearity may help.”

      The evidence for the “useless” responses underpin an unprecedented neuronal redundancy is shown in Fig. 5a, d and Fig. S9a. Specifically, the sum of the decoding performance of smaller R2 neurons and larger R2 neurons is significantly greater than that of all neurons for relevant signals (red bar), demonstrating that movement parameters are encoded very redundantly in neuronal population. In contrast, we can not find this degree of neural redundancy in raw signals (purple bar).

      The evidence for the “useless” responses reveal that movement behaviors are distributed in a higher-dimensional neural space than previously thought is shown in the left plot (involving KF decoding) of Fig. 6c, f and Fig. S9f. Specifically, the improvement of KF using secondary signals is significantly higher than using raw signals composed of the same number of dimensions as the secondary signals. These results demonstrate that these dimensions, spanning roughly from ten to thirty, encode much information, suggesting that behavioral information exists in a higher-dimensional subspace than anticipated from raw signals.

      Thank you for your valuable feedback.

      Q5: “there is an apparent logical fallacy that begins in the abstract and persists in the paper: "Surprisingly, when incorporating often-ignored neural dimensions, behavioral information can be decoded linearly as accurately as nonlinear decoding, suggesting linear readout is performed in motor cortex." Don't get me wrong: the equivalency of linear and nonlinear decoding approaches on this dataset is interesting, and useful for neuroscientists in a practical sense. However, the paper expends much effort trying to make fundamental scientific claims that do not feel very strongly supported. This reviewer fails to see what we can learn about a set of neurons in the brain which are presumed to "read out" from motor cortex. These neurons will not have access to the data analyzed here. That a linear model can be conceived by an experimenter does not imply that the brain must use a linear model. The claim may be true, and it may well be that a linear readout is implemented in the brain. Other work [2,3] has shown that linear readouts of nonlinear neural activity patterns can explain some behavioural features. The claim in this paper, however, is not given enough”

      Due to the limitations of current observational methods and our incomplete understanding of brain mechanisms, it is indeed challenging to ascertain the specific data the brain acquires to generate behavior and whether it employs a linear readout. Conventionally, the neural data recorded in the motor cortex do encode movement behaviors and can be used to analyze neural encoding and decoding. Based on these data, we found that the linear decoder KF achieves comparable performance to that of the nonlinear decoder ANN on distilled relevant signals. This finding has undergone validation across three widely used datasets, providing substantial evidence. Furthermore, we conducted experiments on synthetic data to show that this conclusion is not a by-product of our model. In the revised manuscript, we added a more detailed description of this conclusion.

      Thank you for your valuable feedback.

      Q6: “Relatedly, I would like to note that the exercise of arbitrarily dividing a continuous distribution of a statistic (the "R2") based on an arbitrary threshold is a conceptually flawed exercise. The authors read too much into the fact that neurons which have a low R2 w.r.t. PDs have behavioural information w.r.t. other methods. To this reviewer, it speaks more about the irrelevance, so to speak, of the preferred direction metric than anything fundamental about the brain.”

      We chose the R2 threshold in accordance with the guidelines provided in reference [1]. It's worth mentioning that this threshold does not exert any significant influence on the overall conclusions.

      Thank you for your valuable feedback.

      [1] Inoue, Y., Mao, H., Suway, S.B., Orellana, J. and Schwartz, A.B., 2018. Decoding arm speed during reaching. Nature communications, 9(1), p.5243.

      Q7: “I am afraid I may be missing something, as I did not understand the fano factor analysis of Figure 3. In a sense the behaviourally relevant signals must have lower FF given they are in effect tied to the temporally smooth (and consistent on average across trials) behavioural covariates. The point of the original Churchland paper was to show that producing a behaviour squelches the variance; naturally these must appear in the behaviourally relevant components. A control distribution or reference of some type would possibly help here.”

      We agree that including reference signals could provide more context. The Churchland paper said stimulus onset can lead to a reduction in neural variability. However, our experiment focuses specifically on the reaching process, and thus, we don't have comparative experiments involving different types of signals.

      Thank you for your valuable feedback.

      Q8: “The authors compare the method to LFADS. While this is a reasonable benchmark as a prominent method in the field, LFADS does not attempt to solve the same problem as d-VAE. A better and much more fair comparison would be TNDM [4], an extension of LFADS which is designed to identify behaviourally relevant dimensions.”

      We have added the comparison experiments with TNDM in the revised manuscript (see Fig. 2 and Fig. S2). The details of model hyperparameters and training settings can be found in the Methods section in the revised manuscripts.

      Thank you for your valuable feedback.

      Reviewer #3

      Q1.1: “TNDM: LFADS is not the best baseline for comparison. The authors should have compared with TNDM (Hurwitz et al. 2021), which is an extension of LFADS that (unlike LFADS) actually attempts to extract behaviorally relevant factors by adding a behavior term to the loss. The code for TNDM is also available on Github. LFADS is not even supervised by behavior and does not aim to address the problem that d-VAE aims to address, so it is not the most appropriate comparison. ”

      We have added the comparison experiments with TNDM in the revised manuscript (see Fig. 2 and Fig. S2). The details of model hyperparameters and training settings can be found in the Methods section in the revised manuscripts.

      Thank you for your valuable feedback.

      Q1.2: “LFADS: LFADS is a sequential autoencoder that processes sections of data (e.g. trials). No explanation is given in Methods for how the data was passed to LFADS. Was the moving averaged smoothed data passed to LFADS or the raw spiking data (at what bin size)? Was a gaussian loss used or a poisson loss? What are the trial lengths used in each dataset, from which part of trials? For dataset C that has back-to-back reaches, was data chopped into segments? How long were these segments? Were the edges of segments overlapped and averaged as in (Keshtkaran et al. 2022) to avoid noisy segment edges or not? These are all critical details that are not explained. The same details would also be needed for a TNDM comparison (comment 1.1) since it has largely the same architecture as LFADS.

      It is also critical to briefly discuss these fundamental differences between the inputs of methods in the main text. LFADS uses a segment of data whereas VAE methods just use one sample at a time. What does this imply in the results? I guess as long as VAEs outperform LFADS it is ok, but if LFADS outperforms VAEs in a given metric, could it be because it received more data as input (a whole segment)? Why was the factor dimension set to 50? I presume it was to match the latent dimension of the VAE methods, but is the LFADS factor dimension the correct match for that to make things comparable?

      I am also surprised by the results. How do the authors justify LFADS having lower neural similarity (fig 2d) than VAE methods that operate on single time steps? LFADS is not supervised by behavior, so of course I don't expect it to necessarily outperform methods on behavior decoding. But all LFADS aims to do is to reconstruct the neural data so at least in this metric it should be able to outperform VAEs that just operate on single time steps? Is it because LFADS smooths the data too much? This is important to discuss and show examples of. These are all critical nuances that need to be discussed to validate the results and interpret them.”

      Regarding “Was the moving averaged smoothed data passed to LFADS or the raw spiking data (at what bin size)? Was a gaussian loss used or a poisson loss?”: The data used by all models was applied to the same preprocessing procedure. That is, using moving averaged smoothed data with three bins, where the bin size is 100ms. For all models except PSID, we used a Poisson loss.

      Regrading “What are the trial lengths used in each dataset, from which part of trials? For dataset C that has back-to-back reaches, was data chopped into segments? How long were these segments? Were the edges of segments overlapped and averaged as in (Keshtkaran et al. 2022) to avoid noisy segment edges or not?”:

      For datasets A and B, a trial length of eighteen is set. Trials with lengths below the threshold are zero-padded, while trials exceeding the threshold are truncated to the threshold length from their starting point. In dataset A, there are several trials with lengths considerably longer than that of most trials. We found that padding all trials with zeros to reach the maximum length (32) led to poor performance. Consequently, we chose a trial length of eighteen, effectively encompassing the durations of most trials and leading to the removal of approximately 9% of samples. For dataset B (center-out), the trial lengths are relatively consistent with small variation, and the maximum length across all trials is eighteen. For dataset C, we set the trial length as ten because we observed the video of this paradigm and found that the time for completing a single trial was approximately one second. The segments are not overlapped.

      Regarding “Why was the factor dimension set to 50? I presume it was to match the latent dimension of the VAE methods, but is the LFADS factor dimension the correct match for that to make things comparable?”: We performed a grid search for latent dimensions in {10,20,50} and found 50 is the best.

      Regarding “I am also surprised by the results. How do the authors justify LFADS having lower neural similarity (fig 2d) than VAE methods that operate on single time steps? LFADS is not supervised by behavior, so of course I don't expect it to necessarily outperform methods on behavior decoding. But all LFADS aims to do is to reconstruct the neural data so at least in this metric it should be able to outperform VAEs that just operate on single time steps? Is it because LFADS smooths the data too much?”: As you pointed out, we found that LFADS tends to produce excessively smooth and consistent data, which can lead to a reduction in neural similarity.

      Thank you for your valuable feedback.

      Q1.3: “PSID: PSID is linear and uses past input samples to predict the next sample in the output. Again, some setup choices are not well justified, and some details are left out in the 1-line explanation given in Methods.

      Why was a latent dimension of 6 chosen? Is this the behaviorally relevant latent dimension or the total latent dimension (for the use case here it would make sense to set all latent states to be behaviorally relevant)? Why was a horizon hyperparameter of 3 chosen? First, it is important to mention fundamental parameters such as latent dimension for each method in the main text (not just in methods) to make the results interpretable. Second, these hyperparameters should be chosen with a grid search in each dataset (within the training data, based on performance on the validation part of the training data), just as the authors do for their method (line 779). Given that PSID isn't a deep learning method, doing a thorough grid search in each fold should be quite feasible. It is important that high values for latent dimension and a wider range of other hyperparmeters are included in the search, because based on how well the residuals (x_i) for this method are shown predict behavior in Fig 2, the method seems to not have been used appropriately. I would expect ANN to improve decoding for PSID versus its KF decoding since PSID is fully linear, but I don't expect KF to be able to decode so well using the residuals of PSID if the method is used correctly to extract all behaviorally relevant information from neural data. The low neural reconstruction in Fid 2d could also partly be due to using too small of a latent dimension.

      Again, another import nuance is the input to this method and how differs with the input to VAE methods. The learned PSID model is a filter that operates on all past samples of input to predict the output in the "next" time step. To enable a fair comparison with VAE methods, the authors should make sure that the last sample "seen" by PSID is the same as then input sample seen by VAE methods. This is absolutely critical given how large the time steps are, otherwise PSID might underperform simply because it stopped receiving input 300ms earlier than the input received by VAE methods. To fix this, I think the authors can just shift the training and testing neural time series of PSID by 1 sample into the past (relative to the behavior), so that PSID's input would include the input of VAE methods. Otherwise, VAEs outperforming PSID is confounded by PSID's input not including the time step that was provided to VAE.”

      Thanks for your suggestions for letting PSID see the current neural observations. We did it per your suggestions and then performed a grid search for the hyperparameters for PSID. Specifically, we performed a grid search for the horizon hyperparameter in {2,3,4,5,6,7}. Since the relevant latent dimension should be lower than the horizon times the dimension of behavior variables (two-dimensional velocity in this paper) and increasing the dimension will reach performance saturation, we directly set the relevant latent dimensions as the maximum. The horizon number of datasets A, B, C, and synthetic datasets is 7, 6, 6 and 5, respectively.

      And thus the latent dimension of datasets A, B, and C and the synthetic dataset is 14, 12, 12 and 10, respectively.

      Our experiments show that KF can decode information from irrelevant signals obtained by PSID. Although PSID extracts the linear part of raw signals, KF can still use the linear part of the residuals for decoding. The low reconstruction performance of PSID may be because the relationship between latent variables and neural signals is linear, and the relationship between latent variables and behaviors is also linear; this is equivalent to the linear relationship between behaviors and neural signals, and linear models can only explain a small fraction of neural signals.

      Thank you for your valuable feedback.

      Q1.4: “CEBRA: results for CEBRA are incomplete. Similarity to raw signals is not shown. Decoding of behaviorally irrelevant residuals for CEBRA is not shown. Per Fig. S2, CEBRA does better or similar ANN decoding in datasets A and C, is only slightly worse in Dataset B, so it is important to show the other key metrics otherwise it is unclear whether d-VAE has some tangible advantage over CEBRA in those 2 datasets or if they are similar in every metric. Finally, it would be better if the authors show the results for CEBRA on Fig. 2, just as is done for other methods because otherwise it is hard to compare all methods.”

      CEBRA is a non-generative model, this model cannot generate behaviorally-relevant signals. Therefore, we only compared the decoding performance of latent embeddings of CEBRA and signals of d-VAE.

      Thank you for your valuable feedback.

      Q2: “Given the fact that d-VAE infers the latent (z) based on the population activity (x), claims about properties of the inferred behaviorally relevant signals (x_r) that attribute properties to individual neurons are confounded.

      The authors contrast their approach to population level approaches in that it infers behaviorally relevant signals for individual neurons. However, d-VAE is also a population method as it aggregates population information to infer the latent (z), from which behaviorally relevant part of the activity of each neuron (x_r) is inferred. The authors note this population level aggregation of information as a benefit of d-VAE, but only acknowledge it as a confound briefly in the context of one of their analyses (line 340): "The first is that the larger R2 neurons leak their information to the smaller R2 neurons, causing them contain too much behavioral information". They go on to dismiss this confounding possibility by showing that the inferred behaviorally relevant signal of each neuron is often most similar to its own raw signals (line 348-352) compared with all other neurons. They also provide another argument specific to that result section (i.e., residuals are not very behavior predictive), which is not general so I won't discuss it in depth here. These arguments however do not change the basic fact that d-VAE aggregates information from other neurons when extracting the behaviorally relevant activity of any given neuron, something that the authors note as a benefit of d-VAE in many instances. The fact that d-VAE aggregates population level info to give the inferred behaviorally relevant signal for each neuron confounds several key conclusions. For example, because information is aggregated across neurons, when trial to trial variability looks smoother after applying d-VAE (Fig 3i), or reveals better cosine tuning (Fig 3b), or when neurons that were not very predictive of behavior become more predictive of behavior (Fig 5), one cannot really attribute the new smoother single trial activity or the improved decoding to the same single neurons; rather these new signals/performances include information from other neurons. Unless the connections of the encoder network (z=f(x)) is zero for all other neurons, one cannot claim that the inferred rates for the neuron are truly solely associated with that neuron. I believe this a fundamental property of a population level VAE, and simply makes the architecture unsuitable for claims regarding inherent properties of single neurons. This confound is partly why the first claim in the abstract are not supported by data: observing that neurons that don't predict behavior very well would predict it much better after applying d-VAE does not prove that these neurons themselves "encode rich[er] behavioral information in complex nonlinear ways" (i.e., the first conclusion highlighted in the abstract) because information was also aggregated from other neurons. The other reason why this claim is not supported by data is the characterization of the encoding for smaller R2 neurons as "complex nonlinear", which the method is not well equipped to tease apart from linear mappings as I explain in my comment 3.”

      We acknowledge that we cannot obtain the exact single neuronal activity that does not contain any information from other neurons. However, we believe our model can extract accurate approximation signals of the ground truth relevant signals. These signals preserve the inherent properties of single neuronal activity to some extent and can be used for analysis at the single-neuron level.

      We believe d-VAE is a reasonable approach to extract effective relevant signals that preserve inherent properties of single neuronal activity for four key reasons:

      1) d-VAE is a latent variable model that adheres to the neural population doctrine. The neural population doctrine posits that information is encoded within interconnected groups of neurons, with the existence of latent variables (neural modes) responsible for generating observable neuronal activity [1, 2]. If we can perfectly obtain the true generative model from latent variables to neuronal activity, then we can generate the activity of each neuron from hidden variables without containing any information from other neurons. However, without a complete understanding of the brain’s encoding strategies (or generative model), we can only get the approximation signals of the ground truth signals.

      2) After the generative model is established, we need to infer the parameters of the generative model and the distribution of latent variables. During the inference process, inference algorithms such as variational inference or EM algorithms will be used. Generally, the obtained latent variables are also approximations of the real latent variables. When inferring the latent variables, it is inevitable to aggregation the information of the neural population, and latent variables are derived through weighted combinations of neuronal populations [3].

      This inference process is consistent with that of d-VAE (or VAE-based models).

      3) Latent variables are derived from raw neural signals and used to explain raw neural signals. Considering the unknown ground truth of latent variables and behaviorally-relevant signals, it becomes evident that the only reliable reference at the signal level is the raw signals. A crucial criterion for evaluating the reliability of latent variable models (including latent variables and generated relevant signals) is their capability to effectively explain the raw signals [3]. Consequently, we firmly maintain the belief that if the generated signals closely resemble the raw signals to the greatest extent possible, in accordance with an equivalence principle, we can claim that these obtained signals faithfully retain the inherent properties of single neurons. d-VAE explicitly constrains the generated signal to closely resemble the raw signals. These results demonstrate that d-VAE can extract effective relevant signals that preserve inherent properties of single neuronal activity.

      Based on the above reasons, we hold that generating single neuronal activities with the VAE framework is a reasonable approach. The remaining question is whether our model can obtain accurate relevant signals in the absence of ground truth. To our knowledge, in cases where the ground truth of relevant signals is unknown, there are typically two approaches to verifying the reliability of extracted signals:

      1) Conducting synthetic experiments where the ground truth is known.

      2) Validation based on expert knowledge (Three criteria were proposed in this paper). Both our extracted signals and key conclusions have been validated using these two approaches.

      Next, we will provide a detailed response to the concerns regarding our first key conclusion that smaller R2 neurons encode rich information.

      We acknowledge that larger R2 neurons play a role in aiding the reconstruction of signals in smaller R2 neurons through their neural activity. However, considering that neurons are correlated rather than independent entities, we maintain the belief that larger R2 neurons assist damaged smaller R2 neurons in restoring their original appearance. Taking image denoising as an example, when restoring noisy pixels to their original appearance, relying solely on the noisy pixels themselves is often impractical. Assistance from their correlated, clean neighboring pixels becomes necessary.

      The case we need to be cautious of is that the larger R2 neurons introduce additional signals (m) that contain substantial information to smaller R2 neurons, which they do not inherently possess. We believe this case does not hold for two reasons. Firstly, logically, adding extra signals decreases the reconstruction performance, and the information carried by these additional signals is redundant for larger R2 neurons, thus they do not introduce new information that can enhance the decoding performance of the neural population. Therefore, it seems unlikely and unnecessary for neural networks to engage in such counterproductive actions. Secondly, even if this occurs, our second criterion can effectively exclude the selection of these signals. To clarify, if we assume that x, y, and z denote the raw, relevant, and irrelevant signals of smaller R2 neurons, with x=y+z, and the extracted relevant signals become y+m, the irrelevant signals become z-m in this case. Consequently, the irrelevant signals contain a significant amount of information. It's essential to emphasize that this criterion holds significant importance in excluding undesirable signals.

      Furthermore, we conducted a synthetic experiment to show that d-VAE can indeed restore the damaged information of smaller R2 neurons with the help of larger R2 neurons, and the restored neuronal activities are more similar to ground truth compared to damaged raw signals. Please see Fig. S11a,b for details.

      Thank you for your valuable feedback.

      [1] Saxena, S. and Cunningham, J.P., 2019. Towards the neural population doctrine. Current opinion in neurobiology, 55, pp.103-111.

      [2] Gallego, J.A., Perich, M.G., Miller, L.E. and Solla, S.A., 2017. Neural manifolds for the control of movement. Neuron, 94(5), pp.978-984.

      [3] Cunningham, J.P. and Yu, B.M., 2014. Dimensionality reduction for large-scale neural recordings. Nature neuroscience, 17(11), pp.1500-1509.

      Q3: “Given the nonlinear architecture of the VAE, claims about the linearity or nonlinearity of cortical readout are confounded and not supported by the results.

      The inference of behaviorally relevant signals from raw signals is a nonlinear operation, that is x_r=g(f(x)) is nonlinear function of x. So even when a linear KF is used to decode behavior from the inferred behaviorally relevant signals, the overall decoding from raw signals to predicted behavior (i.e., KF applied to g(f(x))) is nonlinear. Thus, the result that decoding of behavior from inferred behaviorally relevant signals (x_r) using a linear KF and a nonlinear ANN reaches similar accuracy (Fig 2), does not suggest that a "linear readout is performed in the motor cortex", as the authors claim (line 471). The authors acknowledge this confound (line 472) but fail to address it adequately. They perform a simulation analysis where the decoding gap between KF and ANN remains unchanged even when d-VAE is used to infer behaviorally relevant signals in the simulation. However, this analysis is not enough for "eliminating the doubt" regarding the confound. I'm sure the authors can also design simulations where the opposite happens and just like in the data, d-VAE can improve linear decoding to match ANN decoding. An adequate way to address this concern would be to use a fully linear version of the autoencoder where the f(.) and g(.) mappings are fully linear. They can simply replace these two networks in their model with affine mappings, redo the modeling and see if the model still helps the KF decoding accuracy reach that of the ANN decoding. In such a scenario, because the overall KF decoding from original raw signals to predicted behavior (linear d-VAE + KF) is linear, then they could move toward the claim that the readout is linear. Even though such a conclusion would still be impaired by the nonlinear reference (d-VAE + ANN decoding) because the achieved nonlinear decoding performance could always be limited by network design and fitting issues. Overall, the third conclusion highlighted in the abstract is a very difficult claim to prove and is unfortunately not supported by the results.”

      We aim to explore the readout mechanism of behaviorally-relevant signals, rather than raw signals. Theoretically, the process of removing irrelevant signals should not be considered part of the inherent decoding mechanisms of the relevant signals. Assuming that the relevant signals we extracted are accurate, the conclusion of linear readout is established. On the synthetic data where the ground truth is known, our distilled signals show a significant improvement in neural similarity to the ground truth when compared to raw signals (refer to Fig. S2l). This observation demonstrates that our distilled signals are accurate approximations of the ground truth. Furthermore, on the three widely-used real datasets, our distilled signals meet the stringent criteria we have proposed (see Fig. 2), also providing strong evidence for their accuracy.

      Regarding the assertion that we could create simulations in which d-VAE can make signals that are inherently nonlinearly decodable into linearly decodable ones: In reality, we cannot achieve this, as the second criterion can rule out the selection of such signals. Specifically,z=x+y=n^2+y, where z, x, y, and n denote raw signals, relevant signals, irrelevant signals and latent variables. If the relevant signals obtained by d-VAE are n, then these signals can be linear decoded accurately. However, the corresponding irrelevant signals are n^2-n+z; thus, irrelevant signals will have much information, and these extracted relevant signals will not be selected. Furthermore, our synthetic experiments offer additional evidence supporting the conclusion that d-VAE does not make inherently nonlinearly decodable signals become linearly decodable ones. As depicted in Fig. S11c, there exists a significant performance gap between KF and ANN when decoding the ground truth signals of smaller R2 neurons. KF exhibits notably low performance, leaving substantial room for compensation by d-VAE. However, following processing by d-VAE, KF's performance of distilled signals fails to surpass its already low ground truth performance and remains significantly inferior to ANN's performance. These results collectively confirm that our approach does not convert signals that are inherently nonlinearly decodable into linearly decodable ones, and the conclusion of linear readout is not a by-product by d-VAE.

      Regarding the suggestion of using linear d-VAE + KF, as discussed in the Discussion section, removing the irrelevant signals requires a nonlinear operation, and linear d-VAE can not effectively separate relevant and irrelevant signals.

      Thank you for your valuable feedback.

      Q4: “The authors interpret several results as indications that "behavioral information is distributed in a higher-dimensional subspace than expected from raw signals", which is the second main conclusion highlighted in the abstract. However, several of these arguments do not convincingly support that conclusion.

      4.1) The authors observe that behaviorally relevant signals for neurons with small principal components (referred to as secondary) have worse decoding with KF but better decoding with ANN (Fig. 6b,e), which also outperforms ANN decoding from raw signals. This observation is taken to suggest that these secondary behaviorally relevant signals encode behavior information in highly nonlinear ways and in a higher dimensions neural space than expected (lines 424 and 428). These conclusions however are confounded by the fact that A) d-VAE uses nonlinear encoding, so one cannot conclude from ANN outperforming KF that behavior is encoded nonlinearly in the motor cortex (see comment 3 above), and B) d-VAE aggregates information across the population so one cannot conclude that these secondary neurons themselves had as much behavior information (see comment 2 above).

      4.2) The authors observe that the addition of the inferred behaviorally relevant signals for neurons with small principal components (referred to as secondary) improves the decoding of KF more than it improves the decoding of ANN (red curves in Fig 6c,f). This again is interpreted similarly as in 4.1, and is confounded for similar reasons (line 439): "These results demonstrate that irrelevant signals conceal the smaller variance PC signals, making their encoded information difficult to be linearly decoded, suggesting that behavioral information exists in a higher-dimensional subspace than anticipated from raw signals". This is confounded by because of the two reasons explained in 4.1. To conclude nonlinear encoding based on the difference in KF and ANN decoding, the authors would need to make the encoding/decoding in their VAE linear to have a fully linear decoder on one hand (with linear d-VAE + KF) and a nonlinear decoder on the other hand (with linear d-VAE + ANN), as explained in comment 3.

      4.3) From S Fig 8, where the authors compare cumulative variance of PCs for raw and inferred behaviorally relevant signals, the authors conclude that (line 554): "behaviorally-irrelevant signals can cause an overestimation of the neural dimensionality of behaviorally-relevant responses (Supplementary Fig. S8)." However, this analysis does not really say anything about overestimation of "behaviorally relevant" neural dimensionality since the comparison is done with the dimensionality of "raw" signals. The next sentence is ok though: "These findings highlight the need to filter out relevant signals when estimating the neural dimensionality.", because they use the phrase "neural dimensionality" not "neural dimensionality of behaviorally-relevant responses".”

      Questions 4.1 and 4.2 are a combination of Q2 and Q3. Please refer to our responses to Q2 and Q3.

      Regarding question 4.3 about “behaviorally-irrelevant signals can cause an overestimation of the neural dimensionality of behaviorally-relevant responses”: Previous studies usually used raw signals to estimate the neural dimensionality of specific behaviors. We mean that using raw signals, which include many irrelevant signals, will cause an overestimation of the neural dimensionality. We have modified this sentence in the revised manuscripts.

      Thank you for your valuable feedback.

      Q5: “Imprecise use of language in many places leads to inaccurate statements. I will list some of these statements”

      5.1) In the abstract: "One solution is to accurately separate behaviorally-relevant and irrelevant signals, but this approach remains elusive due to the unknown ground truth of behaviorally-relevant signals". This statement is not accurate because it implies no prior work does this. The authors should make their statement more specific and also refer to some goal that existing linear (e.g., PSID) and nonlinear (e.g., TNDM) methods for extracting behaviorally relevant signals fail to achieve.

      5.2) In the abstract: "we found neural responses previously considered useless encode rich behavioral information" => what does "useless" mean operationally? Low behavior tuning? More precise use of language would be better.

      5.3) "... recent studies (Glaser 58 et al., 2020; Willsey et al., 2022) demonstrate nonlinear readout outperforms linear readout." => do these studies show that nonlinear "readout" outperforms linear "readout", or just that nonlinear models outperform linear models?

      5.4) Line 144: "The first criterion is that the decoding performance of the behaviorally-relevant signals (red bar, Fig.1) should surpass that of raw signals (the red dotted line, Fig.1).". Do the authors mean linear decoding here or decoding in general? If the latter, how can something extracted from neural surpass decoding of neural data, when the extraction itself can be thought of as part of decoding? The operational definition for this "decoding performance" should be clarified.

      5.5) Line 311: "we found that the dimensionality of primary subspace of raw signals (26, 64, and 45 for datasets A, B, and C) is significantly higher than that of behaviorally-relevant signals (7, 13, and 9), indicating that behaviorally-irrelevant signals lead to an overestimation of the neural dimensionality of behaviorally-relevant signals." => here the dimensionality of the total PC space (i.e., primary subspace of raw signals) is being compared with that of inferred behaviorally-relevant signals, so the former being higher does not indicate that neural dimensionality of behaviorally-relevant signals was overestimated. The former is simply not behavioral so this conclusion is not accurate.

      5.6) Section "Distilled behaviorally-relevant signals uncover that smaller R2 neurons encode rich behavioral information in complex nonlinear ways". Based on what kind of R2 are the neurons grouped? Behavior decoding R2 from raw signals? Using what mapping? Using KF? If KF is used, the result that small R2 neurons benefit a lot from d-VAE could be somewhat expected, given the nonlinearity of d-VAE: because only ANN would have the capacity to unwrap the nonlinear encoding of d-VAE as needed. If decoding performance that is used to group neurons is based on data, regression to the mean could also partially explain the result: the neurons with worst raw decoding are most likely to benefit from a change in decoder, than neurons that already had good decoding. In any case, the R2 used to partition and sort neurons should be more clearly stated and reminded throughout the text and I Fig 3.

      5.7) Line 346 "...it is impossible for our model to add the activity of larger R2 neurons to that of smaller R2 neurons" => Is it really impossible? The optimization can definitely add small-scale copies of behaviorally relevant information to all neurons with minimal increase in the overall optimization loss, so this statement seems inaccurate.

      5.8) Line 490: "we found that linear decoders can achieve comparable performance to that of nonlinear decoders, providing compelling evidence for the presence of linear readout in the motor cortex." => inaccurate because no d-VAE decoding is really linear, as explained in comment 3 above.

      5.9) Line 578: ". However, our results challenge this idea by showing that signals composed of smaller variance PCs nonlinearly encode a significant amount of behavioral information." => inaccurate as results are confounded by nonlinearity of d-VAE as explained in comment 3 above.

      5.10) Line 592: "By filtering out behaviorally-irrelevant signals, our study found that accurate decoding performance can be achieved through linear readout, suggesting that the motor cortex may perform linear readout to generate movement behaviors." => inaccurate because it us confounded by the nonlinearity of d-VAE as explained in comment 3 above.”

      Regarding “5.1) In the abstract: "One solution is to accurately separate behaviorally-relevant and irrelevant signals, but this approach remains elusive due to the unknown ground truth of behaviorally-relevant signals". This statement is not accurate because it implies no prior work does this. The authors should make their statement more specific and also refer to some goal that existing linear (e.g., PSID) and nonlinear (e.g., TNDM) methods for extracting behaviorally relevant signals fail to achieve”:

      We believe our statement is accurate. Our primary objective is to extract accurate behaviorally-relevant signals that closely approximate the ground truth relevant signals. To achieve this, we strike a balance between the reconstruction and decoding performance of the generated signals, aiming to effectively capture the relevant signals. This crucial aspect of our approach sets it apart from other methods. In contrast, other methods tend to emphasize the extraction of valuable latent neural dynamics. We have provided elaboration on the distinctions between d-VAE and other approaches in the Introduction and Discussion sections.

      Thank you for your valuable feedback.

      Regarding “5.2) In the abstract: "we found neural responses previously considered useless encode rich behavioral information" => what does "useless" mean operationally? Low behavior tuning? More precise use of language would be better.”:

      In the analysis of neural signals, smaller variance PC signals are typically seen as noise and are often discarded. Similarly, smaller R2 neurons are commonly thought to be dominated by noise and are not further analyzed. Given these considerations, we believe that the term "considered useless" is appropriate in this context. Thank you for your valuable feedback.

      Regarding “5.3) "... recent studies (Glaser 58 et al., 2020; Willsey et al., 2022) demonstrate nonlinear readout outperforms linear readout." => do these studies show that nonlinear "readout" outperforms linear "readout", or just that nonlinear models outperform linear models?”:

      In this paper, we consider the two statements to be equivalent. Thank you for your valuable feedback.

      Regarding “5.4) Line 144: "The first criterion is that the decoding performance of the behaviorally-relevant signals (red bar, Fig.1) should surpass that of raw signals (the red dotted line, Fig.1).". Do the authors mean linear decoding here or decoding in general? If the latter, how can something extracted from neural surpass decoding of neural data, when the extraction itself can be thought of as part of decoding? The operational definition for this "decoding performance" should be clarified.”:

      We mean the latter, as we said in the section “Framework for defining, extracting, and separating behaviorally-relevant signals”, since raw signals contain too many behaviorally-irrelevant signals, deep neural networks are more prone to overfit raw signals than relevant signals. Therefore the decoding performance of relevant signals should surpass that of raw signals. Thank you for your valuable feedback.

      Regarding “5.5) Line 311: "we found that the dimensionality of primary subspace of raw signals (26, 64, and 45 for datasets A, B, and C) is significantly higher than that of behaviorally-relevant signals (7, 13, and 9), indicating that behaviorally-irrelevant signals lead to an overestimation of the neural dimensionality of behaviorally-relevant signals." => here the dimensionality of the total PC space (i.e., primary subspace of raw signals) is being compared with that of inferred behaviorally-relevant signals, so the former being higher does not indicate that neural dimensionality of behaviorally-relevant signals was overestimated. The former is simply not behavioral so this conclusion is not accurate.”: In practice, researchers usually used raw signals to estimate the neural dimensionality. We mean that using raw signals to do this would overestimate the neural dimensionality. Thank you for your valuable feedback.

      Regarding “5.6) Section "Distilled behaviorally-relevant signals uncover that smaller R2 neurons encode rich behavioral information in complex nonlinear ways". Based on what kind of R2 are the neurons grouped? Behavior decoding R2 from raw signals? Using what mapping? Using KF? If KF is used, the result that small R2 neurons benefit a lot from d-VAE could be somewhat expected, given the nonlinearity of d-VAE: because only ANN would have the capacity to unwrap the nonlinear encoding of d-VAE as needed. If decoding performance that is used to group neurons is based on data, regression to the mean could also partially explain the result: the neurons with worst raw decoding are most likely to benefit from a change in decoder, than neurons that already had good decoding. In any case, the R2 used to partition and sort neurons should be more clearly stated and reminded throughout the text and I Fig 3.”:

      When employing R2 to characterize neurons, it indicates the extent to which neuronal activity is explained by the linear encoding model [1-3]. Smaller R2 neurons have a lower capacity for linearly tuning (encoding) behaviors, while larger R2 neurons have a higher capacity for linearly tuning (encoding) behaviors. Specifically, the approach involves first establishing an encoding relationship from velocity to neural signal using a linear model, i.e., y=f(x), where f represents a linear regression model, x denotes velocity, and y denotes the neural signal. Subsequently, R2 is utilized to quantify the effectiveness of the linear encoding model in explaining neural activity. We have provided a comprehensive explanation in the revised manuscript. Thank you for your valuable feedback.

      [1] Collinger, J.L., Wodlinger, B., Downey, J.E., Wang, W., Tyler-Kabara, E.C., Weber, D.J., McMorland, A.J., Velliste, M., Boninger, M.L. and Schwartz, A.B., 2013. High-performance neuroprosthetic control by an individual with tetraplegia. The Lancet, 381(9866), pp.557-564.

      [2] Wodlinger, B., et al. "Ten-dimensional anthropomorphic arm control in a human brain− machine interface: difficulties, solutions, and limitations." Journal of neural engineering 12.1 (2014): 016011.

      [3] Inoue, Y., Mao, H., Suway, S.B., Orellana, J. and Schwartz, A.B., 2018. Decoding arm speed during reaching. Nature communications, 9(1), p.5243.

      Regarding Questions 5.7, 5.8, 5.9, and 5.10:

      We believe our conclusions are solid. The reasons can be found in our replies in Q2 and Q3. Thank you for your valuable feedback.

      Q6: “Imprecise use of language also sometimes is not inaccurate but just makes the text hard to follow.

      6.1) Line 41: "about neural encoding and decoding mechanisms" => what is the definition of encoding/decoding and how do these differ? The definitions given much later in line 77-79 is also not clear.

      6.2) Line 323: remind the reader about what R2 is being discussed, e.g., R2 of decoding behavior using KF. It is critical to know if linear or nonlinear decoding is being discussed.

      6.3) Line 488: "we found that neural responses previously considered trivial encode rich behavioral information in complex nonlinear ways" => "trivial" in what sense? These phrases would benefit from more precision, for example: "neurons that may seem to have little or no behavior information encoded". The same imprecise word ("trivial") is also used in many other places, for example in the caption of Fig S9.

      6.4) Line 611: "The same should be true for the brain." => Too strong of a statement for an unsupported claim suggesting the brain does something along the lines of nonlin VAE + linear readout.

      6.5) In Fig 1, legend: what is the operational definition of "generating performance"? Generating what? Neural reconstruction?”

      Regarding “6.1) Line 41: "about neural encoding and decoding mechanisms" => what is the definition of encoding/decoding and how do these differ? The definitions given much later in line 77-79 is also not clear.”:

      We would like to provide a detailed explanation of neural encoding and decoding. Neural encoding means how neuronal activity encodes the behaviors, that is, y=f(x), where y denotes neural activity and, x denotes behaviors, f is the encoding model. Neural decoding means how the brain decodes behaviors from neural activity, that is, x=g(y), where g is the decoding model. For further elaboration, please refer to [1]. We have included references that discuss the concepts of encoding and decoding in the revised manuscript. Thank you for your valuable feedback.

      [1] Kriegeskorte, Nikolaus, and Pamela K. Douglas. "Interpreting encoding and decoding models." Current opinion in neurobiology 55 (2019): 167-179.

      Regarding “6.2) Line 323: remind the reader about what R2 is being discussed, e.g., R2 of decoding behavior using KF. It is critical to know if linear or nonlinear decoding is being discussed.”:

      This question is the same as Q5.6. Please refer to the response to Q5.6. Thank you for your valuable feedback.

      Regarding “6.3) Line 488: "we found that neural responses previously considered trivial encode rich behavioral information in complex nonlinear ways" => "trivial" in what sense? These phrases would benefit from more precision, for example: "neurons that may seem to have little or no behavior information encoded". The same imprecise word ("trivial") is also used in many other places, for example in the caption of Fig S9.”:

      We have revised this statement in the revised manuscript. Thanks for your recommendation.

      Regarding “6.4) Line 611: "The same should be true for the brain." => Too strong of a statement for an unsupported claim suggesting the brain does something along the lines of nonlin VAE + linear readout.”

      We mean that removing the interference of irrelevant signals and decoding the relevant signals should logically be two stages. We have revised this statement in the revised manuscript. Thank you for your valuable feedback.

      Regarding “6.5) In Fig 1, legend: what is the operational definition of "generating performance"? Generating what? Neural reconstruction?””:

      We have replaced “generating performance” with “reconstruction performance” in the revised manuscript. Thanks for your recommendation.

      Q7: “In the analysis presented starting in line 449, the authors compare improvement gained for decoding various speed ranges by adding secondary (small PC) neurons to the KF decoder (Fig S11). Why is this done using the KF decoder, when earlier results suggest an ANN decoder is needed for accurate decoding from these small PC neurons? It makes sense to use the more accurate nonlinear ANN decoder to support the fundamental claim made here, that smaller variance PCs are involved in regulating precise control”

      Because when the secondary signal is superimposed on the primary signal, the enhancement in KF performance is substantial. We wanted to explore in which aspect of the behavior the KF performance improvement is mainly reflected. In comparison, the improvement of ANN by the secondary signal is very small, rendering the exploration of the aforementioned questions inconsequential. Thank you for your valuable feedback.

      Q8: “A key limitation of the VAE architecture is that it doesn't aggregate information over multiple time samples. This may be why the authors decided to use a very large bin size of 100ms and beyond that smooth the data with a moving average. This limitation should be clearly stated somewhere in contrast with methods that can aggregate information over time (e.g., TNDM, LFADS, PSID) ”

      We have added this limitation in the Discussion in the revised manuscript. Thanks for your recommendation.

      Q9: “Fig 5c and parts of the text explore the decoding when some neurons are dropped. These results should come with a reminder that dropping neurons from behaviorally relevant signals is not technically possible since the extraction of behaviorally relevant signals with d-VAE is a population level aggregation that requires the raw signal from all neurons as an input. This is also important to remind in some places in the text for example:

      • Line 498: "...when one of the neurons is destroyed."

      • Line 572: "In contrast, our results show that decoders maintain high performance on distilled signals even when many neurons drop out."”

      We want to explore the robustness of real relevant signals in the face of neuron drop-out. The signals our model extracted are an approximation of the ground truth relevant signals and thus serve as a substitute for ground truth to study this problem. Thank you for your valuable feedback.

      Q10: “Besides the confounded conclusions regarding the readout being linear (see comment 3 and items related to it in comment 5), the authors also don't adequately discuss prior works that suggest nonlinearity helps decoding of behavior from the motor cortex. Around line 594, a few works are discussed as support for the idea of a linear readout. This should be accompanied by a discussion of works that support a nonlinear encoding of behavior in the motor cortex, for example (Naufel et al. 2019; Glaser et al. 2020), some of which the authors cite elsewhere but don't discuss here.”

      We have added this discussion in the revised manuscript. Thanks for your recommendation.

      Q11: “Selection of hyperparameters is not clearly explained. Starting line 791, the authors give some explanation for one hyperparameter, but not others. How are the other hyperparameters determined? What is the search space for the grid search of each hyperparameter? Importantly, if hyperparameters are determined only based on the training data of each fold, why is only one value given for the hyperparameter selected in each dataset (line 814)? Did all 5 folds for each dataset happen to select exactly the same hyperparameter based on their 5 different training/validation data splits? That seems unlikely.”

      We perform a grid search in {0.001, 0.01,0.1,1} for hyperparameter beta. And we found that 0.001 is the best for all datasets. As for the model parameters, such as hidden neuron numbers, this model capacity has reached saturation decoding performance and does not influence the results.

      Regarding “Importantly, if hyperparameters are determined only based on the training data of each fold, why is only one value given for the hyperparameter selected in each dataset (line 814)? Did all 5 folds for each dataset happen to select exactly the same hyperparameter based on their 5 different training/validation data splits”: We selected the hyperparameter based on the average performance of 5 folds data on validation sets. The selected value denotes the one that yields the highest average performance across the 5 folds data.

      Thank you for your valuable feedback.

      Q12: “d-VAE itself should also be explained more clearly in the main text. Currently, only the high-level idea of the objective is explained. The explanation should be more precise and include the idea of encoding to latent state, explain the relation to pip-VAE, explain inputs and outputs, linearity/nonlinearity of various mappings, etc. Also see comment 1 above, where I suggest adding more details about other methods in the main text.”

      Our primary objective is to delve into the encoding and decoding mechanisms using the separated relevant signals. Therefore, providing an excessive amount of model details could potentially distract from the main focus of the paper. In response to your suggestion, we have included a visual representation of d-VAE's structure, input, and output (see Fig. S1) in the revised manuscript, which offers a comprehensive and intuitive overview. Additionally, we have expanded on the details of d-VAE and other methods in the Methods section.

      Thank you for your valuable feedback.

      Q13: “In Fig 1f and g, shouldn't the performance plots be swapped? The current plots seem counterintuitive. If there is bias toward decoding (panel g), why is the irrelevant residual so good at decoding?”

      The placement of the performance plots in Fig. 1f and 1g is accurate. When the model exhibits a bias toward decoding, it prioritizes extracting the most relevant features (latent variables) for decoding purposes. As a consequence, the model predominantly generates signals that are closely associated with these extracted features. This selective signal extraction and generation process may result in the exclusion of other potentially useful information, which will be left in the residuals. To illustrate this concept, consider the example of face recognition: if a model can accurately identify an individual using only the person's eyes (assuming these are the most useful features), other valuable information, such as details of the nose or mouth, will be left in the residuals, which could also be used to identify the individual.

      Thank you for your valuable feedback.

    1. Author Response:

      The following is the authors’ response to the previous reviews.

      We carefully read through the second-round reviews and the additional reviews. To us, the review process is somewhat unusual and very much dominated by referee 2, who aggressively insists that we mixed up the trigeminal nucleus and inferior olive and that as a consequence our results are meaningless. We think the stance of referee 2 and the focus on one single issue (the alleged mix-up of trigeminal nucleus and inferior olive) is somewhat unfortunate, leaves out much of our findings and we debated at length on how to deal with further revisions. In the end, we decided to again give priority to addressing the criticism of referees 2, because it is hard to go on with a heavily attacked paper without resolving the matter at stake. The following is a summary of, what we did:

      Additional experimental work:

      (1) We checked if the peripherin-antibody indeed reliably identifies climbing fibers.

      To this end, we sectioned the elephant cerebellum and stained sections with the peripherin-antibody. We find: (i) the cerebellar white matter is strongly reactive for peripherin-antibodies, (ii) cerebellar peripherin-antibody staining of has an axonal appearance. (iii) Cerebellar Purkinje cell somata appear to be ensheated by peripherin-antibody staining. (iv) We observed that the peripherin-antibody reactivity gradually decreases from Purkinje cell somata to the pia in the cerebellar molecular layer. This work is shown in our revised Figure 2. All these four features align with the distribution of climbing fibers (which arrive through the white matter, are axons, ensheat Purkinje cell somata, and innervate Purkinje cell proximally not reaching the pia). In line with previous work, which showed similar cerebellar staining patterns in several species (Errante et al. 1998), we conclude that elephant climbing fibers are strongly reactive for peripherin-antibodies.

      (2) We delineated the elephant olivo-cerebellar tract.

      The strong peripherin-antibody reactivity of elephant climbing fibers enabled us to delineate the elephant olivo-cerebellar tract. We find the elephant olivo-cerebellar tract is a strongly peripherin-antibody reactive, well-delineated fiber tract several millimeters wide and about a centimeter in height. The unstained olivo-cerebellar tract has a greyish appearance. In the anterior regions of the olivo-cerebellar tract, we find that peripherin-antibody reactive fibers run in the dorsolateral brainstem and approach the cerebellar peduncle, where the tract gradually diminishes in size, presumably because climbing fibers discharge into the peduncle. Indeed, peripherin-antibody reactive fibers can be seen entering the cerebellar peduncle. Towards the posterior end of the peduncle, the olivo-cerebellar disappears (in the dorsal brainstem directly below the peduncle. We note that the olivo-cerebellar tract was referred to as the spinal trigeminal tract by Maseko et al. 2013. We think the tract in question cannot be the spinal trigeminal tract for two reasons: (i) This tract is the sole brainstem source of peripherin-positive climbing fibers entering the peduncle/ the cerebellum; this is the defining characteristic of the olivo-cerebellar tract. (ii) The tract in question is much smaller than the trigeminal nerve, disappears posterior to where the trigeminal nerve enters the brainstem (see below), and has no continuity with the trigeminal nerve; the continuity with the trigeminal nerve is the defining characteristic of the spinal trigeminal tract, however.

      The anterior regions of the elephant olivo-cerebellar tract are similar to the anterior regions of olivo-cerebellar tract of other mammals in its dorsolateral position and the relation to the cerebellar peduncle. In its more posterior parts, the elephant olivo-cerebellar tract continues for a long distance (~1.5 cm) in roughly the same dorsolateral position and enters the serrated nucleus that we previously identified as the elephant inferior olive. The more posterior parts of the elephant olivo-cerebellar tract therefore differ from the more posterior parts of the olivo-cerebellar tract of other mammals, which follows a ventromedial trajectory towards a ventromedially situated inferior olive. The implication of our delineation of the elephant olivo-cerebellar tract is that we correctly identified the elephant inferior olive.

      (3) An in-depth analysis of peripherin-antibody reactivity also indicates that the trigeminal nucleus receives no climbing fiber input.

      We also studied the peripherin-antibody reactivity in and around the trigeminal nucleus. We had also noted in the previous submission that the trigeminal nucleus is weakly positive for peripherin, but that the staining pattern is uniform and not the type of axon bundle pattern that is seen in the inferior olive of other mammals. To us, this observation already argued against the presence of climbing fibers in the trigeminal nucleus. We also noted that the myelin stripes of the trigeminal nucleus were peripherin-antibody-negative. In the context of our olivo-cerebellar tract tracing we now also scrutinized the surroundings of the trigeminal nucleus for peripherin-antibody reactivity. We find that the ventral brainstem surrounding the trigeminal nucleus is devoid of peripherin-antibody reactivity. Accordingly, no climbing fibers, (which we have shown to be strongly peripherin-antibody-positive, see our point 1) arrive at the trigeminal nucleus. The absence of climbing fiber input indicates that previous work that identified the (trigeminal) nucleus as the inferior olive (Maseko et al 2013) is unlikely to be correct.

      (4) We characterized the entry of the trigeminal nerve into the elephant brain.

      To better understand how trigeminal information enters the elephant’s brain, we characterized the entry of the trigeminal nerve. This analysis indicated to us that the trigeminal nerve is not continuous with the olivo-cerebellar tract (the spinal trigeminal tract of Maseko et al. 2013) as previously claimed by Maseko et al. 2013. We show some of this evidence in Referee-Figure 1 below. The reason we think the trigeminal nerve is discontinuous with the olivo-cerebellar tract is the size discrepancy between the two structures. We first show this for the tracing data of Maseko et al. 2013. In the Maseko et al. 2013 data the trigeminal nerve (Referee-Figure 1A, their plate Y) has 3-4 times the diameter of the olivocerebellar tract (the alleged spinal trigeminal tract, Referee-Figure 1B, their plate Z). Note that most if not all trigeminal fibers are thought to continue from the nerve into the trigeminal tract (see our rat data below). We plotted the diameter of the trigeminal nerve and diameter of the olivo-cerebellar (the spinal trigeminal tract according to Maseko et al. 2013) from the Maseko et al. 2013 data (Referee-Figure 1C) and we found that the olivocerebellar tract has a fairly consistent diameter (46 ± 9 mm2, mean ± SD). Statistical considerations and anatomical evidence suggest that the tracing of the trigeminal nerve into the olivo-cerebellar (the spinal trigeminal tract according to Maseko et al. 2013) is almost certainly wrong. The most anterior point of the alleged spinal trigeminal tract has a diameter of 51 mm2 which is more than 15 standard deviations different from the most posterior diameter (194 mm2) of the trigeminal tract. For this assignment to be correct three-quarters of trigeminal nerve fibers would have to spontaneously disappear, something that does not happen in the brain. We also made similar observations in the African elephant Bibi, where the trigeminal nerve (Referee-Figure 1D) is much larger in diameter than the olivocerebellar tract (Referee-Figure 1E). We could also show that the olivocerebellar tract disappears into the peduncle posterior to where the trigeminal nerve enters (Referee-Figure 1F). Our data are very similar to Maseko et al. indicating that their outlining of structures was done correctly. What appears to have been oversimplified, is the assignment of structures as continuous. We also quantified the diameter of the trigeminal nerve and the spinal trigeminal tract in rats (from the Paxinos & Watson atlas; Referee-Figure 1D); as expected we found the trigeminal nerve and spinal trigeminal tract diameters are essentially continuous.

      In our hands, the trigeminal nerve does not continue into a well-defined tract that could be traced after its entry. In this regard, it differs both from the olivo-cerebellar tract of the elephant or the spinal trigeminal tract of the rodent, both of which are well delineated. We think the absence of a well-delineated spinal trigeminal tract in elephants might have contributed to the putative tracing error highlighted in our Referee-Figure 1A-C.

      We conclude that a size mismatch indicates trigeminal fibers do not run in the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013).

      Author response image 1.

      The trigeminal nerve is discontinuous with the olivo-cerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013). A, Trigeminal nerve (orange) in the brain of African elephant LAX as delineated by Maseko et al. 2013 (coronal section; their plate Y). B, Most anterior appearance of the spinal trigeminal tract of Maseko et al. 2013 (blue; coronal section; their plate Z). Note the much smaller diameter of the spinal trigeminal tract compared to the trigeminal nerve shown in C, which argues against the continuity of the two structures. Indeed, our peripherin-antibody staining showed that the spinal trigeminal tract of Maseko corresponds to the olivo-cerebellar tract and is discontinuous with the trigeminal nerve. C, Plot of the trigeminal nerve and olivo-cerebellar tracts (the spinal trigeminal tract according to Maseko et al. 2013) diameter along the anterior-posterior axis. The trigeminal nerve is much larger in diameter than the olivocerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013). C, D measurements, for which sections are shown in panels C and D respectively. The olivocerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013) has a consistent diameter; data replotted from Maseko et al. 2013. At mm 25 the inferior olive appears. D, Trigeminal nerve entry in the brain of African elephant Bibi; our data, coronal section, the trigeminal nerve is outlined in orange, note the large diameter. E, Most anterior appearance of the olivo-cerebellar tract in the brain of African elephant Bibi; our data, coronal section, approximately 3 mm posterior to the section shown in A, the olivocerebellar tract is outlined in blue. Note the smaller diameter of the olivo-cerebellar tract compared to the trigeminal nerve, which argues against the continuity of the two structures. F, Plot of the trigeminal nerve and olivo-cerebellar tract diameter along the anterior-posterior axis. The nerve and olivo-cerebellar tract are discontinuous and the trigeminal nerve is much larger in diameter than the olivocerebellar tract (the spinal trigeminal tract according to Maseko et al. 2013); our data. D, E measurements, for which sections are shown in panels D and E respectively. At mm 27 the inferior olive appears. G, In the rat the trigeminal nerve is continuous in size with the spinal trigeminal tract. Data replotted from Paxinos and Watson.

      Reviewer 2 (Public Review):

      As indicated in my previous review of this manuscript (see above), it is my opinion that the authors have misidentified, and indeed switched, the inferior olivary nuclear complex (IO) and the trigeminal nuclear complex (Vsens). It is this specific point only that I will address in this second review, as this is the crucial aspect of this paper - if the identification of these nuclear complexes in the elephant brainstem by the authors is incorrect, the remainder of the paper does not have any scientific validity.

      Comment: We agree with the referee that it is most important to sort out, the inferior olivary nuclear complex (IO) and the trigeminal nuclear complex, respectively.Change: We did additional experimental work to resolve this matter as detailed at the beginning of our response. Specifically, we ascertained that elephant climbing fibers are strongly peripherin-positive. Based on elephant climbing fiber peripherin-reactivity we delineated the elephant olivo-cerebellar tract. We find that the olivo-cerebellar connects to the structure we refer to as inferior olive to the cerebellum (the referee refers to this structure as the trigeminal nuclear complex). We also found that the trigeminal nucleus (the structure the referee refers to as inferior olive) appears to receive no climbing fibers. We provide indications that the tracing of the trigeminal nerve into the olivo-cerebellar tract by Maseko et al. 2023 was erroneous (Author response image 1). These novel findings support our ideas but are very difficult to reconcile with the referee’s partitioning scheme.

      The authors, in their response to my initial review, claim that I "bend" the comparative evidence against them. They further claim that as all other mammalian species exhibit a "serrated" appearance of the inferior olive, and as the elephant does not exhibit this appearance, that what was previously identified as the inferior olive is actually the trigeminal nucleus and vice versa. 

      For convenience, I will refer to IOM and VsensM as the identification of these structures according to Maseko et al (2013) and other authors and will use IOR and VsensR to refer to the identification forwarded in the study under review. <br /> The IOM/VsensR certainly does not have a serrated appearance in elephants. Indeed, from the plates supplied by the authors in response (Referee Fig. 2), the cytochrome oxidase image supplied and the image from Maseko et al (2013) shows a very similar appearance. There is no doubt that the authors are identifying structures that closely correspond to those provided by Maseko et al (2013). It is solely a contrast in what these nuclear complexes are called and the functional sequelae of the identification of these complexes (are they related to the trunk sensation or movement controlled by the cerebellum?) that is under debate.

      Elephants are part of the Afrotheria, thus the most relevant comparative data to resolve this issue will be the identification of these nuclei in other Afrotherian species. Below I provide images of these nuclear complexes, labelled in the standard nomenclature, across several Afrotherian species. 

      (A) Lesser hedgehog tenrec (Echinops telfairi) 

      Tenrecs brains are the most intensively studied of the Afrotherian brains, these extensive neuroanatomical studies undertaken primarily by Heinz Künzle. Below I append images (coronal sections stained with cresol violet) of the IO and Vsens (labelled in the standard mammalian manner) in the lesser hedgehog tenrec. It should be clear that the inferior olive is located in the ventral midline of the rostral medulla oblongata (just like the rat) and that this nucleus is not distinctly serrated. The Vsens is located in the lateral aspect of the medulla skirted laterally by the spinal trigeminal tract (Sp5). These images and the labels indicating structures correlate precisely with that provide by Künzle (1997, 10.1016, see his Figure 1K,L. Thus, in the first case of a related species, there is no serrated appearance of the inferior olive, the location of the inferior olive is confirmed through connectivity with the superior colliculus (a standard connection in mammals) by Künzle (1997), and the location of Vsens is what is considered to be typical for mammals. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report. 

      (B) Giant otter shrew (Potomogale velox) 

      The otter shrews are close relatives of the Tenrecs. Below I append images of cresyl violet (left column) and myelin (right column) stained coronal sections through the brainstem with the IO, Vsens and Sp5 labelled as per standard mammalian anatomy. Here we see hints of the serration of the IO as defined by the authors, but we also see many myelin stripes across the IO. Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report.

      (C) Four-toed sengi (Petrodromus tetradactylus) 

      The sengis are close relatives of the Tenrecs and otter shrews, these three groups being part of the Afroinsectiphilia, a distinct branch of the Afrotheria. Below I append images of cresyl violet (left column) and myelin (right column) stained coronal sections through the brainstem with the IO, Vsens and Sp5 labelled as per standard mammalian anatomy. Here we see vague hints of the serration of the IO (as defined by the authors), and we also see many myelin stripes across the IO. Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report. 

      (D) Rock hyrax (Procavia capensis) 

      The hyraxes, along with the sirens and elephants form the Paenungulata branch of the Afrotheria. Below I append images of cresyl violet (left column) and myelin (right column) stained coronal sections through the brainstem with the IO, Vsens and Sp5 labelled as per the standard mammalian anatomy. Here we see hints of the serration of the IO (as defined by the authors), but we also see evidence of a more "bulbous" appearance of subnuclei of the IO (particularly the principal nucleus), and we also see many myelin stripes across the IO. Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report. 

      (E) West Indian manatee (Trichechus manatus) 

      The sirens are the closest extant relatives of the elephants in the Afrotheria. Below I append images of cresyl violet (top) and myelin (bottom) stained coronal sections (taken from the University of Wisconsin-Madison Brain Collection, https://brainmuseum.org, and while quite low in magnification they do reveal the structures under debate) through the brainstem with the IO, Vsens and Sp5 labelled as per standard mammalian anatomy. Here we see the serration of the IO (as defined by the authors). Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report.

      These comparisons and the structural identification, with which the authors agree as they only distinguish the elephants from the other Afrotheria, demonstrate that the appearance of the IO can be quite variable across mammalian species, including those with a close phylogenetic affinity to the elephants. Not all mammal species possess a "serrated" appearance of the IO. Thus, it is more than just theoretically possible that the IO of the elephant appears as described prior to this study. 

      So what about elephants? Below I append a series of images from coronal sections through the African elephant brainstem stained for Nissl, myelin, and immunostained for calretinin. These sections are labelled according to standard mammalian nomenclature. In these complete sections of the elephant brainstem, we do not see a serrated appearance of the IOM (as described previously and in the current study by the authors). Rather the principal nucleus of the IOM appears to be bulbous in nature. In the current study, no image of myelin staining in the IOM/VsensR is provided by the authors. However, in the images I provide, we do see the reported myelin stripes in all stains - agreement between the authors and reviewer on this point. The higher magnification image to the bottom left of the plate shows one of the IOM/VsensR myelin stripes immunostained for calretinin, and within the myelin stripes axons immunopositive for calretinin are seen (labelled with an arrow). The climbing fibres of the elephant cerebellar cortex are similarly calretinin immunopositive (10.1159/000345565). In contrast, although not shown at high magnification, the fibres forming the Sp5 in the elephant (in the Maseko description, unnamed in the description of the authors) show no immunoreactivity to calretinin. 

      Comment: We appreciate the referee’s additional comments. We concede the possibility that some relatives of elephants have a less serrated inferior olive than most other mammals. We maintain, however, that the elephant inferior olive (our Figure 1J) has the serrated appearance seen in the vast majority of mammals.

      Change: None.

      Peripherin Immunostaining 

      In their revised manuscript the authors present immunostaining of peripherin in the elephant brainstem. This is an important addition (although it does replace the only staining of myelin provided by the authors which is unusual as the word myelin is in the title of the paper) as peripherin is known to specifically label peripheral nerves. In addition, as pointed out by the authors, peripherin also immunostains climbing fibres (Errante et al., 1998). The understanding of this staining is important in determining the identification of the IO and Vsens in the elephant, although it is not ideal for this task as there is some ambiguity. Errante and colleagues (1998; Fig. 1) show that climbing fibres are peripherin-immunopositive in the rat. But what the authors do not evaluate is the extensive peripherin staining in the rat Sp5 in the same paper (Errante et al, 1998, Fig. 2). The image provided by the authors of their peripherin immunostaining (their new Figure 2) shows what I would call the Sp5 of the elephant to be strongly peripherin immunoreactive, just like the rat shown in Errant et al (1998), and more over in the precise position of the rat Sp5! This makes sense as this is where the axons subserving the "extraordinary" tactile sensitivity of the elephant trunk would be found (in the standard model of mammalian brainstem anatomy). Interestingly, the peripherin immunostaining in the elephant is clearly lamellated...this coincides precisely with the description of the trigeminal sensory nuclei in the elephant by Maskeo et al (2013) as pointed out by the authors in their rebuttal. Errante et al (1998) also point out peripherin immunostaining in the inferior olive, but according to the authors this is only "weakly present" in the elephant IOM/VsensR. This latter point is crucial. Surely if the elephant has an extraordinary sensory innervation from the trunk, with 400 000 axons entering the brain, the VsensR/IOM should be highly peripherin-immunopositive, including the myelinated axon bundles?! In this sense, the authors argue against their own interpretation - either the elephant trunk is not a highly sensitive tactile organ, or the VsensR is not the trigeminal nuclei it is supposed to be. 

      Comment: We made sure that elephant climbing fibers are strongly peripherin-positive (our revised Figure 2). As we noted in already our previous ms, we see weak diffuse peripherin-reactivity in the trigeminal nucleus (the inferior olive according to the referee), but no peripherin-reactive axon bundles (i.e. climbing fibers) that are seen in the inferior olive of other species. We also see no peripherin-reactive axon bundles (i.e. the olivo-cerebellar tract) arriving in the trigeminal nucleus as the tissue surrounding the trigeminal nucleus is devoid of peripherin-reactivity. Again, this finding is incompatible with the referee’s ideas. As far as we can tell, the trigeminal fibers are not reactive for peripherin in the elephant, i.e. we did not observe peripherin-reactivity very close to the nerve entry, but unfortunately, we did not stain for peripherin-reactivity into the nerve. As the referee alludes to the absence of peripherin-reactivity in the trigeminal tract is a difference between rodents and elephants.

      Change: Our novel Figure 2.

      Summary: 

      (1) Comparative data of species closely related to elephants (Afrotherians) demonstrates that not all mammals exhibit the "serrated" appearance of the principal nucleus of the inferior olive. 

      (2) The location of the IO and Vsens as reported in the current study (IOR and VsensR) would require a significant, and unprecedented, rearrangement of the brainstem in the elephants independently. I argue that the underlying molecular and genetic changes required to achieve this would be so extreme that it would lead to lethal phenotypes. Arguing that the "switcheroo" of the IO and Vsens does occur in the elephant (and no other mammals) and thus doesn't lead to lethal phenotypes is a circular argument that cannot be substantiated. 

      (3) Myelin stripes in the subnuclei of the inferior olivary nuclear complex are seen across all related mammals as shown above. Thus, the observation made in the elephant by the authors in what they call the VsensR, is similar to that seen in the IO of related mammals, especially when the IO takes on a more bulbous appearance. These myelin stripes are the origin of the olivocerebellar pathway, and are indeed calretinin immunopositive in the elephant as I show. 

      (4) What the authors see aligns perfectly with what has been described previously, the only difference being the names that nuclear complexes are being called. But identifying these nuclei is important, as any functional sequelae, as extensively discussed by the authors, is entirely dependent upon accurately identifying these nuclei. 

      (4) The peripherin immunostaining scores an own goal - if peripherin is marking peripheral nerves (as the authors and I believe it is), then why is the VsensR/IOM only "weakly positive" for this stain? This either means that the "extraordinary" tactile sensitivity of the elephant trunk is non-existent, or that the authors have misinterpreted this staining. That there is extensive staining in the fibre pathway dorsal and lateral to the IOR (which I call the spinal trigeminal tract), supports the idea that the authors have misinterpreted their peripherin immunostaining.

      (5) Evolutionary expediency. The authors argue that what they report is an expedient way in which to modify the organisation of the brainstem in the elephant to accommodate the "extraordinary" tactile sensitivity. I disagree. As pointed out in my first review, the elephant cerebellum is very large and comprised of huge numbers of morphologically complex neurons. The inferior olivary nuclei in all mammals studied in detail to date, give rise to the climbing fibres that terminate on the Purkinje cells of the cerebellar cortex. It is more parsimonious to argue that, in alignment with the expansion of the elephant cerebellum (for motor control of the trunk), the inferior olivary nuclei (specifically the principal nucleus) have had additional neurons added to accommodate this cerebellar expansion. Such an addition of neurons to the principal nucleus of the inferior olive could readily lead to the loss of the serrated appearance of the principal nucleus of the inferior olive, and would require far less modifications in the developmental genetic program that forms these nuclei. This type of quantitative change appears to be the primary way in which structures are altered in the mammalian brainstem. 

      Comment: We still disagree with the referee. We note that our conclusions rest on the analysis of 8 elephant brainstems, which we sectioned in three planes and stained with a variety of metabolic and antibody stains and in which assigned two structures (the inferior olive and the trigeminal nucleus). Most of the evidence cited by the referee stems from a single paper, in which 147 structures were identified based on the analysis of a single brainstem sectioned in one plane and stained with a limited set of antibodies. Our synopsis of the evidence is the following.

      (1) We agree with the referee that concerning brainstem position our scheme of a ventromedial trigeminal nucleus and a dorsolateral inferior olive deviates from the usual mammalian position of these nuclei (i.e. a dorsolateral trigeminal nucleus and a ventromedial inferior olive).

      (2) Cytoarchitectonics support our partitioning scheme. The compact cellular appearance of our ventromedial trigeminal nucleus is characteristic of trigeminal nuclei. The serrated appearance of our dorsolateral inferior olive is characteristic of the mammalian inferior olive; we acknowledge that the referee claims exceptions here. To our knowledge, nobody has described a mammalian trigeminal nucleus with a serrated appearance (which would apply to the elephant in case the trigeminal nucleus is situated dorsolaterally).

      (3) Metabolic staining (Cyto-chrome-oxidase reactivity) supports our partitioning scheme. Specifically, our ventromedial trigeminal nucleus shows intense Cyto-chrome-oxidase reactivity as it is seen in the trigeminal nuclei of trigeminal tactile experts.

      (4) Isomorphism. The myelin stripes on our ventromedial trigeminal nucleus are isomorphic to trunk wrinkles. Isomorphism is a characteristic of somatosensory brain structures (barrel, barrelettes, nose-stripes, etc) and we know of no case, where such isomorphism was misleading.

      (5) The large-scale organization of our ventromedial trigeminal nuclei in anterior-posterior repeats is characteristic of the mammalian trigeminal nuclei. To our knowledge, no such organization has ever been reported for the inferior olive.

      (6) Connectivity analysis supports our partitioning scheme. According to our delineation of the elephant olivo-cerebellar tract, our dorsolateral inferior olive is connected via peripherin-positive climbing fibers to the cerebellum. In contrast, our ventromedial trigeminal nucleus (the referee’s inferior olive) is not connected via climbing fibers to the cerebellum.

      Change: As discussed, we advanced further evidence in this revision. Our partitioning scheme (a ventromedial trigeminal nucleus and a dorsolateral inferior olive) is better supported by data and makes more sense than the referee’s suggestion (a dorsolateral trigeminal nucleus and a ventromedial inferior olive). It should be published.

      Reviewer #3 (Public Review):

      Summary: 

      The study claims to investigate trunk representations in elephant trigeminal nuclei located in the brainstem. The researchers identify large protrusions visible from the ventral surface of the brainstem, which they examined using a range of histological methods. However, this ventral location is usually where the inferior olivary complex is found, which challenges the author's assertions about the nucleus under analysis. They find that this brainstem nucleus of elephants contains repeating modules, with a focus on the anterior and largest unit which they define as the putative nucleus principalis trunk module of the trigeminal. The nucleus exhibits low neuron density, with glia outnumbering neurons significantly. The study also utilizes synchrotron X-ray phase contrast tomography to suggest that myelin-stripe-axons traverse this module. The analysis maps myelin-rich stripes in several specimens and concludes that based on their number and patterning that they likely correspond with trunk folds; however this conclusion is not well supported if the nucleus has been misidentified. 

      Comment: The referee provides a summary of our work. The referee also notes that the correct identification of the trigeminal nucleus is critical to the message of our paper.

      Change: In line with these assessments we focused our revision efforts on the issue of trigeminal nucleus identification, please see our introductory comments and our response to Referee 2.

      Strengths: 

      The strength of this research lies in its comprehensive use of various anatomical methods, including Nissl staining, myelin staining, Golgi staining, cytochrome oxidase labeling, and synchrotron X-ray phase contrast tomography. The inclusion of quantitative data on cell numbers and sizes, dendritic orientation and morphology, and blood vessel density across the nucleus adds a quantitative dimension. Furthermore, the research is commendable for its high-quality and abundant images and figures, effectively illustrating the anatomy under investigation.

      Comment: We appreciate this positive assessment.

      Change: None

      Weaknesses: 

      While the research provides potentially valuable insights if revised to focus on the structure that appears to be inferior olivary nucleus, there are certain additional weaknesses that warrant further consideration. First, the suggestion that myelin stripes solely serve to separate sensory or motor modules rather than functioning as an "axonal supply system" lacks substantial support due to the absence of information about the neuronal origins and the termination targets of the axons. Postmortem fixed brain tissue limits the ability to trace full axon projections. While the study acknowledges these limitations, it is important to exercise caution in drawing conclusions about the precise role of myelin stripes without a more comprehensive understanding of their neural connections. 

      Comment: We understand these criticisms and the need for cautious interpretation. As we noted previously, we think that the Elife-publishing scheme, where critical referee commentary is published along with our ms, will make this contribution particularly valuable.

      Change: Our additional efforts to secure the correct identification of the trigeminal nucleus.

      Second, the quantification presented in the study lacks comparison to other species or other relevant variables within the elephant specimens (i.e., whole brain or brainstem volume). The absence of comparative data to different species limits the ability to fully evaluate the significance of the findings. Comparative analyses could provide a broader context for understanding whether the observed features are unique to elephants or more common across species. This limitation in comparative data hinders a more comprehensive assessment of the implications of the research within the broader field of neuroanatomy. Furthermore, the quantitative comparisons between African and Asian elephant specimens should include some measure of overall brain size as a covariate in the analyses. Addressing these weaknesses would enable a richer interpretation of the study's findings. 

      Comment: We understand, why the referee asks for additional comparative data, which would make our study more meaningful. We note that we already published a quantitative comparison of African and Asian elephant facial nuclei (Kaufmann et al. 2022). The quantitative differences between African and Asian elephant facial nuclei are similar in magnitude to what we observed here for the trigeminal nucleus, i.e. African elephants have about 10-15% more facial nucleus neurons than Asian elephants. The referee also notes that data on overall elephant brain size might be important for interpreting our data. We agree with this sentiment and we are preparing a ms on African and Asian elephant brain size. We find – unexpectedly given the larger body size of African elephants – that African elephants have smaller brains than Asian elephants. The finding might imply that African elephants, which have more facial nucleus neurons and more trigeminal nucleus trunk module neurons, are neurally more specialized in trunk control than Asian elephants.

      Change: We are preparing a further ms on African and Asian elephant brain size, a first version of this work has been submitted.

      Reviewer #4 (Public Review): 

      Summary: 

      The authors report a novel isomorphism in which the folds of the elephant trunk are recognizably mapped onto the principal sensory trigeminal nucleus in the brainstem. Further, they identifiy the enlarged nucleus as being situated in this species in an unusual ventral midline position. 

      Comment: The referee summarizes our work.

      Change: None.

      Strengths: 

      The identity of the purported trigeminal nucleus and the isomorphic mapping with the trunk folds is supported by multiple lines of evidence: enhanced staining for cytochrome oxidase, an enzyme associated with high metabolic activity; dense vascularization, consistent with high metabolic activity; prominent myelinated bundles that partition the nucleus in a 1:1 mapping of the cutaneous folds in the trunk periphery; near absence of labeling for the anti-peripherin antibody, specific for climbing fibers, which can be seen as expected in the inferior olive; and a high density of glia.

      Comment: The referee again reviews some of our key findings.

      Change: None. 

      Weaknesses: 

      Despite the supporting evidence listed above, the identification of the gross anatomical bumps, conspicuous in the ventral midline, is problematic. This would be the standard location of the inferior olive, with the principal trigeminal nucleus occupying a more dorsal position. This presents an apparent contradiction which at a minimum needs further discussion. Major species-specific specializations and positional shifts are well-documented for cortical areas, but nuclear layouts in the brainstem have been considered as less malleable. 

      Comment: The referee notes that our discrepancy with referee 2, needs to be addressed with further evidence and discussion, given the unusual position of both inferior olive and trigeminal nucleus in the partitioning scheme and that the mammalian brainstem tends to be positionally conservative. We agree with the referee. We note that – based on the immense size of the elephant trigeminal ganglion (50 g), half the size of a monkey brain – it was expected that the elephant trigeminal nucleus ought to be exceptionally large.

      Change: We did additional experimental work to resolve this matter: (i) We ascertained that elephant climbing fibers are strongly peripherin-positive. (ii) Based on elephant climbing fiber peripherin-reactivity we delineated the elephant olivo-cerebellar tract. We find that the olivo-cerebellar connects to the structure we refer to as inferior olive to the cerebellum. (iii) We also found that the trigeminal nucleus (the structure the referee refers to as inferior olive) appears to receive no climbing fibers. (iv) We provide indications that the tracing of the trigeminal nerve into the olivo-cerebellar tract by Maseko et al. 2023 was erroneous (Referee-Figure 1). These novel findings support our ideas.

      Reviewer #5 (Public Review): 

      After reading the manuscript and the concerns raised by reviewer 2 I see both sides of the argument - the relative location of trigeminal nucleus versus the inferior olive is quite different in elephants (and different from previous studies in elephants), but when there is a large disproportionate magnification of a behaviorally relevant body part at most levels of the nervous system (certainly in the cortex and thalamus), you can get major shifting in location of different structures. In the case of the elephant, it looks like there may be a lot of shifting. Something that is compelling is that the number of modules separated but the myelin bands correspond to the number of trunk folds which is different in the different elephants. This sort of modular division based on body parts is a general principle of mammalian brain organization (demonstrated beautifully for the cuneate and gracile nucleus in primates, VP in most of species, S1 in a variety of mammals such as the star nosed mole and duck-billed platypus). I don't think these relative changes in the brainstem would require major genetic programming - although some surely exists. Rodents and elephants have been independently evolving for over 60 million years so there is a substantial amount of time for changes in each l lineage to occur.

      I agree that the authors have identified the trigeminal nucleus correctly, although comparisons with more out groups would be needed to confirm this (although I'm not suggesting that the authors do this). I also think the new figure (which shows previous divisions of the brainstem versus their own) allows the reader to consider these issues for themselves. When reviewing this paper, I actually took the time to go through atlases of other species and even look at some of my own data from highly derived species. Establishing homology across groups based only on relative location is tough especially when there appears to be large shifts in relative location of structures. My thoughts are that the authors did an extraordinary amount of work on obtaining, processing and analyzing this extremely valuable tissue. They document their work with images of the tissue and their arguments for their divisions are solid. I feel that they have earned the right to speculate - with qualifications - which they provide. 

      Comment: The referee summarizes our work and appears to be convinced by the line of our arguments. We are most grateful for this assessment. We add, again, that the skeptical assessment of referee 2 will be published as well and will give the interested reader the possibility to view another perspective on our work.

      Change: None. 

      Recommendations for the authors: 

      Reviewer #1 (Recommendations For The Authors):

      With this manuscript being virtually identical to the previous version, it is possible that some of the definitive conclusions about having identified the elephant trigeminal nucleus and trunk representation should be moderated in a more nuanced manner, especially given the careful and experienced perspective from reviewers with first hand knowledge elephant neuroanatomy.

      Comment: We agree that both our first and second revisions were very much centered on the debate of the correct identification of the trigeminal nucleus and that our ms did not evolve as much in other regards. This being said we agree with Referee 2 that we needed to have this debate. We also think we advanced important novel data in this context (the delineation of elephant olivo-cerebellar tract through the peripherin-antibody).

      Changes: Our revised Figure 2. 

      The peripherin staining adds another level of argument to the authors having identified the trigeminal brainstem instead of the inferior olive, if differential expression of peripherin is strong enough to distinguish one structure from the other.

      Comment: We think we showed too little peripherin-antibody staining in our previous revision. We have now addressed this problem.

      Changes: Our revised Figure 2, i.e. the delineation of elephant olivo-cerebellar tract through the peripherin-antibody).

      There are some minor corrections to be made with the addition of Fig. 2., including renumbering the figures in the manuscript (e.g., 406, 521). 

      I continue to appreciate this novel investigation of the elephant brainstem and find it an interesting and thorough study, with the use of classical and modern neuroanatomical methods.

      Comment: We are thankful for this positive assessment.

      Reviewer #2 (Recommendations For The Authors):

      I do realise the authors are very unhappy with me and the reviews I have submitted. I do apologise if feelings have been hurt, and I do understand the authors put in a lot of hard work and thought to develop what they have; however, it is unfortunate that the work and thoughts are not correct. Science is about the search for the truth and sometimes we get it wrong. This is part of the scientific process and why most journals adhere to strict review processes of scientific manuscripts. As I said previously, the authors can use their data to write a paper describing and quantifying Golgi staining of neurons in the principal olivary nucleus of the elephant that should be published in a specialised journal and contextualised in terms of the motor control of the trunk and the large cerebellum of the elephant. 

      Comment: We appreciate the referee’s kind words. Also, no hard feelings from our side, this is just a scientific debate. In our experience, neuroanatomical debates are resolved by evidence and we note that we provide evidence strengthening our identification of the trigeminal nucleus and inferior olive. As far as we can tell from this effort and the substantial evidence accumulated, the referee is wrong.

      Reviewer #4 (Recommendations For The Authors):

      As a new reviewer, I have benefited from reading the previous reviews and Author response, even while having several new comments to add. 

      (1) The identification of the inferior olive and trigeminal nuclei is obviously center stage. An enlargement of the trigeminal nuclei is not necessarily problematic, given the published reports on the dramatic enlargement of the trigeminal nerve (Purkart et al., 2022). At issue is the conspicuous relocation of the trigeminal nuclei that is being promoted by Reveyaz et al. Conspicuous rearrangements are not uncommon; for example, primary sensory cortical fields in different species (fig. 1 in H.H.A. Oelschlager for dolphins; S. De Vreese et al. (2023) for cetaceans, L. Krubitzer on various species, in the context of evolution). The difficult point here concerns what looks like a rather conspicuous gross anatomical rearrangement, in BRAINSTEM - the assumption being that the brainstem bauplan is going to be specifically conservative and refractory to gross anatomical rearrangement. 

      Comment: We agree with the referee that the brainstem rearrangements are unexpected. We also think that the correct identification of nuclei needs to be at the center of our revision efforts.

      Change: Our revision provided further evidence (delineation of the olivo-cerebellar tract, characterization of the trigeminal nerve entry) about the identity of the nuclei we studied.

      Why would a major nucleus shift to such a different location? and how? Can ex vivo DTI provide further support of the correct identification? Is there other "disruption" in the brainstem? What occupies the traditional position of the trigeminal nuclei? An atlas-equivalent coronal view of the entire brainstem would be informative. The Authors have assembled multiple criteria to support their argument that the ventral "bumps" are in fact a translocated trigeminal principal nucleus: enhanced CO staining, enhanced vascularization, enhanced myelination (via Golgi stains and tomography), very scant labeling for a climbing fiber specific antibody ( anti-peripherin), vs. dense staining of this in the alternative structure that they identify as IO; and a high density of glia. Admittedly, this should be sufficient, but the proposed translocation (in the BRAINSTEM) is sufficiently startling that this is arguably NOT sufficient. <br /> The terminology of "putative" is helpful, but a more cogent presentation of the results and more careful discussion might succeed in winning over at least some of a skeptical readership. 

      Comment: We do not know, what led to the elephant brainstem rearrangements we propose. If the trigeminal nuclei had expanded isometrically in elephants from the ancestral pattern, one would have expected a brain with big lateral bumps, not the elephant brain with its big ventromedial bumps. We note, however, that very likely the expansion of the elephant trigeminal nuclei did not occur isometrically. Instead, the neural representation of the elephant nose expanded dramatically and in rodents the nose is represented ventromedially in the brainstem face representation. Thus, we propose a ‘ventromedial outgrowth model’ according to which the elephant ventromedial trigeminal bumps result from a ventromedially direct outgrowth of the ancestral ventromedial nose representation.

      We advanced substantially more evidence to support our partitioning scheme, including the delineation of the olivo-cerebellar tract based on peripherin-reactivity. We also identified problems in previous partitioning schemes, such as the claim that the trigeminal nerve continues into the ~4x smaller olivocerebellar tract (Referee-Figure 1C, D); we think such a flow of fibers, (which is also at odds with peripherin-antibody-reactivity and the appearance of nerve and olivocerebellar tract), is highly unlikely if not physically impossible. With all that we do not think that we overstate our case in our cautiously presented ms.

      Change: We added evidence on the identification of elephant trigeminal nuclei and inferior olive.

      (2) Role of myelin. While the photos of myelin are convincing, it would be nice to have further documentation. Gallyas? Would antibodies to MBP work? What is the myelin distribution in the "standard" trigeminal nuclei (human? macaque or chimpanzee?). What are alternative sources of the bundles? Regardless, I think it would be beneficial to de-emphasize this point about the role of myelin in demarcating compartments. <br /> I would in fact suggest an alternative (more neutral) title that might highlight instead the isomorphic feature; for example, "An isomorphic representation of Trunk folds in the Elephant Trigeminal Nucleus." The present title stresses myelin, but figure 1 already focuses on CO. Additionally, the folds are actually mentioned almost in passing until later in the manuscript. I recommend a short section on these at the beginning of the Results to serve as a useful framework.

      Here I'm inclined to agree with the Reviewer, that the Authors' contention that the myelin stipes serve PRIMARILY to separate trunk-fold domains is not particularly compelling and arguably a distraction. The point can be made, but perhaps with less emphasis. After all, the fact that myelin has multiple roles is well-established, even if frequently overlooked. In addition, the Authors might make better use of an extensive relevant literature related to myelin as a compartmental marker; for example, results and discussion in D. Haenelt....N. Weiskopf (eLife, 2023), among others. Another example is the heavily myelinated stria of Gennari in primate visual cortex, consisting of intrinsic pyramidal cell axons, but where the role of the myelination has still not been elucidated. 

      Comment: (1) Documentation of myelin. We note that we show further identification of myelinated fibers by the fluorescent dye fluomyelin in Figure 4B. We also performed additional myelin stains as the gold-myelin stain after the protocol of Schmued (Referee-Figure 2). In the end, nothing worked quite as well to visualize myelin-stripes as the bright-field images shown in Figure 4A and it is only the images that allowed us to match myelin-stripes to trunk folds. Hence, we focus our presentation on these images.

      (2) Title: We get why the referee envisions an alternative title. This being said, we would like to stick with our current title, because we feel it highlights the major novelty we discovered.

      (3) We agree with many of the other comments of the referee on myelin phenomenology. We missed the Haenelt reference pointed out by the referee and think it is highly relevant to our paper

      Change: 1. Review image 2. Inclusion of the Haenelt-reference.

      Author response image 2.

      Myelin stripes of the elephant trunk module visualized by Gold-chloride staining according to Schmued. A, Low magnification micrograph of the trunk module of African elephant Indra stained with AuCl according to Schmued. The putative finger is to the left, proximal is to the right. Myelin stripes can easily be recognized. The white box indicates the area shown in B. B, high magnification micrograph of two myelin stripes. Individual gold-stained (black) axons organized in myelin stripes can be recognized.

      Schmued, L. C. (1990). A rapid, sensitive histochemical stain for myelin in frozen brain sections. Journal of Histochemistry & Cytochemistry,38(5), 717-720.

      Are the "bumps" in any way "analogous" to the "brain warts" seen in entorhinal areas of some human brains (G. W. van Hoesen and A. Solodkin (1993)? 

      Comment: We think this is a similar phenomenon.

      Change: We included the Hoesen and A. Solodkin (1993) reference in our discussion.

      At least slightly more background (ie, a separate section or, if necessary, supplement) would be helpful, going into more detail on the several subdivisions of the ION and if these undergo major alterations in the elephant.

      Comment: The strength of the paper is the detailed delineation of the trunk module, based on myelin stripes and isomorphism. We don’t think we have strong evidence on ION subdivisions, because it appears the trigeminal tract cannot be easily traced in elephants. Accordingly, we find it difficult to add information here.

      Change: None.

      Is there evidence from the literature of other conspicuous gross anatomical translocations, in any species, especially in subcortical regions? 

      Comment: The best example that comes to mind is the star-nosed mole brainstem. There is a beautiful paper comparing the star-nosed mole brainstem to the normal mole brainstem (Catania et al 2011). The principal trigeminal nucleus in the star-nosed mole is far more rostral and also more medial than in the mole; still, such rearrangements are minor compared to what we propose in elephants.

      Catania, Kenneth C., Duncan B. Leitch, and Danielle Gauthier. "A star in the brainstem reveals the first step of cortical magnification." PloS one 6.7 (2011): e22406.

      Change: None.

      (3) A major point concerns the isomorphism between the putative trigeminal nuclei and the trunk specialization. I think this can be much better presented, at least with more discussion and other examples. The Authors mention about the rodent "barrels," but it seemed strange to me that they do not refer to their own results in pig (C. Ritter et al., 2023) nor the work from Ken Catania, 2002 (star-nosed mole; "fingerprints in the brain") or other that might be appropriate. I concur with the Reviewer that there should be more comparative data. 

      Comment: We agree.

      Change: We added a discussion of other isomorphisms including the the star-nosed mole to our paper.

      (4) Textual organization could be improved. 

      The Abstract all-important Introduction is a longish, semi "run-on" paragraph. At a minimum this should be broken up. The last paragraph of the Introduction puts forth five issues, but these are only loosely followed in the Results section. I think clarity and good organization is of the upmost importance in this manuscript. I recommend that the Authors begin the Results with a section on the trunk folds (currently figure 5, and discussion), continue with the several points related to the identification of the trigeminal nuclei, and continue with a parallel description of ION with more parallel data on the putative trigeminal and IO structures (currently referee Table 1, but incorporate into the text and add higher magnification of nucleus-specific cell types in the IO and trigeminal nuclei). Relevant comparative data should be included in the Discussion.

      Comment: 1. We agree with the referee that our abstract needed to be revised. 2. We also think that our ms was heavily altered by the insertion of the new Figure 2, which complemented Figure 1 from our first submission and is concerned with the identification of the inferior olive. From a standpoint of textual flow such changes were not ideal, but the revisions massively added to the certainty with which we identify the trigeminal nuclei. Thus, although we are not as content as we were with the flow, we think the ms advanced in the revision process and we would like to keep the Figure sequence as is. 3. We already noted above that we included additional comparative evidence.

      Change: 1. We revised our abstract. 2. We added comparative evidence.

      Reviewer #5 (Recommendations For The Authors): 

      The data is invaluable and provides insights into some of the largest mammals on the planet. 

      Comment: We are incredibly thankful for this positive assessment.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife Assessment

      This neuroimaging and electrophysiology study in a small cohort of congenital cataract patients with sight recovery aims to characterize the effects of early visual deprivation on excitatory and inhibitory balance in visual cortex. While contrasting sight-recovery with visually intact controls suggested the existence of persistent alterations in Glx/GABA ratio and aperiodic EEG signals, it provided only incomplete evidence supporting claims about the effects of early deprivation itself. The reported data were considered valuable, given the rare study population. However, the small sample sizes, lack of a specific control cohort and multiple methodological limitations will likely restrict usefulness to scientists working in this particular subfield.

      We thank the reviewing editors for their consideration and updated assessment of our manuscript after its first revision.

      In order to assess the effects of early deprivation, we included an age-matched, normally sighted control group recruited from the same community, measured in the same scanner and laboratory. This study design is analogous to numerous studies in permanently congenitally blind humans, which typically recruited sighted controls, but hardly ever individuals with a different, e.g. late blindness history. In order to improve the specificity of our conclusions, we used a frontal cortex voxel in addition to a visual cortex voxel (MRS). Analogously, we separately analyzed occipital and frontal electrodes (EEG).

      Moreover, we relate our findings in congenital cataract reversal individuals to findings in the literature on permanent congenital blindness. Note, there are, to the best of our knowledge, neither MRS nor resting-state EEG studies in individuals with permanent late blindness.

      Our participants necessarily have nystagmus and low visual acuity due to their congenital deprivation phase, and the existence of nystagmus is a recruitment criterion to diagnose congenital cataracts.

      It might be interesting for future studies to investigate individuals with transient late blindness. However, such a study would be ill-motivated had we not found differences between the most “extreme” of congenital visual deprivation conditions and normally sighted individuals (analogous to why earlier research on permanent blindness investigated permanent congenitally blind humans first, rather than permanently late blind humans, or both in the same study). Any result of these future work would need the reference to our study, and neither results in these additional groups would invalidate our findings.

      Since all our congenital cataract reversal individuals by definition had visual impairments, we included an eyes closed condition, both in the MRS and EEG assessment. Any group effect during the eyes closed condition cannot be due to visual acuity deficits changing the bottom-up driven visual activation.

      As we detail in response to review 3, our EEG analyses followed the standards in the field.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      In this human neuroimaging and electrophysiology study, the authors aimed to characterise effects of a period of visual deprivation in the sensitive period on excitatory and inhibitory balance in the visual cortex. They attempted to do so by comparing neurochemistry conditions ('eyes open', 'eyes closed') and resting state, and visually evoked EEG activity between ten congenital cataract patients with recovered sight (CC), and ten age-matched control participants (SC) with normal sight.

      First, they used magnetic resonance spectroscopy to measure in vivo neurochemistry from two locations, the primary location of interest in the visual cortex, and a control location in the frontal cortex. Such voxels are used to provide a control for the spatial specificity of any effects, because the single-voxel MRS method provides a single sampling location. Using MR-visible proxies of excitatory and inhibitory neurotransmission, Glx and GABA+ respectively, the authors report no group effects in GABA+ or Glx, no difference in the functional conditions 'eyes closed' and 'eyes open'. They found an effect of group in the ratio of Glx/GABA+ and no similar effect in the control voxel location. They then perform multiple exploratory correlations between MRS measures and visual acuity, and report a weak positive correlation between the 'eyes open' condition and visual acuity in CC participants.

      The same participants then took part in an EEG experiment. The authors selected two electrodes placed in the visual cortex for analysis and report a group difference in an EEG index of neural activity, the aperiodic intercept, as well as the aperiodic slope, considered a proxy for cortical inhibition. Control electrodes in the frontal region did not present with the same pattern. They report an exploratory correlation between the aperiodic intercept and Glx in one out of three EEG conditions.

      The authors report the difference in E/I ratio, and interpret the lower E/I ratio as representing an adaptation to visual deprivation, which would have initially caused a higher E/I ratio. Although intriguing, the strength of evidence in support of this view is not strong. Amongst the limitations are the low sample size, a critical control cohort that could provide evidence for higher E/I ratio in CC patients without recovered sight for example, and lower data quality in the control voxel. Nevertheless, the study provides a rare and valuable insight into experience-dependent plasticity in the human brain.

      Strengths of study

      How sensitive period experience shapes the developing brain is an enduring and important question in neuroscience. This question has been particularly difficult to investigate in humans. The authors recruited a small number of sight-recovered participants with bilateral congenital cataracts to investigate the effect of sensitive period deprivation on the balance of excitation and inhibition in the visual brain using measures of brain chemistry and brain electrophysiology. The research is novel, and the paper was interesting and well written.

      Limitations

      Low sample size. Ten for CC and ten for SC, and further two SC participants were rejected due to lack of frontal control voxel data. The sample size limits the statistical power of the dataset and increases the likelihood of effect inflation.

      In the updated manuscript, the authors have provided justification for their sample size by pointing to prior studies and the inherent difficulties in recruiting individuals with bilateral congenital cataracts. Importantly, this highlights the value the study brings to the field while also acknowledging the need to replicate the effects in a larger cohort.

      Lack of specific control cohort. The control cohort has normal vision. The control cohort is not specific enough to distinguish between people with sight loss due to different causes and patients with congenital cataracts with co-morbidities. Further data from a more specific populations, such as patients whose cataracts have not been removed, with developmental cataracts, or congenitally blind participants, would greatly improve the interpretability of the main finding. The lack of a more specific control cohort is a major caveat that limits a conclusive interpretation of the results.

      In the updated version, the authors have indicated that future studies can pursue comparisons between congenital cataract participants and cohorts with later sight loss.

      MRS data quality differences. Data quality in the control voxel appears worse than in the visual cortex voxel. The frontal cortex MRS spectrum shows far broader linewidth than the visual cortex (Supplementary Figures). Compared to the visual voxel, the frontal cortex voxel has less defined Glx and GABA+ peaks; lower GABA+ and Glx concentrations, lower NAA SNR values; lower NAA concentrations. If the data quality is a lot worse in the FC, then small effects may not be detectable.

      In the updated version, the authors have added more information that informs the reader of the MRS quality differences between voxel locations. This increases the transparency of their reporting and enhances the assessment of the results.

      Because of the direction of the difference in E/I, the authors interpret their findings as representing signatures of sight improvement after surgery without further evidence, either within the study or from the literature. However, the literature suggests that plasticity and visual deprivation drives the E/I index up rather than down. Decreasing GABA+ is thought to facilitate experience dependent remodelling. What evidence is there that cortical inhibition increases in response to a visual cortex that is over-sensitised to due congenital cataracts? Without further experimental or literature support this interpretation remains very speculative.

      The updated manuscript contains key reference from non-human work to justify their interpretation.

      Heterogeneity in patient group. Congenital cataract (CC) patients experienced a variety of duration of visual impairment and were of different ages. They presented with co-morbidities (absorbed lens, strabismus, nystagmus). Strabismus has been associated with abnormalities in GABAergic inhibition in the visual cortex. The possible interactions with residual vision and confounds of co-morbidities are not experimentally controlled for in the correlations, and not discussed.

      The updated document has addressed this caveat.

      Multiple exploratory correlations were performed to relate MRS measures to visual acuity (shown in Supplementary Materials), and only specific ones shown in the main document. The authors describe the analysis as exploratory in the 'Methods' section. Furthermore, the correlation between visual acuity and E/I metric is weak, not corrected for multiple comparisons. The results should be presented as preliminary, as no strong conclusions can be made from them. They can provide a hypothesis to test in a future study.

      This has now been done throughout the document and increases the transparency of the reporting.

      P.16 Given the correlation of the aperiodic intercept with age ("Age negatively correlated with the aperiodic intercept across CC and SC individuals, that is, a flattening of the intercept was observed with age"), age needs to be controlled for in the correlation between neurochemistry and the aperiodic intercept. Glx has also been shown to negatively correlates with age.

      This caveat has been addressed in the revised manuscript.

      Multiple exploratory correlations were performed to relate MRS to EEG measures (shown in Supplementary Materials), and only specific ones shown in the main document. Given the multiple measures from the MRS, the correlations with the EEG measures were exploratory, as stated in the text, p.16, and in Fig.4. yet the introduction said that there was a prior hypothesis "We further hypothesized that neurotransmitter changes would relate to changes in the slope and intercept of the EEG aperiodic activity in the same subjects." It would be great if the text could be revised for consistency and the analysis described as exploratory.

      This has been done throughout the document and increases the transparency of the reporting.

      The analysis for the EEG needs to take more advantage of the available data. As far as I understand, only two electrodes were used, yet far more were available as seen in their previous study (Ossandon et al., 2023). The spatial specificity is not established. The authors could use the frontal cortex electrode (FP1, FP2) signals as a control for spatial specificity in the group effects, or even better, all available electrodes and correct for multiple comparisons. Furthermore, they could use the aperiodic intercept vs Glx in SC to evaluate the specificity of the correlation to CC.

      This caveat has been addressed. The authors have added frontal electrodes to their analysis, providing an essential regional control for the visual cortex location.

      Comments on the latest version:

      The authors have made reasonable adjustments to their manuscript that addressed most of my comments by adding further justification for their methodology, essential literature support, pointing out exploratory analyses, limitations and adding key control analyses. Their revised manuscript has overall improved, providing valuable information, though the evidence that supports their claims is still incomplete.

      We thank the reviewer for suggesting ways to improve our manuscript and carefully reassessing our revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      The study examined 10 congenitally blind patients who recovered vision through the surgical removal of bilateral dense cataracts, measuring neural activity and neuro chemical profiles from the visual cortex. The declared aim is to test whether restoring visual function after years of complete blindness impacts excitation/inhibition balance in the visual cortex.

      Strengths:

      The findings are undoubtedly useful for the community, as they contribute towards characterising the many ways in which this special population differs from normally sighted individuals. The combination of MRS and EEG measures is a promising strategy to estimate a fundamental physiological parameter - the balance between excitation and inhibition in the visual cortex, which animal studies show to be heavily dependent upon early visual experience. Thus, the reported results pave the way for further studies, which may use a similar approach to evaluate more patients and control groups.

      Weaknesses:

      The main methodological limitation is the lack of an appropriate comparison group or condition to delineate the effect of sight recovery (as opposed to the effect of congenital blindness). Few previous studies suggested that Excitation/Inhibition ratio in the visual cortex is increased in congenitally blind patients; the present study reports that E/I ratio decreases instead. The authors claim that this implies a change of E/I ratio following sight recovery. However, supporting this claim would require showing a shift of E/I after vs. before the sight-recovery surgery, or at least it would require comparing patients who did and did not undergo the sight-recovery surgery (as common in the field).

      We thank the reviewer for suggesting ways to improve our manuscript and carefully reassessing our revised manuscript.

      Since we have not been able to acquire longitudinal data with the experimental design of the present study in congenital cataract reversal individuals, we compared the MRS and EEG results of congenital cataract reversal individuals  to published work in congenitally permanent blind individuals. We consider this as a resource saving approach. We think that the results of our cross-sectional study now justify the costs and enormous efforts (and time for the patients who often have to travel long distances) associated with longitudinal studies in this rare population.

      There are also more technical limitations related to the correlation analyses, which are partly acknowledged in the manuscript. A bland correlation between GLX/GABA and the visual impairment is reported, but this is specific to the patients group (N=10) and would not hold across groups (the correlation is positive, predicting the lowest GLX/GABA ratio values for the sighted controls - opposite of what is found). There is also a strong correlation between GLX concentrations and the EEG power at the lowest temporal frequencies. Although this relation is intriguing, it only holds for a very specific combination of parameters (of the many tested): only with eyes open, only in the patients group.

      Given the exploratory nature of the correlations, we do not base the majority of our conclusions on this analysis. There are no doubts that the reported correlations need replication; however, replication is only possible after a first report. Thus, we hope to motivate corresponding analyses in further studies.

      It has to be noted that in the present study significance testing for correlations were corrected for multiple comparisons, and that some findings replicate earlier reports (e.g. effects on EEG aperiodic slope, alpha power, and correlations with chronological age).

      Conclusions:

      The main claim of the study is that sight recovery impacts the excitation/inhibition balance in the visual cortex, estimated with MRS or through indirect EEG indices. However, due to the weaknesses outlined above, the study cannot distinguish the effects of sight recovery from those of visual deprivation. Moreover, many aspects of the results are interesting but their validation and interpretation require additional experimental work.

      We interpret the group differences between individuals tested years after congenital visual deprivation and normally sighted individuals as supportive of the E/I ratio being impacted by congenital visual deprivation. In the absence of a sensitive period for the development of an E/I ratio, individuals with a transient phase of congenital blindness might have developed a visual system indistinguishable  from normally sighted individuals. As we demonstrate, this is not so. Comparing the results of congenitally blind humans with those of congenitally permanently blind humans (from previous studies) allowed us to identify changes of E/I ratio, which add to those found for congenital blindness.  

      We thank the reviewer for the helpful comments and suggestions related to the first submission and first revision of our manuscript. We are keen to translate some of them into future studies.

      Reviewer #3 (Public review):

      This manuscript examines the impact of congenital visual deprivation on the excitatory/inhibitory (E/I) ratio in the visual cortex using Magnetic Resonance Spectroscopy (MRS) and electroencephalography (EEG) in individuals whose sight was restored. Ten individuals with reversed congenital cataracts were compared to age-matched, normally sighted controls, assessing the cortical E/I balance and its interrelationship and to visual acuity. The study reveals that the Glx/GABA ratio in the visual cortex and the intercept and aperiodic signal are significantly altered in those with a history of early visual deprivation, suggesting persistent neurophysiological changes despite visual restoration.

      First of all, I would like to disclose that I am not an expert in congenital visual deprivation, nor in MRS. My expertise is in EEG (particularly in the decomposition of periodic and aperiodic activity) and statistical methods.

      Although the authors addressed some of the concerns of the previous version, major concerns and flaws remain in terms of methodological and statistical approaches along with the (over)interpretation of the results. Specific concerns include:

      (1 3.1) Response to Variability in Visual Deprivation<br /> Rather than listing the advantages and disadvantages of visual deprivation, I recommend providing at least a descriptive analysis of how the duration of visual deprivation influenced the measures of interest. This would enhance the depth and relevance of the discussion.

      Although Review 2 and Review 3 (see below) pointed out problems in interpreting multiple correlational analyses in small samples, we addressed this request by reporting such correlations between visual deprivation history and measured EEG/MRS outcomes.

      Calculating the correlation between duration of visual deprivation and behavioral or brain measures is, in fact, a common suggestion. The existence of sensitive periods, which are typically assumed to not follow a linear gradual decline of neuroplasticity, does not necessary allow predicting a correlation with duration of blindness. Daphne Maurer has additionally worked on the concept of “sleeper effects” (Maurer et al., 2007), that is, effects on the brain and behavior by early deprivation which are observed only later in life when the function/neural circuits matures.

      In accordance with this reasoning, we did not observe a significant correlation between duration of visual deprivation and any of our dependent variables.

      (2 3.2) Small Sample Size<br /> The issue of small sample size remains problematic. The justification that previous studies employed similar sample sizes does not adequately address the limitation in the current study. I strongly suggest that the correlation analyses should not feature prominently in the main manuscript or the abstract, especially if the discussion does not substantially rely on these correlations. Please also revisit the recommendations made in the section on statistical concerns.

      In the revised manuscript, we explicitly mention that our sample size is not atypical for the special group investigated, but that a replication of our results in larger samples would foster their impact. We only explicitly mention correlations that survived stringent testing for multiple comparisons in the main manuscript.

      Given the exploratory nature of the correlations, we have not based the majority of our claims on this analysis.

      (3 3.3) Statistical Concerns<br /> While I appreciate the effort of conducting an independent statistical check, it merely validates whether the reported statistical parameters, degrees of freedom (df), and p-values are consistent. However, this does not address the appropriateness of the chosen statistical methods.

      We did not intend for the statcheck report to justify the methods used for statistics, which we have done in a separate section with normality and homogeneity testing (Supplementary Material S9), and references to it in the descriptions of the statistical analyses (Methods, Page 13, Lines 326-329 and Page 15, Lines 400-402).

      Several points require clarification or improvement:<br /> (4) Correlation Methods: The manuscript does not specify whether the reported correlation analyses are based on Pearson or Spearman correlation.

      The depicted correlations are Pearson correlations. We will add this information to the Methods.

      (5) Confidence Intervals: Include confidence intervals for correlations to represent the uncertainty associated with these estimates.

      We have added the confidence intervals for all measured correlations to the second revision of our manuscript.

      (6) Permutation Statistics: Given the small sample size, I recommend using permutation statistics, as these are exact tests and more appropriate for small datasets.

      Our study focuses on a rare population, with a sample size limited by the availability of participants. Our findings provide exploratory insights rather than make strong inferential claims. To this end, we have ensured that our analysis adheres to key statistical assumptions (Shapiro-Wilk as well as Levene’s tests, Supplementary Material S9), and reported our findings with effect sizes, appropriate caution and context.

      (7) Adjusted P-Values: Ensure that reported Bonferroni corrected p-values (e.g., p > 0.999) are clearly labeled as adjusted p-values where applicable.

      In the revised manuscript, we have changed Figure 4 to say ‘adjusted p,’  which we indeed reported.

      (8) Figure 2C

      Figure 2C still lacks crucial information that the correlation between Glx/GABA ratio and visual acuity was computed solely in the control group (as described in the rebuttal letter). Why was this analysis restricted to the control group? Please provide a rationale.

      Figure 2C depicts the correlation between Glx/GABA+ ratio and visual acuity in the congenital cataract reversal group, not the control group. This is mentioned in the Figure 2 legend, as well as in the main text where the figure is referred to (Page 18, Line 475).

      The correlation analyses between visual acuity and MRS/EEG measures were only performed in the congenital cataract reversal group since the sighed control group comprised of individuals with vision in the normal range; thus this analyses would not make sense. Table 1 with the individual visual acuities for all participants, including the normally sighted controls, shows the low variance in the latter group.  

      For variables in which no apiori group differences in variance were predicted, we performed the correlation analyses across groups (see Supplementary Material S12, S15).

      We have now highlighted these motivations more clearly in the Methods of the revised manuscript (Page 16, Lines 405-410).

      (9 3.4) Interpretation of Aperiodic Signal

      Relying on previous studies to interpret the aperiodic slope as a proxy for excitation/inhibition (E/I) does not make the interpretation more robust.

      How to interpret aperiodic EEG activity has been subject of extensive investigation. We cite studies which provide evidence from multiple species (monkeys, humans) and measurements (EEG, MEG, ECoG), including studies which pharmacologically manipulated E/I balance.

      Whether our findings are robust, in fact, requires a replication study. Importantly, we analyzed the intercept of the aperiodic activity fit as well, and discuss results related to the intercept.

      Quote:

      “(3.4) Interpretation of aperiodic signal:

      - Several recent papers demonstrated that the aperiodic signal measured in EEG or ECoG is related to various important aspects such as age, skull thickness, electrode impedance, as well as cognition. Thus, currently, very little is known about the underlying effects which influence the aperiodic intercept and slope. The entire interpretation of the aperiodic slope as a proxy for E/I is based on a computational model and simulation (as described in the Gao et al. paper).

      Apart from the modeling work from Gao et al., multiple papers which have also been cited which used ECoG, EEG and MEG and showed concomitant changes in aperiodic activity with pharmacological manipulation of the E/I ratio (Colombo et al., 2019; Molina et al., 2020; Muthukumaraswamy & Liley, 2018). Further, several prior studies have interpreted changes in the aperiodic slope as reflective of changes in the E/I ratio, including studies of developmental groups (Favaro et al., 2023; Hill et al., 2022; McSweeney et al., 2023; Schaworonkow & Voytek, 2021) as well as patient groups (Molina et al., 2020; Ostlund et al., 2021).

      - The authors further wrote: We used the slope of the aperiodic (1/f) component of the EEG spectrum as an estimate of E/I ratio (Gao et al., 2017; Medel et al., 2020; Muthukumaraswamy & Liley, 2018). This is a highly speculative interpretation with very little empirical evidence. These papers were conducted with ECoG data (mostly in animals) and mostly under anesthesia. Thus, these studies only allow an indirect interpretation by what the 1/f slope in EEG measurements is actually influenced.

      Note that Muthukumaraswamy et al. (2018) used different types of pharmacological manipulations and analyzed periodic and aperiodic MEG activity in humans, in addition to monkey ECoG (Muthukumaraswamy & Liley, 2018). Further, Medel et al. (now published as Medel et al., 2023) compared EEG activity in addition to ECoG data after propofol administration. The interpretation of our results are in line with a number of recent studies in developing (Hill et al., 2022; Schaworonkow & Voytek, 2021) and special populations using EEG. As mentioned above, several prior studies have used the slope of the 1/f component/aperiodic activity as an indirect measure of the E/I ratio (Favaro et al., 2023; Hill et al., 2022; McSweeney et al., 2023; Molina et al., 2020; Ostlund et al., 2021; Schaworonkow & Voytek, 2021), including studies using scalp-recorded EEG from humans.

      In the introduction of the revised manuscript, we have made more explicit that this metric is indirect (Page 3, Line 91), (additionally see Discussion, Page 24, Lines 644-645, Page 25, Lines 650-657).

      While a full understanding of aperiodic activity needs to be provided, some convergent ideas have emerged. We think that our results contribute to this enterprise, since our study is, to the best of our knowledge, the first which assessed MRS measured neurotransmitter levels and EEG aperiodic activity. “

      (10) Additionally, the authors state:

      "We cannot think of how any of the exploratory correlations between neurophysiological measures and MRS measures could be accounted for by a difference e.g. in skull thickness."

      (11) This could be addressed directly by including skull thickness as a covariate or visualizing it in scatterplots, for instance, by representing skull thickness as the size of the dots.

      We are not aware of any study that would justify such an analysis.

      Our analyses were based on previous findings in the literature.

      Since to the best of our knowledge, no evidence exists that congenital cataracts go together with changes in skull thickness, and that skull thickness might selectively modulate visual cortex Glx/GABA+ but not NAA measures, we decided against following this suggestion.

      Notably, the neurotransmitter concentration reported here is after tissue segmentation of the voxel region. The tissue fraction was shown to not differ between groups in the MRS voxels (Supplementary Material S4). The EEG electrode impedance was lowered to <10 kOhm in every participant (Methods, Page 13, Line 344), and preparation was identical across groups.

      (12 3.5) Problems with EEG Preprocessing and Analysis

      Downsampling: The decision to downsample the data to 60 Hz "to match the stimulation rate" is problematic. This choice conflates subsequent spectral analyses due to aliasing issues, as explained by the Nyquist theorem. While the authors cite prior studies (Schwenk et al., 2020; VanRullen & MacDonald, 2012) to justify this decision, these studies focused on alpha (8-12 Hz), where aliasing is less of a concern compared of analyzing aperiodic signal. Furthermore, in contrast, the current study analyzes the frequency range from 1-20 Hz, which is too narrow for interpreting the aperiodic signal as E/I. Typically, this analysis should include higher frequencies, spanning at least 1-30 Hz or even 1-45 Hz (not 20-40 Hz).

      As previously mentied in the Methods (Page 15 Line 376) and the previous response, the pop_resample function used by EEGLAB applies an anti-aliasing filter, at half the resampling frequency (as per the Nyquist theorem

      https://eeglab.org/tutorials/05_Preprocess/resampling.html). The upper cut off of the low pass filter set by EEGlab prior to down sampling (30 Hz) is still far above the frequency of interest in the current study  (1-20 Hz), thus allowing us to derive valid results.

      Quote:

      “- The authors downsampled the data to 60Hz to "to match the stimulation rate". What is the intention of this? Because the subsequent spectral analyses are conflated by this choice (see Nyquist theorem).

      This data were collected as part of a study designed to evoke alpha activity with visual white-noise, which ranged in luminance with equal power at all frequencies from 1-60 Hz, restricted by the refresh rate of the monitor on which stimuli were presented (Pant et al., 2023). This paradigm and method was developed by VanRullen and colleagues (Schwenk et al., 2020; Vanrullen & MacDonald, 2012), wherein the analysis requires the same sampling rate between the presented frequencies and the EEG data. The downsampling function used here automatically applies an anti-aliasing filter (EEGLAB 2019) .”

      Moreover, the resting-state data were not resampled to 60 Hz. We have made this clearer in the Methods of the second revision (Page 15, Line 367).

      Our consistent results of group differences across all three EEG conditions, thus, exclude any possibility that they were driven by aliasing artifacts.

      The expected effects of this anti-aliasing filter can be seen in the attached Author response image 1, showing an example participant’s spectrum in the 1-30 Hz range (as opposed to the 1-20 Hz plotted in the manuscript), clearly showing a 30-40 dB drop at 30 Hz. Any aliasing due to, for example, remaining line noise, would additionally be visible in this figure (as well as Figure 3) as a peak.

      Author response image 1.

      Power spectral density of one congenital cataract-reversal (CC) participant in the visual stimulation condition across all channels. The reduced power at 30 Hz shows the effects of the anti-aliasing filter applied by EEGLAB’s pop_resample function.

      As we stated in the manuscript, and in previous reviews, so far there has been no consensus on the exact range of measuring aperiodic activity. We made a principled decision based on the literature (showing a knee in aperiodic fits of this dataset at 20 Hz) (Medel et al., 2023; Ossandón et al., 2023), data quality (possible contamination by line noise at higher frequencies) and the purpose of the visual stimulation experiment (to look at the lower frequency range by stimulating up to 60 Hz, thereby limiting us to quantifying below 30 Hz), that 1-20 Hz would be the fit range in this dataset.

      Quote:

      “(3) What's the underlying idea of analyzing two separate aperiodic slopes (20-40Hz and 1-19Hz). This is very unusual to compute the slope between 20-40 Hz, where the SNR is rather low.

      "Ossandón et al. (2023), however, observed that in addition to the flatter slope of the aperiodic power spectrum in the high frequency range (20-40 Hz), the slope of the low frequency range (1-19 Hz) was steeper in both, congenital cataract-reversal individuals, as well as in permanently congenitally blind humans."

      The present manuscript computed the slope between 1-20 Hz. Ossandón et al. as well as Medel et al. (2023) found a “knee” of the 1/f distribution at 20 Hz and describe further the motivations for computing both slope ranges. For example, Ossandón et al. used a data driven approach and compared single vs. dual fits and found that the latter fitted the data better. Additionally, they found the best fit if a knee at 20 Hz was used. We would like to point out that no standard range exists for the fitting of the 1/f component across the literature and, in fact, very different ranges have been used (Gao et al., 2017; Medel et al., 2023; Muthukumaraswamy & Liley, 2018). “

      (13) Baseline Removal: Subtracting the mean activity across an epoch as a baseline removal step is inappropriate for resting-state EEG data. This preprocessing step undermines the validity of the analysis. The EEG dataset has fundamental flaws, many of which were pointed out in the previous review round but remain unaddressed. In its current form, the manuscript falls short of standards for robust EEG analysis. If I were reviewing for another journal, I would recommend rejection based on these flaws.

      The baseline removal step from each epoch serves to remove the DC component of the recording and detrend the data. This is a standard preprocessing step (included as an option in preprocessing pipelines recommended by the EEGLAB toolbox, FieldTrip toolbox and MNE toolbox), additionally necessary to improve the efficacy of ICA decomposition (Groppe et al., 2009).

      In the previous review round, a clarification of the baseline timing was requested, which we added. Beyond this request, there was no mention of the appropriateness of the baseline removal and/or a request to provide reasons for why it might not undermine the validity of the analysis.

      Quote:

      “- "Subsequently, baseline removal was conducted by subtracting the mean activity across the length of an epoch from every data point." The actual baseline time segment should be specified.

      The time segment was the length of the epoch, that is, 1 second for the resting state conditions and 6.25 seconds for the visual stimulation conditions. This has been explicitly stated in the revised manuscript (Page 13, Line 354).”

      Prior work in the time (not frequency) domain on event-related potential (ERP) analysis has suggested that the baselining step might cause spurious effects (Delorme, 2023) (although see (Tanner et al., 2016)). We did not perform ERP analysis at any stage. One recent study suggests spurious group differences in the 1/f signal might be driven by an inappropriate dB division baselining method (Gyurkovics et al., 2021), which we did not perform.

      Any effect of our baselining procedure on the FFT spectrum would be below the 1 Hz range, which we did not analyze.  

      Each of the preprocessing steps in the manuscript match pipelines described and published in extensive prior work. We document how multiple aspects of our EEG results replicate prior findings (Supplementary Material S15, S18, S19), reports of other experimenters, groups and locations, validating that our results are robust.

      We therefore reject the claim of methodological flaws in our EEG analyses in the strongest possible terms.

      Quote:

      “(3.5) Problems with EEG preprocessing and analysis:

      - It seems that the authors did not identify bad channels nor address the line noise issue (even a problem if a low pass filter of below-the-line noise was applied).

      As pointed out in the methods and Figure 1, we only analyzed data from two occipital channels, O1 and O2 neither of which were rejected for any participant. Channel rejection was performed for the larger dataset, published elsewhere (Ossandón et al., 2023; Pant et al., 2023). As control sites we added the frontal channels FP1 and Fp2 (see Supplementary Material S14)

      Neither Ossandón et al. (2023) nor Pant et al. (2023) considered frequency ranges above 40 Hz to avoid any possible contamination with line noise. Here, we focused on activity between 0 and 20 Hz, definitely excluding line noise contaminations (Methods, Page 14, Lines 365-367). The low pass filter (FIR, 1-45 Hz) guaranteed that any spill-over effects of line noise would be restricted to frequencies just below the upper cutoff frequency.

      Additionally, a prior version of the analysis used spectrum interpolation to remove line noise; the group differences remained stable (Ossandón et al., 2023). We have reported this analysis in the revised manuscript (Page 14, Lines 364-357).

      Further, both groups were measured in the same lab, making line noise (~ 50 Hz) as an account for the observed group effects in the 1-20 Hz frequency range highly unlikely. Finally, any of the exploratory MRS-EEG correlations would be hard to explain if the EEG parameters would be contaminated with line noise.

      - What was the percentage of segments that needed to be rejected due to the 120μV criteria? This should be reported specifically for EO & EC and controls and patients.

      The mean percentage of 1 second segments rejected for each resting state condition and the percentage of 6.25 long segments rejected in each group for the visual stimulation condition have been added to the revised manuscript (Supplementary Material S10), and referred to in the Methods on Page 14, Lines 372-373).

      - The authors downsampled the data to 60Hz to "to match the stimulation rate". What is the intention of this? Because the subsequent spectral analyses are conflated by this choice (see Nyquist theorem).

      This data were collected as part of a study designed to evoke alpha activity with visual white-noise, which changed in luminance with equal power at all frequencies from 1-60 Hz, restricted by the refresh rate of the monitor on which stimuli were presented (Pant et al., 2023). This paradigm and method was developed by VanRullen and colleagues (Schwenk et al., 2020; VanRullen & MacDonald, 2012), wherein the analysis requires the same sampling rate between the presented frequencies and the EEG data. The downsampling function used here automatically applies an anti-aliasing filter (EEGLAB 2019) .

      - "Subsequently, baseline removal was conducted by subtracting the mean activity across the length of an epoch from every data point." The actual baseline time segment should be specified.

      The time segment was the length of the epoch, that is, 1 second for the resting state conditions and 6.25 seconds for the visual stimulation conditions. This has now been explicitly stated in the revised manuscript (Page 14, Lines 379-380).

      - "We excluded the alpha range (8-14 Hz) for this fit to avoid biasing the results due to documented differences in alpha activity between CC and SC individuals (Bottari et al., 2016; Ossandón et al., 2023; Pant et al., 2023)." This does not really make sense, as the FOOOF algorithm first fits the 1/f slope, for which the alpha activity is not relevant.

      We did not use the FOOOF algorithm/toolbox in this manuscript. As stated in the Methods, we used a 1/f fit to the 1-20 Hz spectrum in the log-log space, and subtracted this fit from the original spectrum to obtain the corrected spectrum. Given the pronounced difference in alpha power between groups (Bottari et al., 2016; Ossandón et al., 2023; Pant et al., 2023), we were concerned it might drive differences in the exponent values. Our analysis pipeline had been adapted from previous publications of our group and other labs (Ossandón et al., 2023; Voytek et al., 2015; Waschke et al., 2017).

      We have conducted the analysis with and without the exclusion of the alpha range, as well as using the FOOOF toolbox both in the 1-20 Hz and 20-40 Hz ranges (Ossandón et al., 2023). The findings of a steeper slope in the 1-20 Hz range as well as lower alpha power in CC vs SC individuals remained stable. In Ossandón et al., the comparison between the piecewise fits and FOOOF fits led the authors to use the former, as it outperformed the FOOOF algorithm for their data.

      - The model fits of the 1/f fitting for EO, EC, and both participant groups should be reported.

      In Figure 3 of the manuscript, we depicted the mean spectra and 1/f fits for each group.

      In the revised manuscript, we added the fit quality metrics (average R<sup>2</sup> values > 0.91 for each group and condition) (Methods Page 15, Lines 395-396; Supplementary Material S11) and additionally show individual subjects’ fits (Supplementary Material S11). “

      (14) The authors mention:

      "The EEG data sets reported here were part of data published earlier (Ossandón et al., 2023; Pant et al., 2023)." Thus, the statement "The group differences for the EEG assessments corresponded to those of a larger sample of CC individuals (n=38) " is a circular argument and should be avoided."

      The authors addressed this comment and adjusted the statement. However, I do not understand, why not the full sample published earlier (Ossandón et al., 2023) was used in the current study?

      The recording of EEG resting state data stated in 2013, while MRS testing could only be set up by the second half of 2019. Moreover, not all subjects who qualify for EEG recording qualify for being scanned (e.g. due to MRI safety, claustrophobia)

      References

      Bottari, D., Troje, N. F., Ley, P., Hense, M., Kekunnaya, R., & Röder, B. (2016). Sight restoration after congenital blindness does not reinstate alpha oscillatory activity in humans. Scientific Reports. https://doi.org/10.1038/srep24683

      Colombo, M. A., Napolitani, M., Boly, M., Gosseries, O., Casarotto, S., Rosanova, M., Brichant, J. F., Boveroux, P., Rex, S., Laureys, S., Massimini, M., Chieregato, A., & Sarasso, S. (2019). The spectral exponent of the resting EEG indexes the presence of consciousness during unresponsiveness induced by propofol, xenon, and ketamine. NeuroImage, 189(September 2018), 631–644. https://doi.org/10.1016/j.neuroimage.2019.01.024

      Delorme, A. (2023). EEG is better left alone. Scientific Reports, 13(1), 2372. https://doi.org/10.1038/s41598-023-27528-0

      Favaro, J., Colombo, M. A., Mikulan, E., Sartori, S., Nosadini, M., Pelizza, M. F., Rosanova, M., Sarasso, S., Massimini, M., & Toldo, I. (2023). The maturation of aperiodic EEG activity across development reveals a progressive differentiation of wakefulness from sleep. NeuroImage, 277. https://doi.org/10.1016/J.NEUROIMAGE.2023.120264

      Gao, R., Peterson, E. J., & Voytek, B. (2017). Inferring synaptic excitation/inhibition balance from field potentials. NeuroImage, 158(March), 70–78. https://doi.org/10.1016/j.neuroimage.2017.06.078

      Groppe, D. M., Makeig, S., & Kutas, M. (2009). Identifying reliable independent components via split-half comparisons. NeuroImage, 45(4), 1199–1211. https://doi.org/10.1016/j.neuroimage.2008.12.038

      Gyurkovics, M., Clements, G. M., Low, K. A., Fabiani, M., & Gratton, G. (2021). The impact of 1/f activity and baseline correction on the results and interpretation of time-frequency analyses of EEG/MEG data: A cautionary tale. NeuroImage, 237. https://doi.org/10.1016/j.neuroimage.2021.118192

      Hill, A. T., Clark, G. M., Bigelow, F. J., Lum, J. A. G., & Enticott, P. G. (2022). Periodic and aperiodic neural activity displays age-dependent changes across early-to-middle childhood. Developmental Cognitive Neuroscience, 54, 101076. https://doi.org/10.1016/J.DCN.2022.101076

      Maurer, D., Mondloch, C. J., & Lewis, T. L. (2007). Sleeper effects. In Developmental Science. https://doi.org/10.1111/j.1467-7687.2007.00562.x

      McSweeney, M., Morales, S., Valadez, E. A., Buzzell, G. A., Yoder, L., Fifer, W. P., Pini, N., Shuffrey, L. C., Elliott, A. J., Isler, J. R., & Fox, N. A. (2023). Age-related trends in aperiodic EEG activity and alpha oscillations during early- to middle-childhood. NeuroImage, 269, 119925. https://doi.org/10.1016/j.neuroimage.2023.119925

      Medel, V., Irani, M., Crossley, N., Ossandón, T., & Boncompte, G. (2023). Complexity and 1/f slope jointly reflect brain states. Scientific Reports, 13(1), 21700. https://doi.org/10.1038/s41598-023-47316-0

      Molina, J. L., Voytek, B., Thomas, M. L., Joshi, Y. B., Bhakta, S. G., Talledo, J. A., Swerdlow, N. R., & Light, G. A. (2020). Memantine Effects on Electroencephalographic Measures of Putative Excitatory/Inhibitory Balance in Schizophrenia. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 5(6), 562–568. https://doi.org/10.1016/j.bpsc.2020.02.004

      Muthukumaraswamy, S. D., & Liley, D. T. (2018). 1/F electrophysiological spectra in resting and drug-induced states can be explained by the dynamics of multiple oscillatory relaxation processes. NeuroImage, 179(November 2017), 582–595. https://doi.org/10.1016/j.neuroimage.2018.06.068

      Ossandón, J. P., Stange, L., Gudi-Mindermann, H., Rimmele, J. M., Sourav, S., Bottari, D., Kekunnaya, R., & Röder, B. (2023). The development of oscillatory and aperiodic resting state activity is linked to a sensitive period in humans. NeuroImage, 275, 120171. https://doi.org/10.1016/J.NEUROIMAGE.2023.120171

      Ostlund, B. D., Alperin, B. R., Drew, T., & Karalunas, S. L. (2021). Behavioral and cognitive correlates of the aperiodic (1/f-like) exponent of the EEG power spectrum in adolescents with and without ADHD. Developmental Cognitive Neuroscience, 48, 100931. https://doi.org/10.1016/j.dcn.2021.100931

      Pant, R., Ossandón, J., Stange, L., Shareef, I., Kekunnaya, R., & Röder, B. (2023). Stimulus-evoked and resting-state alpha oscillations show a linked dependence on patterned visual experience for development. NeuroImage: Clinical, 103375. https://doi.org/10.1016/J.NICL.2023.103375

      Schaworonkow, N., & Voytek, B. (2021). Longitudinal changes in aperiodic and periodic activity in electrophysiological recordings in the first seven months of life. Developmental Cognitive Neuroscience, 47. https://doi.org/10.1016/j.dcn.2020.100895

      Schwenk, J. C. B., VanRullen, R., & Bremmer, F. (2020). Dynamics of Visual Perceptual Echoes Following Short-Term Visual Deprivation. Cerebral Cortex Communications, 1(1). https://doi.org/10.1093/TEXCOM/TGAA012

      Tanner, D., Norton, J. J. S., Morgan-Short, K., & Luck, S. J. (2016). On high-pass filter artifacts (they’re real) and baseline correction (it’s a good idea) in ERP/ERMF analysis. Journal of Neuroscience Methods, 266, 166–170. https://doi.org/10.1016/j.jneumeth.2016.01.002

      Vanrullen, R., & MacDonald, J. S. P. (2012). Perceptual echoes at 10 Hz in the human brain. Current Biology. https://doi.org/10.1016/j.cub.2012.03.050

      Voytek, B., Kramer, M. A., Case, J., Lepage, K. Q., Tempesta, Z. R., Knight, R. T., & Gazzaley, A. (2015). Age-related changes in 1/f neural electrophysiological noise. Journal of Neuroscience, 35(38). https://doi.org/10.1523/JNEUROSCI.2332-14.2015

      Waschke, L., Wöstmann, M., & Obleser, J. (2017). States and traits of neural irregularity in the age-varying human brain. Scientific Reports 2017 7:1, 7(1), 1–12. https://doi.org/10.1038/s41598-017-17766-4

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      For many years, there has been extensive electrophysiological research investigating the relationship between local field potential patterns and individual cell spike patterns in the hippocampus. In this study, using state-ofthe-art imaging techniques, they examined spike synchrony of hippocampal cells during locomotion and immobility states. In contrast to conventional understanding of the hippocampus, the authors demonstrated that hippocampal place cells exhibit prominent synchronous spikes locked to theta oscillations.

      Strengths:

      The voltage imaging used in this study is a highly novel method that allows recording not only suprathreshold-level spikes but also subthreshold-level activity. With its high frame rate, it offers time resolution comparable to electrophysiological recordings.

      We thank the reviewer for a thorough review of our manuscript and for recognizing the strength of our study.

      Reviewer #2 (Public review):

      Summary:

      This study employed voltage imaging in the CA1 region of the mouse hippocampus during the exploration of a novel environment. The authors report synchronous activity, involving almost half of the imaged neurons, occurred during periods of immobility. These events did not correlate with SWRs, but instead, occurred during theta oscillations and were phased locked to the trough of theta. Moreover, pairs of neurons with high synchronization tended to display non-overlapping place fields, leading the authors to suggest these events may play a role in binding a distributed representation of the context.

      Strengths:

      Technically this is an impressive study, using an emerging approach that allow single-cell resolution voltage imaging in animals, that while head-fixed, can move through a real environment. The paper is written clearly and suggests novel observations about population-level activity in CA1.

      We thank the reviewer for a thorough review of our manuscript and for recognizing the strength of our study.

      Weaknesses:

      The evidence provided is weak, with the authors making surprising population-level claims based on a very sparse data set (5 data sets, each with less than 20 neurons simultaneously recorded) acquired with exciting, but less tested technology. Further, while the authors link these observations to the novelty of the context, both in the title and text, they do not include data from subsequent visits to support this. Detailed comments are below:

      (1) My first question for the authors, which is not addressed in the discussion, is why these events have not been observed in the countless extracellular recording experiments conducted in rodent CA1 during exploration of novel environments. Those data sets often have 10x the neurons simultaneously recording compared to these present data, thus the highly synchronous firing should be very hard to miss. Ideally, the authors could confirm their claims via the analysis of publicly available electrophysiology data sets. Further, the claim of high extra-SWR synchrony is complicated by the observation that their recorded neurons fail to spike during the limited number of SWRs recorded during behavior- again, not agreeing with much of the previous electrophysiological recordings.

      (2) The authors posit that these events are linked to the novelty of the context, both in the text, as well as in the title and abstract. However they do not include any imaging data from subsequent days to demonstrate the failure to see this synchrony in a familiar environment. If these data are available it would strengthen the proposed link to novelty is they were included.

      (3) In the discussion the authors begin by speculating the theta present during these synchronous events may be slower type II or attentional theta. This can be supported by demonstrating a frequency shift in the theta recording during these events/immobility versus the theta recording during movement. (4) The authors mention in the discussion that they image deep layer PCs in CA1, however this is not mentioned in the text or methods. They should include data, such as imaging of a slice of a brain post-recording with immunohistochemistry for a layer specific gene to support this.

      Comments on revisions:

      I have no further major requests and thank the authors for the additional data and analyses.

      We thank the reviewer for recognizing our efforts in revising the manuscript.

      Reviewer #3 (Public review):

      Summary:

      In the present manuscript, the authors use a few minutes of voltage imaging of CA1 pyramidal cells in head-fixed mice running on a track while local field potentials (LFPs) are recorded. The authors suggest that synchronous ensembles of neurons are differentially associated with different types of LFP patterns, theta and ripples. The experiments are flawed in that the LFP is not "local" but rather collected the other side of the brain.

      Strengths:

      The authors use a cutting-edge technique.

      We thank the reviewer for a thoughtful review of our manuscript and for pointing out the technical strength of our study.

      Weaknesses:

      The two main messages of the manuscript indicated in the title are not supported by the data. The title gives two messages that relate to CA1 pyramidal neurons in behaving head-fixed mice: (1) synchronous ensembles are associated with theta (2) synchronous ensembles are not associated with ripples. The main problem with the work is that the theta and ripple signals were recorded using electrophysiology from the opposite hemisphere to the one in which the spiking was monitored. However, both rhythms exhibit profound differences as a function of location.

      Theta phase changes with the precise location along the proximo-distal and dorso-ventral axes, and importantly, even reverses with depth. Because the LFP was recorded using a single-contact tungsten electrode, there is no way to know whether the electrode was exactly in the CA1 pyramidal cell layer, or in the CA1 oriens, CA1 radiatum, or perhaps even CA3 - which exhibits ripples and theta which are weakly correlated and in anti-phase with the CA1 rhythms, respectively. Thus, there is no way to know whether the theta phase used in the analysis is the phase of the local CA1 theta.

      Although the occurrence of CA1 ripples is often correlated across parts of the hippocampus, ripples are inherently a locally-generated rhythm. Independent ripples occur within a fraction of a millimeter within the same hemisphere. Ripples are also very sensitive to the precise depth - 100 micrometers up or down, and only a positive deflection/sharp wave is evident. Thus, even if the LFP was recorded from the center of the CA1 pyramidal layer in the contralateral hemisphere, it would not suffice for the claim made in the title.

      We thank the reviewer for pointing out the issue regarding the claim made in the title. We have revised the manuscript to clarify that the theta and ripple oscillations referenced in the title refer to specific frequency bands of intracellular and contralaterally recorded field potentials rather than field potentials recorded at the same site as the neuronal activity.

      Abstract (line19):

      “… Notably, these synchronous ensembles were not associated with contralateral ripple oscillations but were instead phase-locked to theta waves recorded in the contralateral CA1 region. Moreover, the subthreshold membrane potentials of neurons exhibited coherent intracellular theta oscillations with a depolarizing peak at the moment of synchrony.”

      Introduction (line68):

      “… Surprisingly, these synchronous ensembles occurred outside of contralateral ripples and were phase-locked to intracellular theta oscillations as well as extracellular theta oscillations recorded from the contralateral CA1 region.”

      To address concerns about electrode placement, we have now included posthoc histological verification of electrode locations, confirming that they were positioned in the contralateral CA1 pyramidal layer (Author response image 1). 

      Author response image 1.

      Post-hoc histological section showing the location of a DiI-coated electrode in the contralateral CA1 pyramidal layer. Scale bar: 300 μm.

      While we appreciate that theta and ripple oscillations exhibit regional variations in phase and amplitude, previous studies have demonstrated a strong co-occurrence and synchrony of these oscillations between both hippocampi1-3. Given that our primary objective was to examine how neuronal ensembles relate to large-scale hippocampal oscillation states rather than local microcircuit-level fluctuations, we recorded theta and ripple oscillations from the contralateral CA1 region.

      However, we acknowledge that contralateral recordings do not capture all ipsilateral-specific dynamics. Theta phases vary with depth and precise location, and local ripple events may be independently generated across small spatial scales. To reflect this, we have now explicitly acknowledged these considerations in the discussion. 

      Discussion (line527):

      While contralateral LFP recordings reliably capture large-scale hippocampal theta and ripple oscillations, they may not fully account for ipsilateral-specific dynamics, such as variations in theta phase alignment or locally generated ripple events. Although contralateral recordings serve as a well-established proxy for large-scale hippocampal oscillatory states, incorporating simultaneous ipsilateral field potential recordings in future studies could refine our understanding of local-global network interactions. Despite these considerations, our findings provide robust evidence for the existence of synchronous neuronal ensembles and their role in coordinating newly formed place cells. These results advance our understanding of how synchronous neuronal ensembles contribute to spatial memory acquisition and hippocampal network coordination.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors have provided sufficient experimental and analytical data addressing my comments, particularly regarding consistency with past electrophysiological data and the exclusion of potential imaging artifacts.

      We thank the reviewer for recognizing our efforts in revising the manuscript.

      Minor comment: In Figure 2C and Figure 5-figure supplement 1, 'paired Student's t-test' is not entirely appropriate. More precisely, either 'paired t-test' or 'Student's t-test' would better indicate the correct statistical method. Please verify whether these data comparisons are within-group or between-group.

      Thank you for the comment. We have revised the manuscript as suggested.

      Reviewer #2 (Recommendations for the authors):

      I have no further major requests and thank the authors for the additional data and analyses.

      We thank the reviewer for recognizing our efforts in revising the manuscript.

      Minor points- line 169- typo, correct grant to grand

      Thank you for pointing it out. The typo has been corrected.

      (1) Buzsaki, G. et al. Hippocampal network patterns of activity in the mouse. Neuroscience 116, 201-211 (2003). https://doi.org:10.1016/s03064522(02)00669-3

      (2) Szabo, G. G. et al. Ripple-selective GABAergic projection cells in the hippocampus. Neuron 110, 1959-1977 e1959 (2022). https://doi.org:10.1016/j.neuron.2022.04.002

      (3) Huang, Y. C. et al. Dynamic assemblies of parvalbumin interneurons in brain oscillations. Neuron 112, 2600-2613 e2605 (2024). https://doi.org:10.1016/j.neuron.2024.05.015

    1. Author response:

      The following is the authors’ response to the previous reviews

      Response to the reviewer #2 (Public review):

      We greatly appreciate the reviewer’s high evaluation of our paper and helpful comments and suggestions.

      Regarding in vivo Treg homing assay, we did not exclude doublets and dead cells from the analysis of Kaede-expressing Tregs migrated to the aorta, which may affect the results. We described this issue as the limitation of this study in the revised manuscript. Nonetheless, we believe the reliability of our findings because we repeated this experiment three times and obtained similar results.

      There is no evidence to support the clinical relevance of our findings. Future clinical research on this topic is highly desired.

      Response to the reviewer #3 (Public review):

      We greatly appreciate the reviewer’s high evaluation of our paper and helpful comments and suggestions.

      Despite the controversial role of Th17 cells in atherosclerosis, we understand the possible involvement of Th17 cells and the Th1 cell/Th17 cell balance in lymphoid tissues and aortic lesions in accelerated inflammation and atherosclerosis in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice. Although we could not completely evaluate the changes in these immune responses in detail, future study may elucidate interesting mechanisms mediated by Th17 cell responses.

      As the reviewer suggested, we understand that it is necessary to provide in vivo evidence for the Treg suppressive effects on DC activation. Based on the results of in vitro experiments, we described the discussion on the in vivo evidence in the revised manuscript.

      We understand methodological limitations for flow cytometric analysis of immune cells in the aorta and in vivo Treg homing assay. We described this issue as the limitation of this study in the revised manuscript. Regarding in vivo Treg homing assay, we statistically re-analyzed the combined data from multiple experiments and observed a tendency toward reduction in the proportion of CCR4-deficient Kaede-expressing Tregs in the aorta of recipient Apoe<sup>-/-</sup> mice, though there was no statistically significant difference in the migratory capacity of CCR4-intact or CCR4-deficient Kaede-expressing Tregs. Accordingly, we toned down our claim that CCR4 expression on Tregs plays a critical role in mediating Treg migration to the atherosclerotic aorta under hypercholesterolemia.

      The reviewer requested us to evaluate aortic inflammation in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice injected with CCR4-intact or CCR4-deficient Tregs. However, we think that this experiment will provide marginal information because Treg transfer experiments in Apoe<sup>-/-</sup> mice have already shown the protective role of CCR4 in Tregs against aortic inflammation and early atherosclerosis.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) #1 and #2: CD103 and CD86 expression should be discussed on the text and not only in the response to reviewer.

      In accordance with the reviewer’s suggestion, we added a discussion on the downregulated CD103 expression in peripheral LN Tregs and upregulated CD86 expression on DCs in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice in the discussion section in the revised manuscript.

      (2) #5: Authors response is not satisfactory. No gate percentage is shown. As it currently is, the difference in the number of cells shown in the figure could be due to differences in events recorded. Furthermore, the gate strategy is not thorough. Considering the very low frequency of Kaede + cells detected, it is crucial to properly exclude doublets and dead cells.

      Authors reported a dramatic difference in Kaede + Tregs cells in the aorta across experiments. This could be addressed by normalization followed by appropriate statistical analysis (One sample t-test).

      The data shown is not strong enough to conclude that there is a reduced migration to the aorta.

      We understand the importance of reviewer’s suggestion. We described the percentage of Kaede+ Tregs in the aorta of Apoe<sup>-/-</sup> mice receiving transfer of Kaede-expressing CCR4-intact or CCR4-deficient Tregs in Figure 5I.

      As the reviewer pointed out, we understand that it would be important to properly exclude doublets and dead cells in in vivo Treg homing assay. However, it is difficult for us to resolve this issue because we need to perform the same experiments again which will require a great number of additional mice and substantial amount of time. We deeply regret that these important experimental procedures were not performed. We described this issue as the limitation of this study.

      In accordance with the reviewer’s suggestion, we re-analyzed the combined data from multiple experiments using one-sample t-test. We observed a tendency toward reduction in the proportion of CCR4-deficient Kaede-expressing Tregs in the aorta of recipient Apoe<sup>-/-</sup> mice, though there was no statistically significant difference in the migratory capacity of CCR4-intact or CCR4-deficient Kaede-expressing Tregs. By modifying the corresponding descriptions in the manuscript, we toned down our claim that CCR4 expression on Tregs plays a critical role in mediating Treg migration to the atherosclerotic aorta under hypercholesterolemia.

      (3) #8: There are still several not shown data

      In accordance with the reviewer’s suggestion, we showed the data on the responses of Tregs and effector memory T cells in 8-week-old wild-type or Ccr4<sup>-/-</sup> mice and Ccr4 mRNA expression in Tregs and non-Tregs from Apoe<sup>-/-</sup> or Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice in Supplementary Figures 4 and 7.

      Reviewer #3 (Recommendations for the authors):

      (1) Issue 1. For future studies, I recommend not omitting viability controls during cell staining. Removal of dead cells and doublets should always be included during the gating strategy to avoid undesirable artefacts, especially when analysing less-represented cell populations. According to your previous report (ref #40), I agree that isotype controls were unnecessary using the same staining protocol. FMO controls should always be included in flow cytometry analysis (not mentioned in the methodology description and ref#40).

      As the reviewer suggested, we understand that it would be important to properly exclude dead cells and doublets and to prepare FMO controls in flow cytometric analysis. We deeply regret that these important experimental procedures were not performed. We described this issue as the limitation of this study.

      (2) Issue 3. Although Th17's role in atherosclerosis remains controversial, the data obtained in this work could provide valuable insights if discussed appropriately. As noted in my public review, I found it noteworthy that ROR γ t+ cells represented around 13% of effector TCD45+CD3+CD4+ lymphocytes in the aorta of Apoe<sup>-/-</sup> mice while Th1 less than 5% (Fig 4H and F, respectively). I recognise that differences in cell staining sensibility and robustness for different transcription factors may influence these percentages. However, analysing how CCR4 deficiency influences the Th1/TI h17 balance would yield interesting data, similar to what was done for the Th1/Treg ratio.

      Considering the higher proportion of Th17 cells than Th1 or Th2 cells in atherosclerotic aorta, we understand the importance of reviewer’s suggestion. However, we could not evaluate the effect of CCR4 deficiency on the Th1/Th17 balance in aorta because we did not perform flow cytometric analysis of aortic Th1 and Th17 cells in the same mice. Meanwhile, we could examine the Th1/Th17 balance in peripheral lymphoid tissues by flow cytometry. We found a significant increase in the Th1/Th17 ratio in the peripheral LNs of Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice, while there were no changes in its ratio in the spleen or para-aortic LNs of these mice, which limits the contribution of the Th1/Th17 balance to exacerbated atherosclerosis. We showed these data below.

      Author response image 1.

      (3) Issue 4. I appreciate the authors for sharing data on the flow cytometry analysis of Tregs in para-aortic LNs of Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup> Apoe<sup>-/-</sup> mice, which would have been included as a Supplementary figure. These results reinforce the notion that Treg dysfunction in CCR4-deficient mice may not be due to the downregulation of regulatory cell surface receptors.

      We showed the data on the expression of CTLA-4, CD103, and PD1 in Tregs in the para-aortic LNs of Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice in Supplementary Figure 8.

      (4) Issue 5. I agree that CD4+ T cell responses are substantially regulated by DCs. While CD80 and CD86 on DC primarily serve as costimulatory signals for T-cell activation, cytokines secreted by DCs are primordial signals for determining the differentiation phenotype of effector Th cells. Since the analysis of DC phenotype in lymphoid tissues of Apoe<sup>-/-</sup> and Ccr4<sup>-/-</sup> Apoe<sup>-/-</sup> mice could not be addressed in this study, it is not possible to differentiate which processes may be mainly affected by CCR4-deficiency during CD4+ T cell activation. In this scenario, and considering in vitro studies, the results suggest a possible role of CCR4 in controlling the extent of activation of CD4+T cells rather than shifting the CD4+T cell differentiation profile in peripheral lymphoid tissues, where a predominant Th1 profile was already established in Apoe<sup>-/-</sup> mice. Therefore, I advise caution when concluding about shifts in CD4+ T cell responses.

      We thank the reviewer for providing us thoughtful comments. As the reviewer pointed out, we understand that we should carefully interpret the mechanisms for the shift of CD4+ T cell responses by CCR4 deficiency.

      (5) Regarding migration studies in the revised manuscript. I fully understand that Treg transference assays are challenging. The results do not suggest that CCR4 was critical for Treg migration to lymphoid tissues in the conditions assayed. Concerning migration to the aorta, I found the results inconclusive since the authors mention that: i) there was a dramatic difference in the absolute numbers of Kaede-expressing Tregs that migrated to the aorta impairing statistical analysis; ii) the number of Kaede-expressing Tregs that migrated to the aorta was extremely low; iii) dead cells and doublets were not removed in the flow cytometry analysis. In this context, I do not agree with the following statements and recommend revising them:

      - "CCR4 deficiency in Tregs impaired their migration to the atherosclerotic aorta" (lines 36-7),

      - "…we found a significant reduction in the proportion of CCR4 deficient Kaede-expressing Tregs in the aorta of recipient Apoe<sup>-/-</sup> mice" (lines 356-7),

      - "CCR4 expression on Tregs regulates the development of early atherosclerosis by....... mediating Treg migration to the atherosclerotic aorta" (lines 409-411),

      - "…we found that CCR4 expression on Tregs is critical for regulating atherosclerosis by mediating their migration to the atherosclerotic aorta" (lines 437-438),

      - "CCR4 protects against early atherosclerosis by mediating Treg migration to the aorta.... (lines 464-465),

      - "We showed that CCR4 expression on Tregs is critical for ...... mediating Treg migration to the atherosclerotic aorta" (503-505).

      We understand the importance of the reviewer’s suggestion. We described this issue as the limitation of this study. In accordance with the reviewer’s suggestion, we modified the above descriptions and toned down our claim that CCR4 expression on Tregs plays a critical role in mediating Treg migration to the atherosclerotic aorta under hypercholesterolemia.

      (6) Line 206: Mention the increased expression of CD86 by DCs

      We mentioned this result in the revised manuscript. We also added a discussion on the upregulated CD86 expression on DCs in Ccr4<sup>-/-</sup>Apoe<sup>-/-</sup> mice in the discussion section in the revised manuscript.

      (7) Lines 304-305. According to Fig 4F-H, a selective accumulation of Th1 cells seems to have occurred only in the aorta, coinciding with a higher Th1/Treg ratio. No selective accumulation of Th1 cells was observed in para-aortic lymph nodes. These results could be clarified.

      We modified the above description in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews

      Reviewer #1 (Public Review):

      Comment: The fact that there are Arid1a transcripts that escape the Cre system in the Arid1a KO mouse model might difficult the interpretation of the data. The phenotype of the Arid1a knockout is probably masked by the fact that many of the sequencing techniques used here are done on a heterogeneous population of knockout and wild type spermatocytes. In relation to this, I think that the use of the term "pachytene arrest" might be overstated, since this is not the phenotype truly observed. Knockout mice produce sperm, and probably litters, although a full description of the subfertility phenotype is lacking, along with identification of the stage at which cell death is happening by detection of apoptosis.

      Response: As the reviewer indicates, we did not observe a complete arrest at Pachynema. In fact, the histology shows the presence of spermatids and sperm in seminiferous tubules and epididymides (Fig. Sup. 3). However, our data argue that the wild-type haploid gametes produced were derived from spermatocyte precursors that have likely escaped Cre mediated activity (Fig. Sup. 4). Furthermore, diplotene and metaphase-I spermatocytes lacking ARID1A protein by IF were undetectable in the Arid1acKO testes (Fig. S4B). Therefore, although we do not demonstrate a strict pachytene arrest, it is reasonable to conclude that ARID1A is necessary to progress beyond pachynema. We have revised the manuscript to reflect this point (Abstract lines 17,18; Results lines 153,154)

      Comment: It is clear from this work that ARID1a is part of the protein network that contributes to silencing of the sex chromosomes. However, it is challenging to understand the timing of the role of ARID1a in the context of the well-known DDR pathways that have been described for MSCI.

      Response: With respect to the comment on the lack of clarity as to which stage of meiosis we observe cell death, our data do suggest that it is reasonable to conclude that mutant spermatocytes (ARID1A-) undergo cell death at pachynema given their inability to execute MSCI, which is a well-established phenotype.

      Comment: Staining of chromosome spreads with Arid1a antibody showed localization at the sex chromosomes by diplonema; however, analysis of gene expression in Arid1a KO was performed on pachytene spermatocytes. Therefore, is not very clear how the chromatin remodeling activity of Arid1a in diplonema is affecting gene expression of a previous stage. CUTnRUN showed that ARID1a is present at the sex chromatin in earlier stages, leading to hypothesize that immunofluorescence with ARID1a antibody might not reflect ARID1a real localization.

      Response: It is unclear what the reviewer means about not understanding how ARID1A activity at diplonema affects gene expression at earlier stages. Our interpretations were not based solely on the observation of ARID1A associations with the XY body at diplonema. In fact, mRNA expression and CUT&RUN analyses were performed on pachytene-enriched populations. ARID1A's association with the XY body is not exclusive to diplonema. Based on both CUT&RUN and IF data, ARID1A associates with XY chromatin as early as pachynema. Only at late diplonema did we observe ARID1A hyperaccumulation on the XY body by IF.

      Reviewer #2 (Public Review):

      Comment: The inefficient deletion of ARID1A in this mouse model does not allow any detailed analysis in a quantitative manner.

      Response: As explained in our response to these comments in the first revision, we respectfully disagree with this reviewer’s conclusions. We have been quantitative by co-staining for ARID1A, ensuring that we can score mutant pachytene spermatocytes from escapers. Additionally, we provide data to show the efficiency of ARID1A loss in the purified pachytene populations sampled in our genomic assays.

      Reviewer #3 (Public Review):

      Comment: The data demonstrate that the mutant cells fail to progress past pachytene, although it is unclear whether this specifically reflects pachytene arrest, as accumulation in other stages of Prophase also is suggested by the data in Table 1. The western blot showing ARID1A expression in WT vs. cKO spermatocytes (Fig. S2) is supportive of the cKO model but raises some questions. The blot shows many bands that are at lower intensity in the cKO, at MWs from 100-250kDa. The text and accompanying figure legend have limited information. Are the various bands with reduced expression different isoforms of ARID1A, or something else? What is the loading control 'NCL'? How was quantification done given the variation in signal across a large range of MWs?

      Response: The loading control is Nucleolin. With respect to the other bands in the range of 100-250 kDa, it is difficult to say whether they represent ARID1A isoforms. The Uniprot entry for Mouse ARID1A only indicates a large mol. wt sequence of ~242 kDa; therefore, the band corresponding to that size was quantified. There is no evidence to suggest that lower molecular weight isoforms may be translated. Although speculative, it is possible that the lower molecular weight bands represent proteolytic/proteasomal degradation products or products of antibody non-specificity. These points are addressed in the revised manuscript (Legend to Fig S2, lines 926-931). Blots were scanned on a LI-COR Odyssey CLx imager and viewed and quantified using Image Studio Version 5.2.5 (Methods, lines 640-642).

      Comment: An additional weakness relates to how the authors describe the relationship between ARID1A and DNA damage response (DDR) signaling. The authors don't see defects in a few DDR markers in ARID1A CKO cells (including a low-resolution assessment of ATR), suggesting that ARID1A may not be required for meiotic DDR signaling. However, as previously noted the data do not rule out the possibility that ARID1A is downstream of DDR signaling and the authors even indicate that "it is reasonable to hypothesize that DDR signaling might recruit BAF-A to the sex chromosomes (lines 509-510)." It therefore is difficult to understand why the authors continue to state that "...the mechanisms underlying ARID1A-mediated repression of the sex-linked transcription are mutually exclusive to DDR pathways regulating sex body formation" (p. 8) and that "BAF-A-mediated transcriptional repression of the sex chromosomes occurs independently of DDR signaling" (p. 16). The data provided do not justify these conclusions, as a role for DDR signaling upstream of ARID1A would mean that these mechanisms are not mutually exclusive or independent of one another.

      Response: The reviewer’s argument is reasonable, and we have made the recommended changes (Results, lines 212-215; Discussion, lines 499-500).

      Comment: A final comment relates to the impacts of ARID1A loss on DMC1 focus formation and the interesting observation of reduced sex chromosome association by DMC1. The authors additionally assess the related recombinase RAD51 and suggest that it is unaffected by ARID1A loss. However, only a single image of RAD51 staining in the cKO is provided (Fig. S11) and there are no associated quantitative data provided. The data are suggestive but it would be appropriate to add a qualifier to the conclusion regarding RAD51 in the discussion which states that "...loss of ARID1a decreases DMC1 foci on the XY chromosomes without affecting RAD51" given that the provided RAD51 data are not rigorous. In the long-term it also would be interesting to quantitatively examine DMC1 and RAD51 focus formation on autosomes as well.

      Response: We agree with the reviewer’s comment and have made the recommended changes (Discussion, lines 518-519).

      Response to non-public recommendations

      Reviewer 2:

      Comment: Meiotic arrest is usually judged based on testicular phenotypes. If mutant testes do not have any haploid spermatids, we can conclude that meiotic arrest is a phenotype. In this case, mutant testes have haploid spermatids and are fertile. The authors cannot conclude meiotic arrest. The mutant cells appear to undergo cell death in the pachytene stage, but the authors cannot say "meiotic arrest."

      Response: We disagree with this comment. By IF, we see that ~70% of the spermatocytes have deleted ARID1A. Furthermore, we never observed diplotene spermatocytes that lacked ARID1A. The conclusion that the absence of ARID1A results in a pachynema arrest and that the escapers produce the haploid spermatids is firm.

      Comment: Fig. S2 and S3 have wrong figure legends.

      Response: The figure legends for Fig. S2 and S3 are correct.

      Comment: The authors do not appear to evaluate independent mice for scoring (the result is about 74% deletion above, Table S1). Sup S2: how many independent mice did the authors examine?

      Response:These were Sta-Put purified fractions obtained from 14-15 WT and mutant mice. It is difficult to isolate pachytene spermatocytes by Sta-Put at the required purity in sufficient yields using one mouse at a time. We used three technical replicates to quantify the band intensity, and the error bars represent the standard error of the mean (S.E.M) of the band intensity.

      Comment: Comparison of cKO and wild-type littermate yielded nearly identical results (Avg total conc WT = 32.65 M/m; Avg total conc cKO = 32.06 M/ml)". This sounds like a negative result (i.e., no difference between WT and cKO).

      Response: This is correct. There is no difference between Arid1aWT and Arid1aCKO sperm production. This is because wild-type haploid gametes produced were derived from spermatocyte precursors that have escaped Cre-mediated activity (Fig. S4). These data merely serve to highlight an inherent caveat of our conditional knockout model and are not intended to support the main conclusion that ARID1A is necessary for pachytene progression.

      Comment: The authors now admit ~ 70 % efficiency in deletion, and the authors did not show the purity of these samples. If the purity of pachytene spermatocytes is ~ 80%, the real proportion of mutant cells can be ~ 56%. It is very difficult to interpret the data.

      Response: The original submission did refer to inefficient Cre-induced recombination. The reviewer asked for the % efficiency, which was provided in the revised version. Also, please refer to Fig. S2, where Western blot analysis demonstrates a significant loss of ARID1A protein levels in CKO relative to WT pachytene spermatocyte populations that were used for CUT&RUN data generation.

      Comment: The authors should not use the other study to justify their own data. The H3.3 ChIP-seq data in the NAR paper detected clear peaks on autosomes. However, in this study, as shown in Fig. S7A, the authors detected only 4 peaks on autosomes based on MACS2 peak calling. This must be a failed experiment. Also, S7A appears to have labeling errors.

      Response: I believe the reviewer is referring to supplementary figure 8A. Here, it is not clear which labeling errors the reviewer is referring to. In the wild type, the identified peaks were overwhelmingly sex-linked intergenic sites. This is consistent with the fact that H3.3 is hyper-accumulated on the sex chromosomes at pachynema.

      The authors of the NAR paper did not perform a peak-calling analysis using MACS2 or any other peak-calling algorithm. They merely compared the coverage of H3.3 relative to input. Therefore, it is not clear on what basis the reviewer says that the NAR paper identified autosomal peaks. Their H3.3 signal appears widely distributed over a 6 kb window centered at the TSS of autosomal genes, which, compared to input, appears enriched. Our data clearly demonstrates a less noisy and narrower window of H3.3 enrichment at autosomal TSSs in WT pachytene spermatocytes, albeit at levels lower than that seen in CKO pachytene spermatocytes (Fig S8B and see data copied below for each individual replicate). Moreover, the lack of peaks does not mean that there was an absence of H3.3 at these autosomal TSSs (Supp. Fig. S8B). Therefore, we disagree with the reviewer’s comment that the H3.3 CUT&RUN was a failed experiment.

      Author response image 1.

      H3.3 Occupancy at genes mis-regulated in the absence of ARID1A

      Comment: If the author wishes to study the function of ARID2 in spermatogenesis, they may need to try other cre-lines to have more robust phenotypes, and all analyses must be redone using a mouse model with efficient deletion of ARID2.

      Response: As noted, we chose Stra8-Cre to conditionally knockout Arid1a because ARID1A is haploinsufficient during embryonic development. The lack of Cre expression in the maternal germline allows for transmission of the floxed allele, allowing for the experiments to progress.

      Comment: The inefficient deletion of ARID1A in this mouse model does not allow any detailed analysis in a quantitative manner.

      Response: In many experiments, we have been quantitative when possible by co-staining for ARID1A, ensuring that we can score mutant pachytene spermatocytes from escapers. Additionally, we provide data to show the efficiency of ARID1A loss in the purified pachytene populations sampled in our genomic assays.

      Reviewer 3:

      Comment: The Methods section refers to antibodies as being in Supplementary Table 3, but the table is labeled as Supplementary Table 2.

      Response: This has been corrected

    2. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer 1

      Comment 1: It is worth mentioning that the authors show that there are Arid1a transcripts that escape the Cre system. This might mask the phenotype of the Arid1a knockout, given that many sequencing techniques used here are done on a heterogeneous population of knockout and wild-type spermatocytes.

      Response: The proportions of undifferentiated spermatogonia (PLZF+) with detectable (ARID1A+) and non-detectable (ARID1A=) levels of ARID1A protein by immunostaining on testes cryosections obtained from 1-month old Arid1afl/fl (control) and Arid1acKO (CKO) males were 74% ARID1A negative (CKO) and 26% ARID1A positive (CKO) as compared to 95% ARID1A positive and 5% ARID1A negative in WT controls. The manuscript includes these data (page 5, lines 114-116). Furthermore, Western blot analysis of STA-Put purified pachytene WT and mutant spermatocytes showed significantly reduced levels of ARID1A protein in mutant cells (95% reduction). The manuscript has added these data (page 5, line 116 and Fig. S2).

      Comment 2: In relation to this, I think that the use of the term "pachytene arrest" might be overstated, since this is not the phenotype truly observed (these mice produce sperm).

      Response: Based on the profiling of prophase-I spermatocytes by co-staining for SYCP3 and ARID1A, we observed a marked reduction in mid-late pachytene spermatocytes that lacked ARID1A, indicating a failure to progress beyond pachynema in the absence of ARID1A (Table 1 in manuscript). Furthermore, we were unable to detect diplotene spermatocytes lacking ARID1A protein. Haploid spermatid populations isolated from Arid1acKO males appeared normal, expressing the wild-type allele, suggesting that they originated from spermatocytes that failed to undergo efficient Cre recombination (Fig. S3). Arid1acKO also produces viable sperm at a level equal to their wild-type controls (see page 5, lines 123-126). It is reasonable to conclude that the absence of ARID1A results in a pachynema arrest and that the viable sperm are from escapers. We cannot make any conclusions regarding the requirement of ARID1A for progression beyond pachynema.

      Comment 3: ARID1A is present throughout prophase I, and it might have pre-MSCI roles that impact earlier stages of Meiosis I, and cell death might be happening in these earlier stages too.

      Response: We did not observe an effect on the frequency of leptotene and zygotene spermatocytes lacking ARID1A. There appeared to be an accumulation of these prophase-I populations in response to the loss of ARID1A, consistent with a failure in progression beyond pachynema in the mutants (Table 1 in the manuscript).

      Additionally, we did not detect any significant difference in the numbers of undifferentiated spermatogonia expressing PLZF (also known as ZBTB16) in 1-month-old Arid1acKO relative to Arid1afl/fl males (see Table below, now included in the manuscript as supplemental Table 1). Therefore, the Arid1a conditional knockouts generated with a Stra8-Cre did not appear to impact earlier stages of spermatogenesis. However, potential roles of ARID1A early in spermatogenesis might be revealed using a more efficient and earlier-acting germline Cre transgene. In this case, an inducible Cre transgene would be needed, given the haploinsufficiency associated with Arid1a. Such haploinsufficiency was why we used the Stra8-Cre. The lack of Cre expression in the female germline allowed the transmission of the floxed allele maternally.

      Author response table 1.

      Comment 4: Overall, the research presented here is solid, adds new knowledge on how sex chromatin is silenced during meiosis, and has generated relevant databases for the field.

      Response: We thank the reviewer for this comment.

      Reviewer 2

      Comment 1: The conditional deletion mouse model of ARIDA using Stra8-cre showed inefficient deletion; spermatogenesis did not appear to be severely compromised in the mutants. Using this data, the authors claimed that meiotic arrest occurs in the mutants. This is obviously a misinterpretation.

      Response: As stated in response to Reviewer 1, testes cryosections obtained from 1-month-old control and mutant males showed that 74% are ARID1A negative (CKO) and 26% ARID1A positive (CKO) as compared to 95% ARID1A positive and 5% ARID1A negative in WT controls (page 5, lines 114-116). This difference is dramatic. Western blot analysis of STA-Put purified pachytene WT and mutant spermatocytes also showed a significant reduction of ARID1A protein in mutant cells (Fig. S2). We observed a marked decrease in mid-late pachytene spermatocytes that lacked ARID1A, indicating a failure to progress beyond pachynema without ARID1A (Table 1 from the manuscript). Furthermore, we were unable to detect any diplotene spermatocytes lacking ARID1A protein. These data suggest that the haploid spermatids originated from spermatocytes that failed to undergo efficient Cre recombination (Fig. S3). Comparison of cKO and wild-type littermate yielded nearly identical results (Avg total conc WT = 32.65 M/m; Avg total conc cKO = 32.06 M/ml), indicating that the cKO’s produce viable sperm at a level equal to their wild-type controls. Taken together, the conclusion that the absence of ARID1A results in a pachynema arrest and that the escapers produce the haploid spermatids is firm. By IF, we see that ~70% of the spermatocytes have deleted ARID1A. Therefore, we disagree with the reviewer’s comments that “spermatogenesis did not appear to be severely compromised in the mutants”.

      Comment 2: In the later parts, the authors performed next-gen analyses, including ATAC-seq and H3.3 CUT&RUN, using the isolated cells from the mutant mice. However, with this inefficient deletion, most cells isolated from the mutant mice appeared not to undergo Cre-mediated recombination. Therefore, these experiments do not tell any conclusion pertinent to the Arid1a mutation.

      Response: We agree that the ATAC-seq and CUT&RUN data were derived from a mixed population of pachytene spermatocytes consisting of mutants and, to a much lesser extent, escapers. As stated, based on our previous study (Menon et al., 2021, Nat. Commun., PMID: 34772938) and additional analyses in this current work, the undifferentiated spermatogonia lacking ARID1A indicates that Stra8-Cre is ~ 70% efficient. With this efficiency, we can detect striking changes in H3.3 occupancy and chromatin accessibility in the mutants relative to wild-type spermatocytes.

      Comment 3: Furthermore, many of the later parts of this study focus on the analysis of H3.3 CUT&RUN. However, Fig. S7 clearly suggests that the H3.3 CUT&RUN experiment in the wild-type simply failed. Thus, none of the analyses using the H3.3 CUT&RUN data can be interpreted.

      Response: We would like to draw the attention of the reviewer to a recent study (Fointane et al., 2022, NAR, PMID: 35766398) where the authors observed an identical X chromosome-wide spreading of H3.3 in mouse meiotic cells by ChIP-seq. The genomic distribution matches the microscopic observation of H3.3 coating of the sex chromosomes. Therefore, in normal spermatocytes, H3.3 distribution is pervasive across the X chromosome, with very few peaks observed in intergenic regions. Additionally, we detected H3.3 enrichment at TSSs of ARID1A-regulated autosomal genes in wild-type pachytene spermatocytes, albeit reduced relative to the mutants, indicating that the H3.3 CUT&RUN worked. For these reasons, we do not agree with the reviewer’s assessment that the H3.3 CUT&RUN experiment failed in the wild type.

      Comment 4: If the author wishes to study the function of ARID2 in spermatogenesis, they may need to try other cre-lines to have more robust phenotypes, and all analyses must be redone using a mouse model with efficient deletion of ARID2.

      Response: As noted, we chose Stra8-Cre to conditionally knockout Arid1a because ARID1A is haploinsufficient during embryonic development. The lack of Cre expression in the maternal germline allows for transmission of the floxed allele, allowing for the experiments to progress.

      Reviewer 3

      Comment 1: A challenge with the author's CKO model is the incomplete efficiency of ARID1A loss, due to incomplete CRE-mediated deletion. The authors effectively work around this issue, but they don't state specifically what percentage of CKO cells lack ARID1A staining. This information should be added.

      Response: Our data indicate that Stra8-Cre is ~ 70% efficient. This information has been added.

      Comment 2: They refer to cells that retain ARID1A staining in CKO testes as 'internal controls' but this reviewer finds that label inappropriate.

      Response: We have dropped ‘internal controls’ and used ‘escapers’ instead.

      Comment 3: Although some cells that retain ARID1A won't have undergone CRE-mediated excision, others may have excised but possibly have delayed kinetics of deletion or ARID1A RNA/protein turnover and loss. Such cells likely have partial ARID1A depletion to different extents and, therefore, in some cases, are no longer wild-type. In subsequent figures in which co-staining for ARID1A is done, it would be appropriate for the authors to specify if they are quantifying all cells from CKO testes, or only those that lack ARID1A staining.

      Response: We were unable to detect any diplotene spermatocytes lacking ARID1A protein. The data suggest that the haploid spermatids originated from spermatocytes that failed to undergo efficient Cre recombination (Fig. S3). Thus, we conclude that the absence of ARID1A results in a pachynema arrest and that the escapers produce haploid spermatids. In figures displaying quantification data, we indicate whether the quantification was performed on spermatocytes lacking or containing ARID1A from cKO testes. By IF, we see that ~70% of the spermatocytes have deleted ARID1A.

      Comment 4: The authors don't see defects in a few DDR markers in ARID1A CKO cells and conclude that the role of ARID1A in silencing is 'mutually exclusive to DDR pathways' (p 12) and 'occurs independently of DDR signaling' (p30). The data suggest that ARID1A may not be required for DDR signaling, but do not rule out the possibility that ARID1A is downstream of DDR signaling (and the authors even hypothesize this on p30). The data provided do not justify the conclusion that ARID1A acts independently of DDR signaling.

      associated DDR factors such as: H2Ax; ATR; and MDC1. We observed an abnormal persistence of elongating RNA polymerase II on the mutant XY body in response to the loss of ARID1A, emphasizing its role in the transcriptional repression of the XY during pachynema. The loss of ARID1A results in a failure to silence sex-linked genes and does so in the presence of DDR signaling factors in the XY body. As the reviewer notes, we highlighted the possibility that DDR pathways might influence ARID1A recruitment to the XY, evidenced by the hyperaccumulation of ARID1A on the sex body late in diplonema. Therefore, whether ARID1A is dependent on DDR signaling remains an open question.

      Comment 5: After observing no changes in levels or localization of H3.3 chaperones, the authors conclude that 'ARID1A impacts H3.3 accumulation on the sex chromosomes without affecting its expression or incorporation during pachynema.' It's not clear to this reviewer what the authors mean by this. Aside from the issue of not having tested DAXX or HIRA activity, are they suggesting that some other process besides altered incorporation leads to H3.3 accumulation, and if so, what process would that be?

      Response: The loss of ARID1A might result in an abnormal redistribution of DAXX or HIRA on the XY, potentially contributing to the defects in H3.3 accumulation and canonical H3.1/3.2 eviction on the XY. While speculative at this point, it is also possible that the persistence of elongating RNAPII in response to the loss of ARID1A might prevent the sex chromosome-wide coating of H3.3. Addressing the mechanism underlying ARID1A-governed H3.3 accumulation on the XY body remains a topic for future investigation.

      Comment 6: The authors find an interesting connection between certain regions that gained chromatin accessibility after ARID1A loss (clusters G1 and G3) and the presence of the PRDM9 sequence motif. The G1 and G3 clusters also show DMC1 occupancy and H3K4me3 enrichment. However, an additional cluster with gained accessibility (G4) also shows DMC1 occupancy and H3K4me3 enrichment but has modest H3.3 accumulation. The paper would benefit for additional discussion about the G4 cluster (which encompasses 960 peak calls). Is there any enrichment of PRDM9 sites in G4? If H3.3 exclusion governs meiotic DSBs, how does cluster G4 fit into the model?

      Response: We agree that, compared to G1+G3, cluster G4 shows an insignificant increase in H3.3 occupancy in the absence of ARID1A (Figure 6B). The plot profile associated with the heatmap confirms this result (Figure 6B). Therefore, cluster G4 is very distinct in its chromatin composition from G1+G3 upon the loss of ARID1A and, as such, is not inconsistent with our model of H3.3 antagonism with DSB sites. Additionally, we did not observe an enrichment of PRDM9 sites in G4. Since G4 does not display similar dynamics in H3.3 occupancy to G1+G3, DMC1 association might not be perturbed at G4 in response to the loss of ARID1A. Future studies will be required to determine the genomic associations of DMC1 and H3K4me3 in response to the loss of ARID1A.

      Comment 7: The impacts of ARID1A loss on DMC1 focus formation (reduced sex chromosome association) are very interesting and also raise additional questions. Are DMC1 foci on autosomes also affected during pachynema? The corresponding lack of apparent effect on RAD51 implies that breaks are still made and resected, enabling RAD51 filament formation. A more thorough quantitative assessment of RAD51 focus formation will be interesting in the long run, enabling determination of the number of break sites and the kinetics of repair, which the authors suggest is perturbed by ARID1A loss but doesn't directly test. It isn't clear how a nucleosomal factor (H3.3) would influence loading of recombinases onto ssDNA, especially if the alteration is not at the level of resection and ssDNA formation. Additional discussion of this point is warranted. Lastly, there currently are various notions for the interplay between RAD51 and DMC1 in filament formation and break repair, and brief discussion of this area and the implications of the new findings from the ARID1A CKO would strengthen the paper further.

      Response: The impact of H3.3 on the loading of recombinases might be an indirect consequence of ARID1A-governed sex-linked transcriptional repression. In a recent study, Alexander et al. (Nat. Commun, 2023, PMID: 36990976) showed that transcriptional activity and meiotic recombination are spatially compartmentalized during meiosis. Therefore, the persistence of elongating RNA polymerase II on a sex body depleted for H3.3 in the absence of ARID1A might contribute to the defect in DMC1 association. RAD51 and DMC1 are known to bind ssDNA at PRDM9/SPO11 designated DSB hotspots. However, these recombinases occupy unique domains. DMC1 localizes nearest the DSB breakpoint, promoting strand exchange, whereas RAD51 is further away (Hinch et al., PMID32610038). We show that loss of Arid1a decreases DMC1 foci on the XY chromosomes without affecting RAD51. These findings indicate that BAF-A plays a role in the loading and/or retention of DMC1 to the XY chromosomes. This information has been added to the discussion.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Although this manuscript contains a potentially interesting piece of work that delineates a mechanism of IQCH that associates with spermatogenesis, this reviewer feels that a number of issues require clarification and re-evaluation for a better understanding of the role of IQCH in spermatogenesis. With the shortage of logics and supporting data, causal relationships are still not clear among IQCH, CaM, and HNRPAB. The most serious point in this manuscript could be that the authors try to generalize their interpretations with too simplified model from limited pieces of their data. The way the data and the logic are presented needs to be largely revised, and several interpretations should be supported by direct evidence.

      Response: Thank you for the reviewer’s comment. IQCH is a calmodulin-binding protein, and the binding of IQCH and CaM was confirmed by LC-MS/MS analysis and co-IP assay using sperm lysate. We thus speculated that if the interaction of IQCH and CaM might be a prerequisite for IQCH function. To prove that speculation, we took HNRPAB as an example. We knocked down IQCH in cultured cells, and a decrease in the expression of HNRPAB was observed. Similarly, when we knocked down CaM in cultured cells, and a decrease in the expression of HNRPAB was also detected. However, these results cannot exclude that IQCH or CaM could regulate HNRPAB expression alone. To investigate that if IQCH or CaM could regulate HNRPAB expression alone, we overexpressed IQCH in cells that knocked down CaM, while the expression of HNRPAB cannot be rescued, suggesting that IQCH cannot regulate HNRPAB expression when CaM is reduced. In consistent, we overexpressed CaM in cells that knocked down IQCH, while the expression of HNRPAB cannot be rescued, suggesting that CaM cannot regulate HNRPAB expression when IQCH is reduced. Thus, IQCH or CaM cannot regulate HNRPAB expression alone. Moreover, we deleted the IQ motif of IQCH, which is required for binding to CaM. The co-IP results showed that the interaction of IQCH and CaM was disrupted when deleting the IQ motif of IQCH, and the expression of HNRPAB was decreased. Therefore, we suggested that the interaction of IQCH and CaM might be required for IQCH regulating HNRPAB. In future studies, we will further investigate the relationships among IQCH, CaM, and HNRPAB.

      Reviewer #3 (Public Review):

      (1) More background details are needed regarding the proteins involved, in particular IQ proteins and calmodulin. The authors state that IQ proteins are not well-represented in the literature, but do not state how many IQ proteins are encoded in the genome. They also do not provide specifics regarding which calmodulins are involved, since there are at least 5 family members in mice and humans. This information could help provide more granular details about the mechanism to the reader and help place the findings in context.

      Response: Thanks to reviewer’s suggestion. We have provided additional background information regarding IQ-containing protein family members in humans and mice, as well as other IQ-containing proteins implicated in male fertility, in the Introduction section. Furthermore, we have supplemented the Introduction with background information concerning the association between CaM and male infertility.

      (2) The mouse fertility tests could be improved with more depth and rigor. There was no data regarding copulatory plug rate; data was unclear regarding how many WT females were used for the male breeding tests and how many litters were generated; the general methodology used for the breeding tests in the Methods section was not very explicitly or clearly described; the sample size of n=3 for the male breeding tests is rather small for that type of assay; and, given that ICHQ appears to be expressed in testicular interstitial cells (Fig. S10) and somewhat in other organs (Fig. S2), another important parameter of male fertility that should be addressed is reproductive hormone levels (e.g., LH, FSH, and testosterone). While normal epididymal size in Fig. S3 suggests that hormone (testosterone) levels are normal, epididymal size and/or weight were not rigorously quantified.

      Response: Thanks to reviewer’s comment. We have provided the data regarding copulatory plug rate and the average number of litters for breeding tests in revised Figure 3—figure supplement 2. The methodology used for the breeding tests has been revised to be more detailed and explicit in the revised Method section. Moreover, we have increased the sample size for male breeding tests to n=6. We measured the serum levels of FSH, LH, and Testosterone in the WT (9.3±1.9 ng/ml, 0.93±0.15 ng/ml, and 0.2±0.03 ng/ml) and Iqch KO mice (12±2 ng/ml, 1.17±0.2 ng/ml, and 0.2±0.04 ng/ml). There was no significant difference observed in the serum levels of reproductive hormones between WT and Iqch KO mice; therefore, we did not include the data in the study. Furthermore, we have added quantitative data on epididymal size in the revised Figure 3—figure supplement 2.

      (3) The Western blots in Figure 6 should be rigorously quantified from multiple independent experiments so that there is stronger evidence supporting claims based on those assays.

      Response: We appreciate the reviewer's comment. As suggested, we have added quantified data in Figure 6—figure supplement 2 from the results of Western blotting in Figure 6.

      (4) Some of the mouse testis images could be improved. For example, the PNA and PLCz images in Figure S7 are difficult to interpret in that the tubules do not appear to be stage-matched, and since the authors claimed that testicular histology is unaffected in knockout testes, it should be feasible to stage-match control and knockout samples. Also, the anti-ICHQ and CaM immunofluorescence in Figure S10 would benefit from some cell-type-specific co-stains to more rigorously define their expression patterns, and they should also be stage-matched.

      Response: Thanks to reviewer’s suggestions. We have included immunofluorescence images of anti-PLCz, anti-PNA and anti-IQCH and CaM during spermatogenesis development.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) There are multiple grammatical errors and statements drawn beyond the results. The entire manuscript would benefit from professional editing.

      Response: We are sorry for the grammatical errors. We have enlisted professional editing services to refine our manuscript.

      (2) Line 40, "Firstly" is not appropriate here.

      Response: Thanks to reviewer’s comment. The word "Firstly" has been removed from the revised manuscript.

      (3) Line 44, "processes".

      Response: Thanks to reviewer’s suggestion. We have changed “process” in to “processes” on line 45.

      (4) "spermatocytogenesis (mitosis)" is incorrect.

      Response: Thanks to reviewer’s comment. We have changed “spermatocytogenesis (mitosis)” in to “mitosis” on line 47.

      (5) Ca and Ca2+ are both used in line 67 - 77. Be consistent.

      Response: We appreciate the reviewer's detailed checks. We have maintained consistency by revising instances of "Ca" to "Ca2+" in revised manuscript.

      (6) Line 238 to 240, "To elucidate the molecular mechanism by which IQCH regulates male fertility, we performed liquid chromatography tandem mass spectrometry (LC-MS/MS) analysis using mouse sperm lysates and detected 288 interactors of IQCH (Data S1)."It is not clear how LC-MS/MS using mouse sperm lysates could detect "288 interactors of IQCH"? A co-IP experiment for IQCH using sperm lysates prior to LC-MS/MS is needed to detect "interactors of IQCH". However, in the Methods section, consistent with the main text, proteomic quantification was conducted for protein extract from sperm. Figure legend for Fig. 5 did not explain this, either.Thus, it is unable to evaluate Figure 5.

      Response: We sincerely apologize for the oversight. Following reviewer’s suggestions, we have supplemented the method details of LC-MS/MS experiment in the Methods section of revised manuscript. Additionally, we conducted a co-IP experiment for IQCH using sperm lysates prior to LC-MS/MS and we did not include the corresponding figure in the manuscript. The results are as follows:

      Author response image 1.

      The results of a co-IP experiment for IQCH using sperm lysates from WT mice.

      (7) Line 246, "... key proteins that might be activated by IQCH". What does "activated" here refer to? Should it be "upregulated"?

      Response: We are sorry to our inexact statement. Instead, "upregulated" would better convey the intended meaning. According to reviewer’s suggestions, we have modified "activated" into "upregulated".

      (8) Line 252 to 254, "the cross-analysis revealed that 76 proteins were shared between the IQCH-bound proteins and the IQCH-activated proteins (Fig. 5E), implicating this subset of genes as direct targets." This is a confusing statement. Is the author trying to say, IQCH-bound proteins have upregulated expression, suggesting that IQCH enhances their expression?

      Response: We appreciate the reviewer's comment regarding the clarity of the statement in Line 252 to 254 of the manuscript. We have modified this sentence into “Importantly, cross-analysis revealed that 76 proteins were shared between the IQCH-bound proteins and the downregulated proteins in Iqch KO mice (Figure 5E), suggesting that IQCH might regulate their expression by the interaction.”

      (9) Line 260 to 261, "SYNCRIP, HNRNPK, FUS, EWSR1, ANXA7, SLC25A4, and HNRPAB ... the loss of which showed the greatest influence on the phenotype of the Iqch KO mice." There is no evidence suggesting that the loss of SYNCRIP, HNRNPK, FUS, EWSR1, ANXA7, SLC25A4, and HNRPAB leads to Iqch KO phenotype.

      Response: We apologize for our inaccurate statement. According to the literature, Fus KO, Ewsr1 KO, and Hnrnpk KO male mice were infertile, showing the spermatogenic arrest with absence of spermatozoa (Kuroda et al. 2000; Tian et al. 2021; Xu et al. 2022). Syncrip is involved meiotic process in Drosophila by interacting with Doublefault (Sechi et al. 2019). HNRPAB might be associated with mouse spermatogenesis by binding to Protamine 2 and contributing its translational regulation. Specifically, ANXA7 is a calcium-dependent phospholipid-binding protein that is a negative regulator of mitochondrial apoptosis (Du et al. 2015). Loss of SLC25A4 results in mitochondrial energy metabolism defects in mice (Graham et al. 1997). Moreover, RNA immunoprecipitation on formaldehyde cross-linked sperm followed by qPCR detected the interactions between HNRPAB and Catsper1, Catsper2, Catsper3, Ccdc40, Ccdc39, Ccdc65, Dnah8, Irrc6, and Dnhd1, which are essential for sperm development (Fukuda et al. 2013). Our Iqch KO mice showed abnormal sperm count, motility, morphology, and mitochondria, so we inferenced that IQCH might play a role in spermatogenesis by regulating the expression of SYNCRIP, HNRNPK, FUS, EWSR1, ANXA7, SLC25A4, and HNRPAB to some extent. We have changed an appropriate stamen that “We focused on SYNCRIP, HNRNPK, FUS, EWSR1, ANXA7, SLC25A4, and HNRPAB, which play important roles in spermatogenesis.”

      (10) Fig. 6C and 6D use different styles of error bars.

      Response: We are sorry for our oversight. In accordance with the reviewer's recommendations, we have modified the representation of error bars in the revised Fig. 6C.

      (11) Line 296 to 297, "As expected, CaM interacted with IQCH, as indicated by LC-MS/MS analysis". It is not clear how LC-MS/MS detects protein interaction.

      Response: As reviewer’s suggestions, we have supplemented the method details of LC-MS/MS experiment in the Methods section of revised manuscript. The results of proteins interacting with IQCH in sperm lysates from the LC-MS/MS experiment analysis were submitted as Figure 5—source data 1.

      (12) It is still not clear how the interaction between IQCH, CaM, and HNRPAB is required for the expression of each other.

      Response: Thank you for the reviewer’s comment. IQCH is a calmodulin-binding protein, and the binding of IQCH and CaM was confirmed by LC-MS/MS analysis and co-IP assay using sperm lysate. We thus speculated that if the interaction of IQCH and CaM might be a prerequisite for IQCH function. To prove that speculation, we took HNRPAB as an example. We knocked down IQCH in cultured cells, and a decrease in the expression of HNRPAB was observed. Similarly, when we knocked down CaM in cultured cells, and a decrease in the expression of HNRPAB was also detected. However, these results cannot exclude that IQCH or CaM could regulate HNRPAB expression alone. To investigate that if IQCH or CaM could regulate HNRPAB expression alone, we overexpressed IQCH in cells that knocked down CaM, while the expression of HNRPAB cannot be rescued, suggesting that IQCH cannot regulate HNRPAB expression when CaM is reduced. In consistent, we overexpressed CaM in cells that knocked down IQCH, while the expression of HNRPAB cannot be rescued, suggesting that CaM cannot regulate HNRPAB expression when IQCH is reduced. Thus, IQCH or CaM cannot regulate HNRPAB expression alone. Moreover, we deleted the IQ motif of IQCH, which is required for binding to CaM. The co-IP results showed that the interaction of IQCH and CaM was disrupted when deleting the IQ motif of IQCH, and the expression of HNRPAB was decreased. Therefore, we suggested that the interaction of IQCH and CaM might be required for IQCH regulating HNRPAB. In future studies, we will further investigate the relationships among IQCH, CaM, and HNRPAB.

      Reviewer #3 (Recommendations For The Authors):

      The authors have addressed my minor concerns. However, they neglected to address any of my more significant concerns in the public review. I assume that they simply overlooked these critiques, despite the fact that eLife explicitly states that "...as a general rule, concerns about a claim not being justified by the data should be explained in the public review." Therefore, the authors should have looked more carefully at the public reviews. As a result, my major concerns about the manuscript remain.

      Response: We apologize for overlooking the public review process. We have improved our study based on the feedback received during the public review.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Recommendations For The Authors):

      The additional data included in this revision nicely strengthens the major claim.

      I apologize that my comment about K+ concentration in the prior review was unclear. The cryoEM structure of KCNQ1 with S4 in the resting state was obtained with lowered K+ relative to the active state. Throughout the results and discussion it seems implied that the change in voltage sensor state is somehow causative of the change in selectivity filter state while the paper that identified the structures attributes the change in selectivity filter state not to voltage sensors, but to the change in [K+] between the 2 structures. Unless there is a flaw in my understanding of the conditions in which the selectivity filter structures used in modeling were generated, it seems misleading to ignore the change in [K+] when referring to the activated vs resting or up vs down structures. My understanding is that the closed conformation adopted in the resting/low [K+] is similar to that observed in low [K+] previously and is more commonly associated with [K+]-dependent inactivation, not resulting from voltage sensor deactivation as implied here. The original article presenting the low [K+] structure also suggests this. When discussing conformational changes in the selectivity filter, I strongly suggest referring to these structures as activated/high [K+] vs resting/low [K+] or something similar, as the [K+] concentration is a salient variable.

      There seems to be some major confusion here and we will try to explain how we think. Note that in the Mandela and MacKinnon paper, there is no significant difference in the amino acid positions in the selectivity filter between low and high K+ when S4 is in the activated position (See Mandala and Mackinnon, PNAS Suppl. Fig S5 C and D). There are only fewer K+ in the selectivity filter in low K+. So, the structure with the distorted selectivity filter is not due to low K+ by itself. Note that there is no real difference between macroscopic currents recorded in low and high K+ solutions (except what is expected from changes in driving force) for KCNQ1/KCNE1 channels (Larsen et al., Bioph J 2011), suggesting that low K+ do not promote the non-conductive state (Figure 1). We now include a section in the Discussion about high/low K+ in the structures and the absence of effects of K+ on the function of KCNQ1/KCNE1 channels.

      Author response image 1.

      Macroscopic KCNQ1/KCNE1 currents recorded in different K+ conditions.  Note that there is no difference between current recorded in low K+ (2 mM) conditions and high (96 mM) K+ conditions (n=3 oocytes). Currents were normalized in respect to high K+.

      Note also that, in the previous version of the manuscript, we did not propose that the position of S4 is what determines the state of the selectivity filter. We only reported that the CryoEM structure with S4 resting shows a distorted selectivity filter. It seems like our text confused the reviewer to think that we proposed that S4 determines the state of the selectivity filter, when we did not propose this earlier. We previously did not want to speculate too much about this, but we have now included a section in the Discussion to make our view clear in light of the confusion of the reviewers.

      It is clear from our data that the majority of sweeps are empty (which we assume is with S4 up), suggesting that the selectivity filter can be (and is in the majority of sweeps) in the non-conducting state even with S4 up.  We think that the selectivity filter switches between a non-conductive and a conductive conformation both with S4 down and with S4 up. The cryoEM structure in low K+ and S4 down just happened to catch the non-conductive state of the selectivity filter.  We have now added a section in the Discussion to clarify all this and explain how we think it works.

      However, S4 in the active conformation seems to stabilize the conductive conformation of the selectivity filter, because during long pulses the channel seems to stay open once opened (See Suppl Fig S2). So, one possibility is that the selectivity filter goes more readily into the non-conductive state when S4 is down (and maybe, or not, low K+ plays a role) and then when S4 moves up the selectivity filter sometimes recovers into the conductive state and stays there. We now have included a section in the Discussion to present our view. Since this whole discussion was initiated and pushed by the reviewer, we hope that the reviewers will not demand more data to support these ideas. We think that this addition makes sense since other readers might have the same questions and ideas as the reviewer, and we would like to prevent any confusion about this topic.

      Figure 1

      It remains unclear in the manuscript itself what "control" refers to. Are control patched the same patches that later receive LG?

      Yes, the control means the same patch before LG. We now indicate that in legends and text throughout.

      Supplementary Figure S1

      Unclear if any changes occur after addition of LG in left panel and if the LG data on right is paired in any way to data on left.

      Yes, in all cases the left and right panel in all figures are from the same patch. We now indicate that in legends and text throughout.

      The letter p is used both to represent open probability open probability from the all-point amplitude histogram and as a p-value statistical probability indicator sometime lower case, sometimes upper case. This was confusing.

      We have now exclusively use lower case p for statistical probability and Po for open probability.

      "This indicates that mutations of residues in the more intracellular region of the selectivity filter do not affect the Gmax increases and that the interactions that stabilize the channel involve only residues located near the external region part of the selectivity filter. "

      Seems too strongly worded, it remains possible that mutations of other residues in the more intracellular region of the selectivity filter could affect the Gmax increases.

      We have changed the text to: "Mutations of residues in the more intracellular region of the selectivity filter do not affect the Gmax increases, as if the interactions that stabilize the channel involve residues located near the external region part of the selectivity filter. "

      Supplementary Figure S7

      Please report Boltzmann fit parameters. What are "normalized" uA?

      We removed the uA, which was mistakenly inserted. The lines in the graphs are just lines connecting the dots and not Boltzmann fits, since we don’t have saturating curves in all panels to make unique fits.

      "We have previously shown that the effects of PUFAs on IKs channels involve the binding of PUFAs to two independent sites." Was binding to the sites actually shown? Suggest changing to: "We have previously proposed models in which the effects of PUFAs..."

      We have now changed this as the Reviewer suggested: " We have previously proposed models in which the effects of PUFAs on IKs channels involve the binding of PUFAs to two independent sites."

      Statistics used not always clear. Methods refer to multiple statistical tests but it is not clear which is used when.

      We use two different tests and it is now explained in figure legends when either was used.

      n values confusing. Sometimes # of sweeps used as n. Sometimes # patches used as n. In one instance "The average current during the single channel sweeps was increased by 2.3 {plus minus} 0.33 times (n = 4 patches, p =0.0006)" ...this sems a low p value for this n=4 sample?

      We have now more clearly indicated what n stands for in each case. There was an extra 0 in the p value, so now it is p = 0.006. Thanks for catching that error.

      Reviewer #2 (Recommendations For The Authors):

      I still have some comments for the revised manuscript.

      (1) (From the previous minor point #6) Since D317E and T309S did not show statistical significance in Figure 5A, the sentences such as "This data shows that Y315 and D317 are necessary for the ability of Lin-Glycine to increase Gmax" or "the effect of Lin-Glycine on Gmax of the KCNQ1/KCNE1 mutant was noticeably reduced compared to the WT channel showing the this residue contributes to the Gmax effect (Figure 5A)." may need to be toned down. Alternatively, I suggest the authors refer to Supplementary Figure S7 to confirm that Y315 and D317 are critical for increasing Gmax.

      We have redone the analysis and statistical evaluation in Fig 5. We no use the more appropriate value of the fitted Gmax (which use the whole dose response curve instead of only the 20 mM value) in the statistical evaluation and now Y315F and D317E are statistically different from wt.

      (2) Supplementary Fig. S1. All control diary plots include the green arrows to indicate the timing of lin-glycine (LG) application. It is a bit confusing why they are included. Is it to show that LG application did not have an immediate effect? Are the LG-free plots not available?

      Not sure what the Reviewer is asking about? In the previous review round the Reviewers asked specifically for this. The arrow shows when LG was applied and the plot on the right shows the effect of LG from the same patch.

      (3) The legend to Supplementary Figure S4, "The side chain of residues ... are highlighted as sticks and colored based on the atomic displacement values, from white to blue to red on a scale of 0 to 9 Å." They look mostly blue (or light blue). Which one is colored white? It might be better to use a different color code. It would also be nice to link the color code to the colors of Supplementary Figure S5, which currently uses a single color.

      We have removed “from white to blue to red on a scale of 0 to 9 Å” and instead now include a color scale directly in Fig S4 to show how much each atom moved based on the color.

      We feel it is not necessary to include color in Fig S5 since the scale of how much each atom moves is shown on the y axis.

      (4) Add unit (pA) to the y-axis of Supplementary Figure S2.

      pA has been added.

      Reviewer #3 (Recommendations For The Authors):

      Some issues on how data support conclusions are identified. Further justifications are suggested.

      186: “The decrease in first latency is most likely due to an effect of Lin-Glycine on Site I in the VSD and related to the shift in voltage dependence caused by Lin-Glycine." The results in Fig S1B do not seem to support this statement since the mutation Y315F in the pore helix seemed to have eliminated the effect of Lin-Glycine in reducing first latency. The authors may want to show that a mutation that eliminating Site I would eliminate the effect of Lin-Glycine on first latency. On the other hand, it will be also interesting to examine if another pore mutation, such as P320L (Fig 5) also reduce the effect of Lin-Glycine on first latency.

      These experiments are very hard and laborious, and we feel these are outside the scope of this paper which focuses on Site II and the mechanism of increasing Gmax. Further studies of the voltage shift and latency will have to be for a future study.

      The mutation D317E did not affect the effect of Lin-Glycine on Gmax significantly (Fig 5A, and Fig S7F comparing with Fig S7A), but the authors conclude that D317 is important for Lin-Glycine association. This conclusion needs a better justification.

      We have redone the analysis and statistical evaluation in Fig 5. We no use the more appropriate value of the fitted Gmax (which use the whole dose response curve instead of only the 20 mM value) in the statistical evaluation and now D317E is statistically different from wt

    1. Author response:

      The following is the authors’ response to the previous reviews.

      As you can see from the assessment (which is unchanged from before) and the reviews included below, the reviewers felt that the revisions did not yet address all of the major concerns. There was agreement that the strength of evidence would be upgraded to "solid" by addressing, at minimum, the following: 

      (1) Which of the results are significant for individual monkeys; and 

      (2) How trials from different target contrasts were analyzed 

      In this revision, we have addressed the two primary editorial recommendations:

      (1) We apologize if this information was not clear in the previous version. We have updated Table 1 to highlight clearly the significant results for individual monkeys. Six of our key results – pupil diameter (Fig 2B), microsaccades (Fig 2D), decoding performance for narrow-spiking units (Fig 3A), decoding performance for broad-spiking units (Fig 3B), target-evoked firing rate for all units (Fig 3E) and target-evoked firing rate for broad-spiking units (Fig 3F) – are significant for individual animals and therefore gives us high confidence regarding our results. Please also note that we present all results for individual animals in the Supplementary figures accompanying each main figure.

      (2) We have updated the manuscript and methods to explain how trials of each contrast were included in each analysis, and how contrast normalization was performed for the analysis in Figure 3. In addition, we discuss this point in the Discussion section, which we quote below:

      “Non-target stimulus contrasts were slightly different between hits and misses (mean: 33.1% in hits, 34.0% in misses, permutation test, 𝑝 = 0.02), but the contrast of the target was higher in hits compared to misses (mean: 38.7% in hits, 27.7% in misses, permutation test, 𝑝 = 1.6   𝑒 − 31). To control for potential effects of stimulus contrast, firing rates were first normalized by contrast before performing the analyses reported in Figure 3. For all other results, we considered only non-target stimuli, which had very minor differences in contrast (<1%) across hits and misses. In fact, this minor difference was in the opposite direction of our results with mean contrast being slightly higher for misses. While we cannot completely rule out any other effects of stimulus contrast, the normalization in Figure 3 and minor differences for non-target stimuli should minimize them.”

      Reviewer #1 (Public Review): 

      Summary: 

      In this study, Nandy and colleagues examine neural, physiological and behavioral correlates of perceptual variability in monkeys performing a visual change detection task. They used a laminar probe to record from area V4 while two macaque monkeys detected a small change in stimulus orientation that occurred at a random time in one of two locations, focusing their analysis on stimulus conditions where the animal was equally likely to detect (hit) or not-detect (miss) a briefly presented orientation change (target). They discovered two behavioral and physiological measures that are significantly different between hit and miss trials - pupil size tends to be slightly larger on hits vs. misses, and monkeys are more likely to miss the target on trials in which they made a microsaccade shortly before target onset. They also examined multiple measures of neural activity across the cortical layers and found some measures that are significantly different between hits and misses. 

      Strengths: 

      Overall the study is well executed and the analyses are appropriate (though several issues still need to be addressed as discussed in Specific Comments). 

      Thank you.

      Weaknesses: 

      My main concern with this study is that, with the exception of the pre-target microsaccades, the correlates of perceptual variability (differences between hits and misses) appear to be weak, potentially unreliable and disconnected. The GLM analysis of predictive power of trial outcome based on the behavioral and neural measures is only discussed at the end of the paper. This analysis shows that some of the measures have no significant predictive power, while others cannot be examined using the GLM analysis because these measures cannot be estimated in single trials. Given these weak and disconnected effects, my overall sense is that the current results provide limited advance to our understanding of the neural basis of perceptual variability. 

      Please see our response above to item #1 of the editorial recommendation. Six of our key results are individually significant in both animals giving us high confidence about the reliability and strength of our results. 

      Regarding the reviewer’s comment about the GLM, we note (also stated in the manuscript) that among the measures that we could estimate reliably on a single trial basis, two of these – pre-target microsaccades and input-layer firing rates – were reliable signatures of stimulus perception at threshold. This analysis does not imply that the other measures – Fano Factor, PPC, inter-laminar population correlations, SSC (which are all standard tools in modern systems neuroscience, and which cannot be estimated on a single-trial basis) – are irrelevant. Our intent in including the GLM analyses was to complement the results reported from these across-trial measures (Figs 4-7) with the predictive power of single-trial measures.

      While no study is entirely complete in itself, we have attempted to synthesize our results into a conceptual model as depicted in Fig 8.

      Reviewer #2 (Public Review): 

      Strengths: 

      The experiments were well-designed and executed with meticulous control. The analyses of both behavioural and electrophysiological data align with the standards in the field. 

      Thank you.

      Weaknesses: 

      Many of the findings appear to be subtle differences and incremental compared to previous literature, including the authors' own work. While incremental findings are not necessarily a problem, the manuscript lacks clear statements about the extent to which the dataset, analysis, and findings overlap with the authors' prior research. For example, one of the main findings, which suggests that V4 neurons exhibit larger visual responses in hit trials (as shown in Fig. 3), appears to have been previously reported in their 2017 paper. 

      We respectfully disagree with the assessment that the findings reported here are incremental over the results reported in our prior study (Nandy et al,. 2017). In the previous study, we compared the laminar profile of neural modulation due to the deployment of attention i.e. the main comparison points were the attend-in and the attend-away conditions while controlling for visual stimulation. In this study, we go one step further and home in on the attend-in condition and investigate the differences in the laminar profile of neural activity (and two additional physiological measures: pupil and microsaccades) when the animal either correctly reports or fails to report a stimulus with equal probability. We thus control for both the visual stimulation and the cued attention state of the animal. While there are parallels to our previous results (as the reviewer correctly noted), the results reported here cannot be trivially predicted from our previous results. Please also note that we discuss our new results in the context of prior results, from both our group and others, in the manuscript (lines 310-332).

      Furthermore, the manuscript does not explore potentially interesting aspects of the dataset. For instance, the authors could have investigated instances where monkeys made 'false' reports, such as executing saccades towards visual stimuli when no orientation change occurred, which allows for a broader analysis that considers the perceptual component of neural activity over pure sensory responses. Overall, lacking broad interest with the current form.

      We appreciate the reviewer’s feedback on analyzing false alarm trials. Our focus for this study was to investigate the behavioral and neural correlates accompanying a correct or incorrect perception of a target stimulus presented at perceptual threshold. False alarm trials, by definition, do not include a target presentation. Moreover, false alarm rates rapidly decline with duration into a trial, with high rates during the first non-target presentation and rates close to zero by the time of the eighth presentation (see figure). Investigating false alarms will thus involve a completely different form of analysis than we have undertaken here. We therefore feel that while analyzing false alarm trials will be an interesting avenue to pursue in the future, it is outside the scope of the present study.

      Author response image 1.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife assessment

      This useful study tests the hypothesis that Mycobacterium tuberculosis infection increases glycolysis in monocytes, which alters their capacity to migrate to lymph nodes as monocyte-derived dendritic cells. The authors conclude that infected monocytes are metabolically pre-conditioned to differentiate, with reduced expression of Hif1a and a glycolytically exhaustive phenotype, resulting in low migratory and immunologic potential. However, the evidence is incomplete as the use of live and dead mycobacteria still limits the ability to draw firm conclusions. The study will be of interest to microbiologists and infectious disease scientists.

      In response to the general eLife assessment, we would like to emphasize that the study did not deal with “infected monocytes” per se but rather with monocytes purified from patients with active TB. We show that monocytes purified from these TB patients (versus healthy controls) differentiate into DCs with different migratory capacities. In addition, to address the reviewer's comments in this new version of our manuscript, we include a relevant characterization of the migration capacity of DCs infected with Mtb to the plethora of assays already shown with viable bacteria in the previous revised version of our manuscript. 

      All in all, we believe that our study has significantly improved thanks to the feedback provided by the editor and reviewer panel during the different revision processes. We sincerely hope that this version of our manuscript is deemed fit for publication in this prestigious journal.

      Public Reviews:

      Reviewer #3 (Public Review):

      In the revised manuscript by Maio et al, the authors examined the bioenergetic mechanisms involved in the delayed migration of DC's during Mtb infection. The authors performed a series of in vitro infection experiments including bioenergetic experiments using the Agilent Seahorse XF, and glucose uptake and lactate production experiments. Also, data from SCENITH is included in the revised manuscript as well as some clinical data. This is a well written manuscript and addresses an important question in the TB field. A remaining weakness is the use of dead (irradiated) Mtb in several of the new experiments and claims where iMtb data were used to support live Mtb data. Another notable weakness lies in the author's insistence on asserting that lactate is the ultimate product of glycolysis, rather than acknowledging a large body of historical data in support of pyruvate's role in the process. This raises a perplexing issue highlighted by the authors: if Mtb indeed upregulates glycolysis, one would expect that inhibiting glycolysis would effectively control TB. However, the reality contradicts this expectation. Lastly, the examination of the bioenergetics of cells isolated from TB patients undergoing drug therapy, rather than studying them at their baseline state is a weakness.

      We thank the reviewer for this insightful assessment and feedback of our study. With regards to the data obtained with iMtb to support that with live Mtb, we have clarified the use of either iMtb or Mtb for each figure legend in the new version of the manuscript. Furthermore, we included the confirmation of the involvement of TLR2 ligation in the up-regulation of HIF-1α triggered by viable Mtb (new Fig S2E). We also conducted migration assays using (live) Mtb-infected dendritic cells (DCs) treated with either oxamate or PX-478 to validate that the HIF1a/glycolysis axis is indeed essential for DC migration (new Fig 5D).

      We respectfully acknowledge the reviewer's statement regarding the potential relationship between glycolysis and the control of TB. However, we find it necessary to elaborate on our stance, as our data offer a nuanced perspective. Our research indicates that DCs exhibit upregulated glycolysis following stimulation or infection by Mtb. This metabolic shift is crucial for facilitating cell migration to the draining lymph nodes, an essential step in mounting an effective immune response. Yet, it remains uncertain whether this glycolytic induction reaches a threshold conducive to generating a protective immune response, a matter that our findings do not definitively address. This aspect is carefully discussed in the manuscript, lines 380-385.

      Moreover, analyses of samples from chronic TB patients suggest that the outcome of inhibiting glycolysis may vary depending on factors such as the infection stage, the targeted cell type (e.g., monocytes, DCs), and the affected compartment (systemic versus local). This variability aligns with the concept of "too much, too little" exemplified by the dual roles of IFNγ (PMID: 28646367) and TNFα (PMID: 19275693) in TB, emphasizing the need to maintain an inflammatory equilibrium. In the context of the HIF1α/glycolysis axis, it appears to be a matter of timing: a case of "too early" activation of glycolysis in precursors, which could upset the delicate balance necessary for an effective immune response. We have added these comments in the discussion (pages 19-20, lines 468-485).

      In summary, while acknowledging the reviewer's perspective, we believe that a comprehensive understanding of the interplay between Mtb infection and glycolysis in myeloid cells requires further consideration of various contextual conditions, urging caution against oversimplified interpretations.

      With regard to the patients' information, as pointed out by the reviewer, according to the inclusion criteria for patient samples in the approved protocol by the Institutional Ethics Committee, we recruit patients who have received less than 15 days of treatment (for sensitive TB, the total treatment duration is at least 6 months). We do not have access to patient sample before they begin the treatment, as starting therapy is the most urgent matter in this case. Following the reviewer's suggestion, we investigated whether the glycolytic activity of monocytes correlated with the initiation of antibiotic treatment within this 15-day period. Our observations did not show any significant impact during the initial 15 days of treatment (see expanded reply below). However, after 2 months of treatment, we found that the glycolytic profile of CD16+ monocytes returned to baseline levels as per our analysis. This suggests that despite the normalization of glycolytic activity with antibiotic therapy, heightened basal glycolysis remains noticeable during the initial two weeks of treatment (time limit to meet the inclusion criteria in our study cohort).

      Recommendations for the authors:

      Reviewer #3 (Recommendations For The Authors):

      (1) In the revised manuscript, the authors addressed concerns related to using irradiated Mtb, a positive development. However, the study predominantly employs 1:1 or 2:1 MOI, representing a low infection model, with no observed statistical distinction between the two MOIs (Fig-1). To enhance the study, inclusion of a higher MOI (e.g., 5:1 or 10:1) would have been more informative. This becomes crucial as prior research on human macrophages indicates that Mtb infection typically hampers glycolysis, a finding inconsistent with the present study.

      As the reviewer notes, important work has documented the inhibition of glycolysis in M. tuberculosis-infected macrophages dependent on the MOI (PMID 30444490). For instance, in this study, hMDMs infected at an MOI of 1 showed increased extracellular acidification and glycolytic parameters, as opposed to macrophages infected at higher MOI, or the same MOI but measured in THP1 cells. In light of these findings, we attempted to extend our study with Mo-DCs to higher MOIs, but too much cell death was induced, limiting our ability to obtain reliable metabolic measurements and functional assays from these cultures. Consistent with this, other authors reported that more than 40% of Mo-DC die after 24 hours following infection with H37Rv at an MOI of 10 (PMID 22024399, Fig 2B). We acknowledge that more comprehensive focused in vivo studies would be needed to assess the overall impact of infection. We foresee that in the context of natural infection, DC with different levels of infection will coexist, some with low bacillary load that may be able to trigger glycolysis and migrate, others highly infected and more likely to die. In this case, we are unable to provide a full explanation for the delay in the onset of the adaptive response, an aspect that requires further investigation. From our perspective, the important contribution of our work is more focused on understanding the later stage of infection, when chronic infection is established, where precursors already seem to have a limited capacity to generate DC with a good migratory performance regardless of being confronted with a low bacillary load. 

      To better clarify the scope and limitations of the work, we added these comments to the discussion (see discussion, lines 405-408).

      The study emphasizes that Mtb infection enhances glycolysis in Mo-DCs (Fig-1 and Fig-2). Despite the authors advocating lactate as the end product (citing three reviews/opinions), the historical literature supported by detailed experimentation convincingly favors pyruvate. While the authors' attempt to support an alternate glycolytic paradigm is understandable, it is simply not necessary. This is further supported by the authors' claim that oxamate is an inhibitor of glycolysis (abstract and main text). Oxamate is a pyruvate analogue that directly inhibits the conversion of pyruvate into lactate by lactate dehydrogenase. Simply put, if oxamate was an inhibitor of glycolysis then the cells would have died.

      (2) Taking into account the reviewer's suggestions, we changed the text accordingly, referring to oxamate as an LDH inhibitor, including in the abstract.

      In Fig-2, clarify the term "bystander DCs." Explain why these MtbRFP- DCs exhibit distinct behavior compared to uninfected DCs, especially considering their similarity to Mtb-infected ones.

      (3) To clarify these results, as correctly suggested by the reviewer, we incorporated a sentence in the results section, stating that bystander DCs are cells that are not in direct association with Mtb (Mtb-RFP-DCs), but are rather nearby and exposed to the same environment (page 7, line 145-148). In other words, bystander cells are those exposed to the same secretome and soluble factors as infected cells. Our data indicate that bystander DCs upregulate their state of glycolysis just like infected DCs do, which suggests the presence of soluble mediators induced during infection that are capable of triggering glycolysis even in uninfected cells.

      These results are in line with the observation that bacteria lacking infectious capacity (such as the irradiated Mtb) also trigger glycolysis in DCs (Fig 1), likely via TLR2 receptors that are potentially activated by the release of mycobacterial antigens or bacterial debris present in the microenvironment (Fig 3). We incorporated this interpretation in the discussion of the manuscript (lines 403-408).

      (4) Notably, the authors conducted SCENITH on both iMtb and viable Mtb (Fig-2). However, OCR, PER, and Mito- & Glyco- ATP were solely measured in MO-DCs stimulated by iMtb. Given the distinct glycolytic responses between iMtb and viable Mtb, it is crucial to assess these parameters in Mo-DCs treated with viable Mtb. Moreover, it is unclear as to how the relative ATP in Fig-2F was calculated as both Mito-ATP and Glyco-ATP is significantly high in iMtb-treated Mo-DCs (Fig-2E). Also, figure 2 contains panels with no labeling, which is confusing.

      We appreciate the reviewer's suggestion that additional determinations would enrich the bioenergetic profile of DCs during infection. However, due to biosafety considerations and economic-driven limitations, we are currently unable to measure OCR, PER, and Mito- & Glyco- ATP, as these assessments require live cell cultures within BSL3 containment, if live Mtb is to be employed. Regrettably, our BSL3 facility is not equipped with a Seahorse instrument—few facilities in the world have such type of BLS3-driven investment. For this key reason, we employed SCENITH for our BSL3-based experiments.

      Concerning the how ATP was calculated, we show below the raw data for Mito-ATP and Glyco-ATP results and calculations of their relative contributions.

      Author response table 1.

      (5) In Figures 3, 4, & 5, the consistent use of only iMtb was observed. Previous concerns about this approach were raised in the review, with the authors asserting that the use of viable Mtb was beyond the manuscript's scope. However, this claim is inaccurate. Both the authors' findings and literature elsewhere emphasize notable differences not only in host-cell metabolism but also in immune responses when treated with viable Mtb compared to dead or iMtb. Therefore, it is recommended to incorporate viable Mtb in experiments where only iMtb was utilized. Also, in the abstract (3rd sentence), do the authors refer to live or irradiated Mtb? It is imperative to clearly indicate this distinction, as the subsequent conclusions are based only on one of these two scenarios, not both. The contradictory mitochondrial mass results (figure 1; live and dead Mtb showed opposite mitochondrial mass results) clearly illustrate the profound difference live (versus dead) Mtb cells can have on an experiment.

      We thank the reviewer for stating this concern. For Figure 3, the involvement of TLR2 ligation on lactate release was also confirmed with live Mtb (shown in Figure S2D). In this current version, we also confirmed the involvement of TLR2 ligation in the up-regulation of HIF-1α triggered by live Mtb (new Fig S2E). As for Figure 4, we agree that performing assays with live Mtb will add complementary information. Indeed, we hope to investigate in the future the impact of the glycolysis/HIF1a axes on the adaptive immune response. We believe that employing live bacteria and considering their active immune evasion strategies will be crucial. However, at present, this is not the focus of the current manuscript and is beyond its scope.

      We also agree with the reviewer that confirmation of the migratory behavior of DCs following Mtb infection is a crucial aspect of the study. To comply with this pertinent request, we performed new migration assays using Mtb-infected DCs treated with oxamate or PX-478 to validate that the HIF1a/glycolysis axis; results convincingly demonstrate that this axis is essential for DC migration, particularly in the context of Mtb-infected cells (new Fig 5D). Having observed the same inhibitory effect of HIF1a and LDH inhibition on cell migration in either Mtb-infected or iMtb-stimulated DCs, we consider that the sentence alluded to by the reviewer in the abstract is now applicable to both contexts (page 2, line 34-36). We hope this reviewer agrees.

      (6) The discussion and the graphical abstract elucidating the distinctions in glycolysis between CD16+ monocytes of HS and TB patients and iMtb-treated Mo-DCs are currently confusing and require clarification. According to the abstract, monocytes from TB patients exhibit heightened glycolysis, resulting in diminished HIF-a activity and migratory capacity of MO-DCs. This prompts a question: if exacerbated glycolysis in monocytes is associated with adverse outcomes, wouldn't it be logical to consider suppressing glycolysis? If so, how can inhibiting glycolysis, a favored metabolic pathway for pro-inflammatory responses, be beneficial for TB therapy?

      We understand the reviewer’s concern about this apparent paradox. As previously mentioned in response to the public review provided by the reviewer, inhibiting glycolysis may yield varying outcomes depending on the stage of infection, as well as the cellular target (e.g., monocytes, DCs) or compartment (systemic versus local). It is imperative to delve deeper into the potential role of the HIF1α/glycolysis axis at the systemic level within the context of chronic inflammation, contrasting with its role in a local setting during the acute phase of infection.

      A comprehensive understanding of the interplay between Mtb infection and glycolysis in myeloid cells requires further consideration of various contextual conditions, urging caution against oversimplified interpretations. For instance, one of the objectives of host-directed therapies (HDTs) is to mitigate host-response inflammatory toxicity, which can impede treatment efficacy (doi: 10.3389/fimmu.2021.645485). In this regard, traditional anti-inflammatory drugs such as non-steroidal anti-inflammatory drugs (NSAIDs) and corticosteroids have been explored as adjunct therapies due to their immunomodulatory properties. Additionally, compounds like vitamin D, phenylbutyrate (PBA), metformin, and thalidomide, among others, have been investigated in the context of TB infections (doi:10.3389/fimmu.2017.00772), highlighting the diverse range of strategies aimed at enhancing TB treatment. These efforts extend beyond bolstering antimicrobial activity to encompass minimizing inflammation and mitigating tissue damage.

      (7) I am not convinced that BubbleMap made any significant contribution to the manuscript perhaps because it is poorly described in the figure legends/main text (I am unable to determine what data set is significant or not).

      We agree with the reviewer’s comment. To clarify the valuable information gleaned from these analyses, we have added interpretive guidelines on bubble color, bubble size and statistical significance in the legend of Figure 7. We hope these changes may reflect the significant contribution of the BubbleMap analysis approach to this study, which demonstrates a significant enrichment of interferon response gene expression in the monocyte compartment from patients with active TB compared to their control counterparts. Notably, this enrichment does not extend to genes associated with the OXPHOS hallmark.

      (8) The use of cells/monocytes from TB patients is a concern in addition to the incomplete demographic table. In the case of the latter, absolute numbers including percentages should be included. Importantly, it appears that cells from TB patients were used, that received anti-TB drug therapy (regimen not stated) up to two weeks post diagnosis and not at baseline. This is important as recent studies have shown that anti-TB drugs modulates the bioenergetics of host cells. Lastly, what were the precise TB symptoms the authors referred to in figure 7C?

      We have updated the demographic table and included the absolute numbers. We concur with the reviewer's viewpoint, particularly in light of recent findings illustrating the impact of anti-TB drug treatment on cell metabolism (doi: 10.1128/AAC.00932-21/). Again, this study underscores the complexity of such effects, which exhibit considerable variability influenced by factors such as cell type, drug concentration, and combination therapy.

      Despite this variability, our analysis involving monocytes from TB patients, who received different antibiotic combinations within short time frames (less than 15 days) reveals a marked increase in glycolysis in CD16+ monocytes compared to healthy counterparts. We did not observe a correlation between monocyte glycolytic capacity and the start time of antibiotic treatment within this 15-day window (see below, Author response image 1). These findings suggest that the antibiotic regimen does not have a significant impact on monocyte glycolytic capacity during the first 15 days.  However, we did observe an effect of antibiotic treatment when comparing patients before and 2 months after treatment. Enrichment analysis of various monocyte subsets before and after 2 months of treatment (GEO accession number: GSE185372) showed that CD14dim CD16+ and CD14+ CD16+ populations had higher glycolytic activity before treatment, which is decreased then post-treatment (Author response image 2).

      Author response image 1.

      Correlation analysis between the baseline glycolytic capacity and the time since treatment onset for each monocyte subset (CD14+CD16-, CD14+CD16+ and CD14dimCD16+, N = 11). Linear regression lines are shown. Spearman’s rank test. The data are represented as scatter plots with each circle representing a single individual.

      Author response image 2.

      Gene enrichment analysis for glycolytic genes on the pairwise comparisons of each monocyte subset (CD14+CD16-, CD14+CD16+ and CD14dimCD16+) from patients with active TB pre-treatment vs patients with active TB (TB) undergoing treatment for 2 months. Comparisons with a p-value of less than 0.05 and an FDR value of less than 0.25 are considered significantly different.

      Overall, our results indicate that while drug treatment does affect cell bioenergetics, this effect is not prominent within the first 15 days of treatment. CD16+ monocytes maintain high basal glycolytic activity that normalizes after treatment, contrasting with the CD16- population (even under the same circulating antibiotic doses). This highlights the intricate interplay between anti-TB drugs and cellular metabolism, underscoring the need for further research to understand the underlying mechanisms and therapeutic implications.

      Finally, the term symptoms evolution refers to the time period during which a patient experiences cough and phlegm for more than 2-3 weeks, with or without sputum that may (or not) be bloody, accompanied by symptoms of constitutional illness (e.g, loss of appetite, weight loss, night sweats, general malaise). As requested, this definition has been included in the method section (page 28-29, lines 705-709).

      Minor:

      (1) Incorporate the abbreviation for tuberculosis "(TB)" in the first line of the abstract and similarly introduce the abbreviation for Mycobacterium tuberculosis when it is first mentioned in the abstract.

      Thank you, we have amended it accordingly.

      (2) As the majority of experiments are in vitro, the authors should specify the number of times each experiment was conducted for every figure.

      We have included this information in each figure legend (see N for each panel). Since the majority of our approaches are conducted in vitro using primary cell cultures (specifically, human monocyte-derived DCs), we utilized samples from four to ten independent donors, not replicates, in order to account for the variability seen between donors.

      (3) Rename Fig-2. Ensure consistent labeling for the metabolic dependency of uninfected, Mtb-infected, and the Bystander panel, aligning with the format used in panels A & B. Similarly, replace '-' with 'uninfected'.

      We have modified the figure following most of the reviewer’s suggestions. However, we decided to keep the nomenclature “-” to denote a control condition, which can be unstimulated (panels A-B, fig 2) or uninfected cells (panels C-D, fig 2) depending on the experimental design.

      (4) Discussion: It is unclear what the authors mean by 'some sort of exhausted glycolytic capacity'.

      We have slightly modified the phrase.

    1. Author response:

      The following is the authors’ response to the current reviews.

      eLife assessment

      This useful manuscript challenges the utility of current paradigms for estimating brain-age with magnetic resonance imaging measures, but presents inadequate evidence to support the suggestion that an alternative approach focused on predicting cognition is more useful. The paper would benefit from a clearer explication of the methods and a more critical evaluation of the conceptual basis of the different models. This work will be of interest to researchers working on brain-age and related models.

      Thank you so much for providing high-quality reviews on our manuscript. We revised the manuscript to address all of the reviewers’ comments and provided full responses to each of the comments below. Importantly, in this revision, we clarified that we did not intend to use Brain Cognition as an alternative approach as mentioned by the editor. This is because, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And such quantification is the third aim of this study.

      Reviewer #1 (Public Review):

      In this paper, the authors evaluate the utility of brain age derived metrics for predicting cognitive decline by performing a 'commonality' analysis in a downstream regression that enables the different contribution of different predictors to be assessed. The main conclusion is that brain age derived metrics do not explain much additional variation in cognition over and above what is already explained by age. The authors propose to use a regression model trained to predict cognition ('brain cognition') as an alternative suited to applications of cognitive decline. While this is less accurate overall than brain age, it explains more unique variance in the downstream regression.

      Importantly, in this revision, we clarified that we did not intend to use Brain Cognition as an alternative approach. This is because, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      REVISED VERSION: while the authors have partially addressed my concerns, I do not feel they have addressed them all. I do not feel they have addressed the weight instability and concerns about the stacked regression models satisfactorily.

      Please see our responses to #3 below

      I also must say that I agree with Reviewer 3 about the limitations of the brain age and brain cognition methods conceptually. In particular that the regression model used to predict fluid cognition will by construction explain more variance in cognition than a brain age model that is trained to predict age. This suffers from the same problem the authors raise with brain age and would indeed disappear if the authors had a separate measure of cognition against which to validate and were then to regress this out as they do for age correction. I am aware that these conceptual problems are more widespread than this paper alone (in fact throughout the brain age literature), so I do not believe the authors should be penalised for that. However, I do think they can make these concerns more explicit and further tone down the comments they make about the utility of brain cognition. I have indicated the main considerations about these points in the recommendations section below.

      Thank you so much for raising this point. We now have the following statement in the introduction and discussion to address this concern (see below).

      Briefly, we made it explicit that, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. That is, the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. More importantly, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And this is the third goal of this present study.

      From Introduction:

      “Third and finally, certain variation in fluid cognition is related to brain MRI, but to what extent does Brain Age not capture this variation? To estimate the variation in fluid cognition that is related to the brain MRI, we could build prediction models that directly predict fluid cognition (i.e., as opposed to chronological age) from brain MRI data. Previous studies found reasonable predictive performances of these cognition-prediction models, built from certain MRI modalities (Dubois et al., 2018; Pat, Wang, Anney, et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). Analogous to Brain Age, we called the predicted values from these cognition-prediction models, Brain Cognition. The strength of an out-of-sample relationship between Brain Cognition and fluid cognition reflects variation in fluid cognition that is related to the brain MRI and, therefore, indicates the upper limit of Brain Age’s capability in capturing fluid cognition. This is, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Consequently, if we included Brain Cognition, Brain Age and chronological age in the same model to explain fluid cognition, we would be able to examine the unique effects of Brain Cognition that explain fluid cognition beyond Brain Age and chronological age. These unique effects of Brain Cognition, in turn, would indicate the amount of co-variation between brain MRI and fluid cognition that is missed by Brain Age.”

      From Discussion:

      “Third, by introducing Brain Cognition, we showed the extent to which Brain Age indices were not able to capture the variation in fluid cognition that is related to brain MRI. More specifically, using Brain Cognition allowed us to gauge the variation in fluid cognition that is related to the brain MRI, and thereby, to estimate the upper limit of what Brain Age can do. Moreover, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      From our results, Brain Cognition, especially from certain cognition-prediction models such as the stacked models, has relatively good predictive performance, consistent with previous studies (Dubois et al., 2018; Pat, Wang, Anney, et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). We then examined Brain Cognition using commonality analyses (Nimon et al., 2008) in multiple regression models having a Brain Age index, chronological age and Brain Cognition as regressors to explain fluid cognition. Similar to Brain Age indices, Brain Cognition exhibited large common effects with chronological age. But more importantly, unlike Brain Age indices, Brain Cognition showed large unique effects, up to around 11%. As explained above, the unique effects of Brain Cognition indicated the amount of co-variation between brain MRI and fluid cognition that was missed by a Brain Age index and chronological age. This missing amount was relatively high, considering that Brain Age and chronological age together explained around 32% of the total variation in fluid cognition. Accordingly, if a Brain Age index was used as a biomarker along with chronological age, we would have missed an opportunity to improve the performance of the model by around one-third of the variation explained.”

      This is a reasonably good paper and the use of a commonality analysis is a nice contribution to understanding variance partitioning across different covariates. I have some comments that I believe the authors ought to address, which mostly relate to clarity and interpretation

      Reviewer #1 Public Review #1

      First, from a conceptual point of view, the authors focus exclusively on cognition as a downstream outcome. I would suggest the authors nuance their discussion to provide broader considerations of the utility of their method and on the limits of interpretation of brain age models more generally.

      Thank you for your comments on this issue.

      We now discussed the broader consideration in detail:

      (1) the consistency between our findings on fluid cognition and other recent works on brain disorders,

      (2) the difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie, Kaufmann, et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021)

      and

      (3) suggested solutions we and others made to optimise the utility of Brain Age for both cognitive functioning and brain disorders.

      From Discussion:

      “This discrepancy between the predictive performance of age-prediction models and the utility of Brain Age indices as a biomarker is consistent with recent findings (for review, see Jirsaraie, Gorelik, et al., 2023), both in the context of cognitive functioning (Jirsaraie, Kaufmann, et al., 2023) and neurological/psychological disorders (Bashyam et al., 2020; Rokicki et al., 2021). For instance, combining different MRI modalities into the prediction models, similar to our stacked models, often leads to the highest performance of age-prediction models, but does not likely explain the highest variance across different phenotypes, including cognitive functioning and beyond (Jirsaraie, Gorelik, et al., 2023).”

      “There is a notable difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie, Kaufmann, et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021). We consider the former as a normative type of study and the latter as a case-control type of study (Insel et al., 2010; Marquand et al., 2016). Those case-control Brain Age studies focusing on neurological/psychological disorders often build age-prediction models from MRI data of largely healthy participants (e.g., controls in a case-control design or large samples in a population-based design), apply the built age-prediction models to participants without vs. with neurological/psychological disorders and compare Brain Age indices between the two groups. On the one hand, this means that case-control studies treat Brain Age as a method to detect anomalies in the neurological/psychological group (Hahn et al., 2021). On the other hand, this also means that case-control studies have to ignore under-fitted models when applied prediction models built from largely healthy participants to participants with neurological/psychological disorders (i.e., Brain Age may predict chronological age well for the controls, but not for those with a disorder). On the contrary, our study and other normative studies focusing on cognitive functioning often build age-prediction models from MRI data of largely healthy participants and apply the built age-prediction models to participants who are also largely healthy. Accordingly, the age-prediction models for explaining cognitive functioning in normative studies, while not allowing us to detect group-level anomalies, do not suffer from being under-fitted. This unfortunately might limit the generalisability of our study into just the normative type of study. Future work is still needed to test the utility of brain age in the case-control case.”

      “Next, researchers should not select age-prediction models based solely on age-prediction performance. Instead, researchers could select age-prediction models that explained phenotypes of interest the best. Here we selected age-prediction models based on a set of features (i.e., modalities) of brain MRI. This strategy was found effective not only for fluid cognition as we demonstrated here, but also for neurological and psychological disorders as shown elsewhere (Jirsaraie, Gorelik, et al., 2023; Rokicki et al., 2021). Rokicki and colleagues (2021), for instance, found that, while integrating across MRI modalities led to age-prediction models with the highest age-prediction performance, using only T1 structural MRI gave age-prediction models that were better at classifying Alzheimer’s disease. Similarly, using only cerebral blood flow gave age-prediction models that were better at classifying mild/subjective cognitive impairment, schizophrenia and bipolar disorder.

      As opposed to selecting age-prediction models based on a set of features, researchers could also select age-prediction models based on modelling methods. For instance, Jirsaraie and colleagues (2023) compared gradient tree boosting (GTB) and deep-learning brain network (DBN) algorithms in building age-prediction models. They found GTB to have higher age-prediction performance but DBN to have better utility in explaining cognitive functioning. In this case, an algorithm with better utility (e.g., DBN) should be used for explaining a phenotype of interest. Similarly, Bashyam and colleagues (2020) built different DBN-based age-prediction models, varying in age-prediction performance. The DBN models with a higher number of epochs corresponded to higher age-prediction performance. However, DBN-based age-prediction models with a moderate (as opposed to higher or lower) number of epochs were better at classifying Alzheimer’s disease, mild cognitive impairment and schizophrenia. In this case, a model from the same algorithm with better utility (e.g., those DBN with a moderate epoch number) should be used for explaining a phenotype of interest. Accordingly, this calls for a change in research practice, as recently pointed out by Jirasarie and colleagues (2023, p7), “Despite mounting evidence, there is a persisting assumption across several studies that the most accurate brain age models will have the most potential for detecting differences in a given phenotype of interest”. Future neuroimaging research should aim to build age-prediction models that are not necessarily good at predicting age, but at capturing phenotypes of interest.”

      Reviewer #1 Public Review #2

      Second, from a methods perspective, there is not a sufficient explanation of the methodological procedures in the current manuscript to fully understand how the stacked regression models were constructed. I would request that the authors provide more information to enable the reader to better understand the stacked regression models used to ensure that these models are not overfit.

      Thank you for allowing us an opportunity to clarify our stacked model. We made additional clarification to make this clearer (see below). We wanted to confirm that we did not use test sets to build a stacked model in both lower and higher levels of the Elastic Net models. Test sets were there just for testing the performance of the models.

      From Methods: “We used nested cross-validation (CV) to build these prediction models (see Figure 7). We first split the data into five outer folds, leaving each outer fold with around 100 participants. This number of participants in each fold is to ensure the stability of the test performance across folds. In each outer-fold CV loop, one of the outer folds was treated as an outer-fold test set, and the rest was treated as an outer-fold training set. Ultimately, looping through the nested CV resulted in a) prediction models from each of the 18 sets of features as well as b) prediction models that drew information across different combinations of the 18 separate sets, known as “stacked models.” We specified eight stacked models: “All” (i.e., including all 18 sets of features), “All excluding Task FC”, “All excluding Task Contrast”, “Non-Task” (i.e., including only Rest FC and sMRI), “Resting and Task FC”, “Task Contrast and FC”, “Task Contrast” and “Task FC”. Accordingly, there were 26 prediction models in total for both Brain Age and Brain Cognition.

      To create these 26 prediction models, we applied three steps for each outer-fold loop. The first step aimed at tuning prediction models for each of 18 sets of features. This step only involved the outer-fold training set and did not involve the outer-fold test set. Here, we divided the outer-fold training set into five inner folds and applied inner-fold CV to tune hyperparameters with grid search. Specifically, in each inner-fold CV, one of the inner folds was treated as an inner-fold validation set, and the rest was treated as an inner-fold training set. Within each inner-fold CV loop, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters and applied the estimated model to the inner-fold validation set. After looping through the inner-fold CV, we, then, chose the prediction models that led to the highest performance, reflected by coefficient of determination (R2), on average across the inner-fold validation sets. This led to 18 tuned models, one for each of the 18 sets of features, for each outer fold.

      The second step aimed at tuning stacked models. Same as the first step, the second step only involved the outer-fold training set and did not involve the outer-fold test set. Here, using the same outer-fold training set as the first step, we applied tuned models, created from the first step, one from each of the 18 sets of features, resulting in 18 predicted values for each participant. We, then, re-divided this outer-fold training set into new five inner folds. In each inner fold, we treated different combinations of the 18 predicted values from separate sets of features as features to predict the targets in separate “stacked” models. Same as the first step, in each inner-fold CV loop, we treated one out of five inner folds as an inner-fold validation set, and the rest as an inner-fold training set. Also as in the first step, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters from our grid. We tuned the hyperparameters of stacked models using grid search by selecting the models with the highest R2 on average across the inner-fold validation sets. This led to eight tuned stacked models.

      The third step aimed at testing the predictive performance of the 18 tuned prediction models from each of the set of features, built from the first step, and eight tuned stacked models, built from the second step. Unlike the first two steps, here we applied the already tuned models to the outer-fold test set. We started by applying the 18 tuned prediction models from each of the sets of features to each observation in the outer-fold test set, resulting in 18 predicted values. We then applied the tuned stacked models to these predicted values from separate sets of features, resulting in eight predicted values.

      To demonstrate the predictive performance, we assessed the similarity between the observed values and the predicted values of each model across outer-fold test sets, using Pearson’s r, coefficient of determination (R2) and mean absolute error (MAE). Note that for R2, we used the sum of squares definition (i.e., R2 = 1 – (sum of squares residuals/total sum of squares)) per a previous recommendation (Poldrack et al., 2020). We considered the predicted values from the outer-fold test sets of models predicting age or fluid cognition, as Brain Age and Brain Cognition, respectively.”

      Note some previous research, including ours (Tetereva et al., 2022), splits the observations in the outer-fold training set into layer 1 and layer 2 and applies the first and second steps to layers 1 and 2, respectively. Here we decided against this approach and used the same outer-fold training set for both first and second steps in order to avoid potential bias toward the stacked models. This is because, when the data are split into two layers, predictive models built for each separate set of features only use the data from layer 1, while the stacked models use the data from both layers 1 and 2. In practice with large enough data, these two approaches might not differ much, as we demonstrated previously (Tetereva et al., 2022).

      Reviewer #1 Public Review #3

      Please also provide an indication of the different regression strengths that were estimated across the different models and cross-validation splits. Also, how stable were the weights across splits?

      The focus of this article is on the predictions. Still, it is informative for readers to understand how stable the feature importance (i.e., Elastic Net coefficients) is. To demonstrate the stability of feature importance, we now examined the rank stability of feature importance using Spearman’s ρ (see Figure 4). Specifically, we correlated the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, we computed 10 Spearman’s ρ for each prediction model of the same features. We found Spearman’s ρ to be varied dramatically in both age-prediction (range=.31-.94) and fluid cognition-prediction (range=.16-.84) models. This means that some prediction models were much more stable in their feature importance than others. This is probably due to various factors such as a) the collinearity of features in the model, b) the number of features (e.g., 71,631 features in functional connectivity, which were further reduced to 75 PCAs, as compared to 19 features in subcortical volume based on the ASEG atlas), c) the penalisation of coefficients either with ‘Ridge’ or ‘Lasso’ methods, which resulted in reduction as a group of features or selection of a feature among correlated features, respectively, and d) the predictive performance of the models. Understanding the stability of feature importance is beyond the scope of the current article. As mentioned by Reviewer 1, “The predictions can be stable when the coefficients are not,” and we chose to focus on the prediction in the current article.

      Reviewer #1 Public Review #4

      Please provide more details about the task designs, MRI processing procedures that were employed on this sample in addition to the regression methods and bias correction methods used. For example, there are several different parameterisations of the elastic net, please provide equations to describe the method used here so that readers can easily determine how the regularisation parameters should be interpreted.

      Thank you for the opportunity for us to provide more methodical details.

      First, for the task design, we included the following statements:

      From Methods:

      “HCP-A collected fMRI data from three tasks: Face Name (Sperling et al., 2001), Conditioned Approach Response Inhibition Task (CARIT) (Somerville et al., 2018) and VISual MOTOR (VISMOTOR) (Ances et al., 2009).

      First, the Face Name task (Sperling et al., 2001) taps into episodic memory. The task had three blocks. In the encoding block [Encoding], participants were asked to memorise the names of faces shown. These faces were then shown again in the recall block [Recall] when the participants were asked if they could remember the names of the previously shown faces. There was also the distractor block [Distractor] occurring between the encoding and recall blocks. Here participants were distracted by a Go/NoGo task. We computed six contrasts for this Face Name task: [Encode], [Recall], [Distractor], [Encode vs. Distractor], [Recall vs. Distractor] and [Encode vs. Recall].

      Second, the CARIT task (Somerville et al., 2018) was adapted from the classic Go/NoGo task and taps into inhibitory control. Participants were asked to press a button to all [Go] but not to two [NoGo] shapes. We computed three contrasts for the CARIT task: [NoGo], [Go] and [NoGo vs. Go].

      Third, the VISMOTOR task (Ances et al., 2009) was designed to test simple activation of the motor and visual cortices. Participants saw a checkerboard with a red square either on the left or right. They needed to press a corresponding key to indicate the location of the red square. We computed just one contrast for the VISMOTOR task: [Vismotor], which indicates the presence of the checkerboard vs. baseline.”

      Second, for MRI processing procedures, we included the following statements.

      From Methods: “HCP-A provides details of parameters for brain MRI elsewhere (Bookheimer et al., 2019; Harms et al., 2018). Here we used MRI data that were pre-processed by the HCP-A with recommended methods, including the MSMALL alignment (Glasser et al., 2016; Robinson et al., 2018) and ICA-FIX (Glasser et al., 2016) for functional MRI. We used multiple brain MRI modalities, covering task functional MRI (task fMRI), resting-state functional MRI (rsfMRI) and structural MRI (sMRI), and organised them into 19 sets of features.”

      “ Sets of Features 1-10: Task fMRI contrast (Task Contrast) Task contrasts reflect fMRI activation relevant to events in each task. Bookheimer and colleagues (2019) provided detailed information about the fMRI in HCP-A. Here we focused on the pre-processed task fMRI Connectivity Informatics Technology Initiative (CIFTI) files with a suffix, “_PA_Atlas_MSMAll_hp0_clean.dtseries.nii.” These CIFTI files encompassed both the cortical mesh surface and subcortical volume (Glasser et al., 2013). Collected using the posterior-to-anterior (PA) phase, these files were aligned using MSMALL (Glasser et al., 2016; Robinson et al., 2018), linear detrended (see https://groups.google.com/a/humanconnectome.org/g/hcp-users/c/ZLJc092h980/m/GiihzQAUAwAJ) and cleaned from potential artifacts using ICA-FIX (Glasser et al., 2016).

      To extract Task Contrasts, we regressed the fMRI time series on the convolved task events using a double-gamma canonical hemodynamic response function via FMRIB Software Library (FSL)’s FMRI Expert Analysis Tool (FEAT) (Woolrich et al., 2001). We kept FSL’s default high pass cutoff at 200s (i.e., .005 Hz). We then parcellated the contrast ‘cope’ files, using the Glasser atlas (Gordon et al., 2016) for cortical surface regions and the Freesurfer’s automatic segmentation (aseg) (Fischl et al., 2002) for subcortical regions. This resulted in 379 regions, whose number was, in turn, the number of features for each Task Contrast set of features. “

      “ Sets of Features 11-13: Task fMRI functional connectivity (Task FC) Task FC reflects functional connectivity (FC ) among the brain regions during each task, which is considered an important source of individual differences (Elliott et al., 2019; Fair et al., 2007; Gratton et al., 2018). We used the same CIFTI file “_PA_Atlas_MSMAll_hp0_clean.dtseries.nii.” as the task contrasts. Unlike Task Contrasts, here we treated the double-gamma, convolved task events as regressors of no interest and focused on the residuals of the regression from each task (Fair et al., 2007). We computed these regressors on FSL, and regressed them in nilearn (Abraham et al., 2014). Following previous work on task FC (Elliott et al., 2019), we applied a highpass at .008 Hz. For parcellation, we used the same atlases as Task Contrast (Fischl et al., 2002; Glasser et al., 2016). We computed Pearson’s correlations of each pair of 379 regions, resulting in a table of 71,631 non-overlapping FC indices for each task. We then applied r-to-z transformation and principal component analysis (PCA) of 75 components (Rasero et al., 2021; Sripada et al., 2019, 2020). Note to avoid data leakage, we conducted the PCA on each training set and applied its definition to the corresponding test set. Accordingly, there were three sets of 75 features for Task FC, one for each task.

      Set of Features 14: Resting-state functional MRI functional connectivity (Rest FC) Similar to Task FC, Rest FC reflects functional connectivity (FC ) among the brain regions, except that Rest FC occurred during the resting (as opposed to task-performing) period. HCP-A collected Rest FC from four 6.42-min (488 frames) runs across two days, leading to 26-min long data (Harms et al., 2018). On each day, the study scanned two runs of Rest FC, starting with anterior-to-posterior (AP) and then with posterior-to-anterior (PA) phase encoding polarity. We used the “rfMRI_REST_Atlas_MSMAll_hp0_clean.dscalar.nii” file that was pre-processed and concatenated across the four runs. We applied the same computations (i.e., highpass filter, parcellation, Pearson’s correlations, r-to-z transformation and PCA) with the Task FC.

      Sets of Features 15-18: Structural MRI (sMRI)

      sMRI reflects individual differences in brain anatomy. The HCP-A used an established pre-processing pipeline for sMRI (Glasser et al., 2013). We focused on four sets of features: cortical thickness, cortical surface area, subcortical volume and total brain volume. For cortical thickness and cortical surface area, we used Destrieux’s atlas (Destrieux et al., 2010; Fischl, 2012) from FreeSurfer’s “aparc.stats” file, resulting in 148 regions for each set of features. For subcortical volume, we used the aseg atlas (Fischl et al., 2002) from FreeSurfer’s “aseg.stats” file, resulting in 19 regions. For total brain volume, we had five FreeSurfer-based features: “FS_IntraCranial_Vol” or estimated intra-cranial volume, “FS_TotCort_GM_Vol” or total cortical grey matter volume, “FS_Tot_WM_Vol” or total cortical white matter volume, “FS_SubCort_GM_Vol” or total subcortical grey matter volume and “FS_BrainSegVol_eTIV_Ratio” or ratio of brain segmentation volume to estimated total intracranial volume.”

      Third, for regression methods and bias correction methods used, we included the following statements:

      From Methods:

      “For the machine learning algorithm, we used Elastic Net (Zou & Hastie, 2005). Elastic Net is a general form of penalised regressions (including Lasso and Ridge regression), allowing us to simultaneously draw information across different brain indices to predict one target variable. Penalised regressions are commonly used for building age-prediction models (Jirsaraie, Gorelik, et al., 2023). Previously we showed that the performance of Elastic Net in predicting cognitive abilities is on par, if not better than, many non-linear and more-complicated algorithms (Pat, Wang, Bartonicek, et al., 2022; Tetereva et al., 2022). Moreover, Elastic Net coefficients are readily explainable, allowing us the ability to explain how our age-prediction and cognition-prediction models made the prediction from each brain feature (Molnar, 2019; Pat, Wang, Bartonicek, et al., 2022) (see below).

      Elastic Net simultaneously minimises the weighted sum of the features’ coefficients. The degree of penalty to the sum of the feature’s coefficients is determined by a shrinkage hyperparameter ‘α’: the greater the α, the more the coefficients shrink, and the more regularised the model becomes. Elastic Net also includes another hyperparameter, ‘l1 ratio’, which determines the degree to which the sum of either the squared (known as ‘Ridge’; l1 ratio=0) or absolute (known as ‘Lasso’; l1 ratio=1) coefficients is penalised (Zou & Hastie, 2005). The objective function of Elastic Net as implemented by sklearn (Pedregosa et al., 2011) is defined as:

      where X is the features, y is the target, and β is the coefficient. In our grid search, we tuned two Elastic Net hyperparameters: α using 70 numbers in log space, ranging from .1 and 100, and l_1-ratio using 25 numbers in linear space, ranging from 0 and 1.

      To understand how Elastic Net made a prediction based on different brain features, we examined the coefficients of the tuned model. Elastic Net coefficients can be considered as feature importance, such that more positive Elastic Net coefficients lead to more positive predicted values and, similarly, more negative Elastic Net coefficients lead to more negative predicted values (Molnar, 2019; Pat, Wang, Bartonicek, et al., 2022). While the magnitude of Elastic Net coefficients is regularised (thus making it difficult for us to interpret the magnitude itself directly), we could still indicate that a brain feature with a higher magnitude weights relatively stronger in making a prediction. Another benefit of Elastic Net as a penalised regression is that the coefficients are less susceptible to collinearity among features as they have already been regularised (Dormann et al., 2013; Pat, Wang, Bartonicek, et al., 2022).

      Given that we used five-fold nested cross validation, different outer folds may have different degrees of ‘α’ and ‘l1 ratio’, making the final coefficients from different folds to be different. For instance, for certain sets of features, penalisation may not play a big part (i.e., higher or lower ‘α’ leads to similar predictive performance), resulting in different ‘α’ for different folds. To remedy this in the visualisation of Elastic Net feature importance, we refitted the Elastic Net model to the full dataset without splitting them into five folds and visualised the coefficients on brain images using Brainspace (Vos De Wael et al., 2020) and Nilern (Abraham et al., 2014) packages. Note, unlike other sets of features, Task FC and Rest FC were modelled after data reduction via PCA. Thus, for Task FC and Rest FC, we, first, multiplied the absolute PCA scores (extracted from the ‘components_’ attribute of ‘sklearn.decomposition.PCA’) with Elastic Net coefficients and, then, summed the multiplied values across the 75 components, leaving 71,631 ROI-pair indices. “

      References

      Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., & Varoquaux, G. (2014). Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics, 8, 14. https://doi.org/10.3389/fninf.2014.00014

      Ances, B. M., Liang, C. L., Leontiev, O., Perthen, J. E., Fleisher, A. S., Lansing, A. E., & Buxton, R. B. (2009). Effects of aging on cerebral blood flow, oxygen metabolism, and blood oxygenation level dependent responses to visual stimulation. Human Brain Mapping, 30(4), 1120–1132. https://doi.org/10.1002/hbm.20574

      Bashyam, V. M., Erus, G., Doshi, J., Habes, M., Nasrallah, I. M., Truelove-Hill, M., Srinivasan, D., Mamourian, L., Pomponio, R., Fan, Y., Launer, L. J., Masters, C. L., Maruff, P., Zhuo, C., Völzke, H., Johnson, S. C., Fripp, J., Koutsouleris, N., Satterthwaite, T. D., … on behalf of the ISTAGING Consortium, the P. A. disease C., ADNI, and CARDIA studies. (2020). MRI signatures of brain age and disease over the lifespan based on a deep brain network and 14 468 individuals worldwide. Brain, 143(7), 2312–2324. https://doi.org/10.1093/brain/awaa160

      Bookheimer, S. Y., Salat, D. H., Terpstra, M., Ances, B. M., Barch, D. M., Buckner, R. L., Burgess, G. C., Curtiss, S. W., Diaz-Santos, M., Elam, J. S., Fischl, B., Greve, D. N., Hagy, H. A., Harms, M. P., Hatch, O. M., Hedden, T., Hodge, C., Japardi, K. C., Kuhn, T. P., … Yacoub, E. (2019). The Lifespan Human Connectome Project in Aging: An overview. NeuroImage, 185, 335–348. https://doi.org/10.1016/j.neuroimage.2018.10.009

      Butler, E. R., Chen, A., Ramadan, R., Le, T. T., Ruparel, K., Moore, T. M., Satterthwaite, T. D., Zhang, F., Shou, H., Gur, R. C., Nichols, T. E., & Shinohara, R. T. (2021). Pitfalls in brain age analyses. Human Brain Mapping, 42(13), 4092–4101. https://doi.org/10.1002/hbm.25533

      Cole, J. H. (2020). Multimodality neuroimaging brain-age in UK biobank: Relationship to biomedical, lifestyle, and cognitive factors. Neurobiology of Aging, 92, 34–42. https://doi.org/10.1016/j.neurobiolaging.2020.03.014

      Destrieux, C., Fischl, B., Dale, A., & Halgren, E. (2010). Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage, 53(1), 1–15. https://doi.org/10.1016/j.neuroimage.2010.06.010

      Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J. R. G., Gruber, B., Lafourcade, B., Leitão, P. J., Münkemüller, T., McClean, C., Osborne, P. E., Reineking, B., Schröder, B., Skidmore, A. K., Zurell, D., & Lautenbach, S. (2013). Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27–46. https://doi.org/10.1111/j.1600-0587.2012.07348.x

      Dubois, J., Galdi, P., Paul, L. K., & Adolphs, R. (2018). A distributed brain network predicts general intelligence from resting-state human neuroimaging data. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1756), 20170284. https://doi.org/10.1098/rstb.2017.0284

      Elliott, M. L., Knodt, A. R., Cooke, M., Kim, M. J., Melzer, T. R., Keenan, R., Ireland, D., Ramrakha, S., Poulton, R., Caspi, A., Moffitt, T. E., & Hariri, A. R. (2019). General functional connectivity: Shared features of resting-state and task fMRI drive reliable and heritable individual differences in functional brain networks. NeuroImage, 189, 516–532. https://doi.org/10.1016/j.neuroimage.2019.01.068

      Fair, D. A., Schlaggar, B. L., Cohen, A. L., Miezin, F. M., Dosenbach, N. U. F., Wenger, K. K., Fox, M. D., Snyder, A. Z., Raichle, M. E., & Petersen, S. E. (2007). A method for using blocked and event-related fMRI data to study “resting state” functional connectivity. NeuroImage, 35(1), 396–405. https://doi.org/10.1016/j.neuroimage.2006.11.051

      Fischl, B. (2012). FreeSurfer. NeuroImage, 62(2), 774–781. https://doi.org/10.1016/j.neuroimage.2012.01.021

      Fischl, B., Salat, D. H., Busa, E., Albert, M., Dieterich, M., Haselgrove, C., van der Kouwe, A., Killiany, R., Kennedy, D., Klaveness, S., Montillo, A., Makris, N., Rosen, B., & Dale, A. M. (2002). Whole Brain Segmentation. Neuron, 33(3), 341–355. https://doi.org/10.1016/S0896-6273(02)00569-X

      Glasser, M. F., Smith, S. M., Marcus, D. S., Andersson, J. L. R., Auerbach, E. J., Behrens, T. E. J., Coalson, T. S., Harms, M. P., Jenkinson, M., Moeller, S., Robinson, E. C., Sotiropoulos, S. N., Xu, J., Yacoub, E., Ugurbil, K., & Van Essen, D. C. (2016). The Human Connectome Project’s neuroimaging approach. Nature Neuroscience, 19(9), 1175–1187. https://doi.org/10.1038/nn.4361

      Glasser, M. F., Sotiropoulos, S. N., Wilson, J. A., Coalson, T. S., Fischl, B., Andersson, J. L., Xu, J., Jbabdi, S., Webster, M., Polimeni, J. R., Van Essen, D. C., & Jenkinson, M. (2013). The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage, 80, 105–124. https://doi.org/10.1016/j.neuroimage.2013.04.127

      Gordon, E. M., Laumann, T. O., Adeyemo, B., Huckins, J. F., Kelley, W. M., & Petersen, S. E. (2016). Generation and Evaluation of a Cortical Area Parcellation from Resting-State Correlations. Cerebral Cortex, 26(1), 288–303. https://doi.org/10.1093/cercor/bhu239

      Gratton, C., Laumann, T. O., Nielsen, A. N., Greene, D. J., Gordon, E. M., Gilmore, A. W., Nelson, S. M., Coalson, R. S., Snyder, A. Z., Schlaggar, B. L., Dosenbach, N. U. F., & Petersen, S. E. (2018). Functional Brain Networks Are Dominated by Stable Group and Individual Factors, Not Cognitive or Daily Variation. Neuron, 98(2), 439-452.e5. https://doi.org/10.1016/j.neuron.2018.03.035

      Hahn, T., Fisch, L., Ernsting, J., Winter, N. R., Leenings, R., Sarink, K., Emden, D., Kircher, T., Berger, K., & Dannlowski, U. (2021). From ‘loose fitting’ to high-performance, uncertainty-aware brain-age modelling. Brain, 144(3), e31–e31. https://doi.org/10.1093/brain/awaa454

      Harms, M. P., Somerville, L. H., Ances, B. M., Andersson, J., Barch, D. M., Bastiani, M., Bookheimer, S. Y., Brown, T. B., Buckner, R. L., Burgess, G. C., Coalson, T. S., Chappell, M. A., Dapretto, M., Douaud, G., Fischl, B., Glasser, M. F., Greve, D. N., Hodge, C., Jamison, K. W., … Yacoub, E. (2018). Extending the Human Connectome Project across ages: Imaging protocols for the Lifespan Development and Aging projects. NeuroImage, 183, 972–984. https://doi.org/10.1016/j.neuroimage.2018.09.060

      Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., Sanislow, C., & Wang, P. (2010). Research Domain Criteria (RDoC): Toward a New Classification Framework for Research on Mental Disorders. American Journal of Psychiatry, 167(7), 748–751. https://doi.org/10.1176/appi.ajp.2010.09091379

      Jirsaraie, R. J., Gorelik, A. J., Gatavins, M. M., Engemann, D. A., Bogdan, R., Barch, D. M., & Sotiras, A. (2023). A systematic review of multimodal brain age studies: Uncovering a divergence between model accuracy and utility. Patterns, 4(4), 100712. https://doi.org/10.1016/j.patter.2023.100712

      Jirsaraie, R. J., Kaufmann, T., Bashyam, V., Erus, G., Luby, J. L., Westlye, L. T., Davatzikos, C., Barch, D. M., & Sotiras, A. (2023). Benchmarking the generalizability of brain age models: Challenges posed by scanner variance and prediction bias. Human Brain Mapping, 44(3), 1118–1128. https://doi.org/10.1002/hbm.26144

      Marquand, A. F., Rezek, I., Buitelaar, J., & Beckmann, C. F. (2016). Understanding Heterogeneity in Clinical Cohorts Using Normative Models: Beyond Case-Control Studies. Biological Psychiatry, 80(7), 552–561. https://doi.org/10.1016/j.biopsych.2015.12.023

      Molnar, C. (2019). Interpretable Machine Learning. A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/

      Nimon, K., Lewis, M., Kane, R., & Haynes, R. M. (2008). An R package to compute commonality coefficients in the multiple regression case: An introduction to the package and a practical example. Behavior Research Methods, 40(2), 457–466. https://doi.org/10.3758/BRM.40.2.457

      Pat, N., Wang, Y., Anney, R., Riglin, L., Thapar, A., & Stringaris, A. (2022). Longitudinally stable, brain‐based predictive models mediate the relationships between childhood cognition and socio‐demographic, psychological and genetic factors. Human Brain Mapping, hbm.26027. https://doi.org/10.1002/hbm.26027

      Pat, N., Wang, Y., Bartonicek, A., Candia, J., & Stringaris, A. (2022). Explainable machine learning approach to predict and explain the relationship between task-based fMRI and individual differences in cognition. Cerebral Cortex, bhac235. https://doi.org/10.1093/cercor/bhac235

      Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825–2830.

      Poldrack, R. A., Huckins, G., & Varoquaux, G. (2020). Establishment of Best Practices for Evidence for Prediction: A Review. JAMA Psychiatry, 77(5), 534–540. https://doi.org/10.1001/jamapsychiatry.2019.3671

      Rasero, J., Sentis, A. I., Yeh, F.-C., & Verstynen, T. (2021). Integrating across neuroimaging modalities boosts prediction accuracy of cognitive ability. PLOS Computational Biology, 17(3), e1008347. https://doi.org/10.1371/journal.pcbi.1008347

      Robinson, E. C., Garcia, K., Glasser, M. F., Chen, Z., Coalson, T. S., Makropoulos, A., Bozek, J., Wright, R., Schuh, A., Webster, M., Hutter, J., Price, A., Cordero Grande, L., Hughes, E., Tusor, N., Bayly, P. V., Van Essen, D. C., Smith, S. M., Edwards, A. D., … Rueckert, D. (2018). Multimodal surface matching with higher-order smoothness constraints. NeuroImage, 167, 453–465. https://doi.org/10.1016/j.neuroimage.2017.10.037

      Rokicki, J., Wolfers, T., Nordhøy, W., Tesli, N., Quintana, D. S., Alnæs, D., Richard, G., de Lange, A.-M. G., Lund, M. J., Norbom, L., Agartz, I., Melle, I., Nærland, T., Selbæk, G., Persson, K., Nordvik, J. E., Schwarz, E., Andreassen, O. A., Kaufmann, T., & Westlye, L. T. (2021). Multimodal imaging improves brain age prediction and reveals distinct abnormalities in patients with psychiatric and neurological disorders. Human Brain Mapping, 42(6), 1714–1726. https://doi.org/10.1002/hbm.25323

      Somerville, L. H., Bookheimer, S. Y., Buckner, R. L., Burgess, G. C., Curtiss, S. W., Dapretto, M., Elam, J. S., Gaffrey, M. S., Harms, M. P., Hodge, C., Kandala, S., Kastman, E. K., Nichols, T. E., Schlaggar, B. L., Smith, S. M., Thomas, K. M., Yacoub, E., Van Essen, D. C., & Barch, D. M. (2018). The Lifespan Human Connectome Project in Development: A large-scale study of brain connectivity development in 5–21 year olds. NeuroImage, 183, 456–468. https://doi.org/10.1016/j.neuroimage.2018.08.050

      Sperling, R. A., Bates, J. F., Cocchiarella, A. J., Schacter, D. L., Rosen, B. R., & Albert, M. S. (2001). Encoding novel face-name associations: A functional MRI study. Human Brain Mapping, 14(3), 129–139. https://doi.org/10.1002/hbm.1047

      Sripada, C., Angstadt, M., Rutherford, S., Kessler, D., Kim, Y., Yee, M., & Levina, E. (2019). Basic Units of Inter-Individual Variation in Resting State Connectomes. Scientific Reports, 9(1), Article 1. https://doi.org/10.1038/s41598-018-38406-5

      Sripada, C., Angstadt, M., Rutherford, S., Taxali, A., & Shedden, K. (2020). Toward a “treadmill test” for cognition: Improved prediction of general cognitive ability from the task activated brain. Human Brain Mapping, 41(12), 3186–3197. https://doi.org/10.1002/hbm.25007

      Tetereva, A., Li, J., Deng, J. D., Stringaris, A., & Pat, N. (2022). Capturing brain‐cognition relationship: Integrating task‐based fMRI across tasks markedly boosts prediction and test‐retest reliability. NeuroImage, 263, 119588. https://doi.org/10.1016/j.neuroimage.2022.119588

      Vieira, B. H., Pamplona, G. S. P., Fachinello, K., Silva, A. K., Foss, M. P., & Salmon, C. E. G. (2022). On the prediction of human intelligence from neuroimaging: A systematic review of methods and reporting. Intelligence, 93, 101654. https://doi.org/10.1016/j.intell.2022.101654

      Vos De Wael, R., Benkarim, O., Paquola, C., Lariviere, S., Royer, J., Tavakol, S., Xu, T., Hong, S.-J., Langs, G., Valk, S., Misic, B., Milham, M., Margulies, D., Smallwood, J., & Bernhardt, B. C. (2020). BrainSpace: A toolbox for the analysis of macroscale gradients in neuroimaging and connectomics datasets. Communications Biology, 3(1), 103. https://doi.org/10.1038/s42003-020-0794-7

      Woolrich, M. W., Ripley, B. D., Brady, M., & Smith, S. M. (2001). Temporal Autocorrelation in Univariate Linear Modeling of FMRI Data. NeuroImage, 14(6), 1370–1386. https://doi.org/10.1006/nimg.2001.0931

      Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x


      The following is the authors’ response to the previous reviews.

      eLife assessment

      This useful manuscript challenges the utility of current paradigms for estimating brain-age with magnetic resonance imaging measures, but presents inadequate evidence to support the suggestion that an alternative approach focused on predicting cognition is more useful. The paper would benefit from a clearer explication of the methods and a more critical evaluation of the conceptual basis of the different models. This work will be of interest to researchers working on brain-age and related models.

      Thank you so much for providing high-quality reviews on our manuscript. We revised the manuscript to address all of the reviewers’ comments and provided full responses to each of the comments below. Importantly, in this revision, we clarified that we did not intend to use Brain Cognition as an alternative approach. This is because, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And such quantification is the third aim of this study.

      Public Reviews:

      Reviewer 1 (Public Review):

      In this paper, the authors evaluate the utility of brain-age-derived metrics for predicting cognitive decline by performing a 'commonality' analysis in a downstream regression that enables the different contribution of different predictors to be assessed. The main conclusion is that brain-age-derived metrics do not explain much additional variation in cognition over and above what is already explained by age. The authors propose to use a regression model trained to predict cognition ("brain-cognition") as an alternative suited to applications of cognitive decline. While this is less accurate overall than brain age, it explains more unique variance in the downstream regression.

      (1) I thank the authors for addressing many of my concerns with this revision. However, I do not feel they have addressed them all. In particular I think the authors could do more to address the concern I raised about the instability of the regression coefficients and about providing enough detail to determine that the stacked regression models do not overfit.

      Thank you Reviewer 1 for the comment. We addressed them in our response to Reviewer 1 Recommendations For The Authors #1 and #2 (see below).

      (2) In considering my responses to the authors revision, I also must say that I agree with Reviewer 3 about the limitations of the brain age and brain cognition methods conceptually. In particular that the regression model used to predict fluid cognition will by construction explain more variance in cognition than a brain age model that is trained to predict age. To be fair, these conceptual problems are more widespread than this paper alone, so I do not believe the authors should be penalised for that. However, I would recommend to make these concerns more explicit in the manuscript

      Thank you Reviewer 1 for the comment. We addressed them in our response to Reviewer 1 Recommendations For The Authors #3 (see below).

      Reviewer 2 (Public Review):

      In this study, the authors aimed to evaluate the contribution of brain-age indices in capturing variance in cognitive decline and proposed an alternative index, brain-cognition, for consideration.

      The study employs suitable methods and data to address the research questions, and the methods and results sections are generally clear and easy to follow.

      I appreciate the authors' efforts in significantly improving the paper, including some considerable changes, from the original submission. While not all reviewer points were tackled, the majority of them were adequately addressed. These include additional analyses, more clarity in the methods and a much richer and nuanced discussion. While recognising the merits of the revised paper, I have a few additional comments.

      (1) Perhaps it would help the reader to note that it might be expected for brain-cognition to account for a significantly larger variance (11%) in fluid cognition, in contrast to brain-age. This stems from the fact that the authors specifically trained brain-cognition to predict fluid cognition, the very variable under consideration. In line with this, the authors later recommend that researchers considering the use of brain-age should evaluate its utility using a regression approach. The latter involves including a brain index (e.g. brain-cognition) previously trained to predict the regression's target variable (e.g. fluid cognition) alongside a brain-age index (e.g., corrected brain-age gap). If the target-trained brain index outperforms the brain-age metric, it suggests that relying solely on brain-age might not be the optimal choice. Although not necessarily the case, is it surprising for the target-trained brain index to demonstrate better performance than brain-age? This harks back to the broader point raised in the initial review: while brain-age may prove useful (though sometimes with modest effect sizes) across diverse outcomes as a generally applicable metric, a brain index tailored for predicting a specific outcome, such as brain-cognition in this case, might capture a considerably larger share of variance in that specific context but could lack broader applicability. The latter aspect needs to be empirically assessed.

      Thank you so much for raising this point. Reviewer 1 (Public Review #2/Recommendations For The Authors #3) and Reviewer 3 (Recommendations for the Authors #1) made a similar observation. We now made changes to the introduction and discussion to address this concern (please see our responses to Reviewer 1 Recommendations For The Authors #3 below).

      Briefly, as in our 2nd revision, we did not intend to compare Brain Age with Brain Cognition since, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Here we made this point more explicit and further stated that the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. By examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And such quantification is the third aim of this study.

      (2) Furthermore, the discussion pertaining to training brain-age models on healthy populations for subsequent testing on individuals with neurological or psychological disorders seems somewhat one-sided within the broader debate. This one-sidedness might potentially confuse readers. It is worth noting that the choice to employ healthy participants in the training model is likely deliberate, serving as a norm against which atypical populations are compared. To provide a more comprehensive understanding, referencing Tim Hans's counterargument to Bashyam's perspective could offer a more complete view (https://academic.oup.com/brain/article/144/3/e31/6214475?login=false).

      Thank you Reviewer 2 for bringing up this issue. We have now revised the paragraph in question and added nuances on the usage of Brain Age for normative vs. case-control studies. We also cited Tim Hahn’s article that explained the conceptual foundation of the use of Brain Age in case-control studies. Please see below. Additionally, we also made a statement about our study not being able to address issues about the case-control studies directly in the newly written conclusion (see Reviewer 3 Recommendations for the Authors #3).

      Discussion:

      “There is a notable difference between studies investigating the utility of Brain Age in explaining cognitive functioning, including ours and others (e.g., Butler et al., 2021; Cole, 2020, 2020; Jirsaraie et al., 2023) and those explaining neurological/psychological disorders (e.g., Bashyam et al., 2020; Rokicki et al., 2021). We consider the former as a normative type of study and the latter as a case-control type of study (Insel et al., 2010; Marquand et al., 2016). Those case-control Brain Age studies focusing on neurological/psychological disorders often build age-prediction models from MRI data of largely healthy participants (e.g., controls in a case-control design or large samples in a population-based design), apply the built age-prediction models to participants without vs. with neurological/psychological disorders and compare Brain Age indices between the two groups. On the one hand, this means that case-control studies treat Brain Age as a method to detect anomalies in the neurological/psychological group (Hahn et al., 2021). On the other hand, this also means that case-control studies have to ignore under-fitted models when applied prediction models built from largely healthy participants to participants with neurological/psychological disorders (i.e., Brain Age may predict chronological age well for the controls, but not for those with a disorder). On the contrary, our study and other normative studies focusing on cognitive functioning often build age-prediction models from MRI data of largely healthy participants and apply the built age-prediction models to participants who are also largely healthy. Accordingly, the age-prediction models for explaining cognitive functioning in normative studies, while not allowing us to detect group-level anomalies, do not suffer from being under-fitted. This unfortunately might limit the generalisability of our study into just the normative type of study. Future work is still needed to test the utility of brain age in the case-control case.”

      (3) Overall, this paper makes a significant contribution to the field of brain-age and related brain indices and their utility.

      Thank you for the encouragement.

      Reviewer 3 (Public Review):

      The main question of this article is as follows: "To what extent does having information on brain-age improve our ability to capture declines in fluid cognition beyond knowing a person's chronological age?" This question is worthwhile, considering that there is considerable confusion in the field about the nature of brain-age.

      (1) Thank you to the authors for addressing so many of my concerns with this revision. There are a few points that I feel still need addressing/clarifying related to 1) calculating brain cognition, 2) the inevitability of their results, and 3) their continued recommendation to use brain-age metrics.

      Thank you Reviewer 3 for the comment. We addressed them in our response to Reviewer 3 Recommendations For The Authors #1-3 (see below).

      Recommendations for the authors:

      Reviewer 1 (Recommendations For The Authors):

      (1) I do not feel the authors have fully addressed the concern I raised about the stacked regression models. Despite the new figure, it is still not entirely clear what the authors are using as the training set in the final step. To be clear, the problem occurs because of the parameters, not the hyperparameters (which the authors now state that they are optimising via nested grid search). in other words, given a regression model y = X*beta, if the X are taken to be predictions from a lower level regression model, then they contain information that is derived from both the training set at the test set for the model that this was trained on. If the split is the same (i.e. the predictions are derived on the same test set as is being used at the second level), then this can lead to overfitting. It is not clear to me whether the authors have done this or not. Please provide additional detail to clarify this point.

      Thank you for allowing us an opportunity to clarify our stacked model. We wanted to confirm that we did not use test sets to build a stacked model in both lower and higher levels of the Elastic Net models. Test sets were there just for testing the performance of the models. We made additional clarification to make this clearer (see below). Let us explain what we did and provide the rationales below.

      From Methods:

      “We used nested cross-validation (CV) to build these prediction models (see Figure 7). We first split the data into five outer folds, leaving each outer fold with around 100 participants. This number of participants in each fold is to ensure the stability of the test performance across folds. In each outer-fold CV loop, one of the outer folds was treated as an outer-fold test set, and the rest was treated as an outer-fold training set. Ultimately, looping through the nested CV resulted in a) prediction models from each of the 18 sets of features as well as b) prediction models that drew information across different combinations of the 18 separate sets, known as “stacked models.” We specified eight stacked models: “All” (i.e., including all 18 sets of features), “All excluding Task FC”, “All excluding Task Contrast”, “Non-Task” (i.e., including only Rest FC and sMRI), “Resting and Task FC”, “Task Contrast and FC”, “Task Contrast” and “Task FC”. Accordingly, there were 26 prediction models in total for both Brain Age and Brain Cognition.

      To create these 26 prediction models, we applied three steps for each outer-fold loop. The first step aimed at tuning prediction models for each of 18 sets of features. This step only involved the outer-fold training set and did not involve the outer-fold test set. Here, we divided the outer-fold training set into five inner folds and applied inner-fold CV to tune hyperparameters with grid search. Specifically, in each inner-fold CV, one of the inner folds was treated as an inner-fold validation set, and the rest was treated as an inner-fold training set. Within each inner-fold CV loop, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters and applied the estimated model to the inner-fold validation set. After looping through the inner-fold CV, we, then, chose the prediction models that led to the highest performance, reflected by coefficient of determination (R2), on average across the inner-fold validation sets. This led to 18 tuned models, one for each of the 18 sets of features, for each outer fold.

      The second step aimed at tuning stacked models. Same as the first step, the second step only involved the outer-fold training set and did not involve the outer-fold test set. Here, using the same outer-fold training set as the first step, we applied tuned models, created from the first step, one from each of the 18 sets of features, resulting in 18 predicted values for each participant. We, then, re-divided this outer-fold training set into new five inner folds. In each inner fold, we treated different combinations of the 18 predicted values from separate sets of features as features to predict the targets in separate “stacked” models. Same as the first step, in each inner-fold CV loop, we treated one out of five inner folds as an inner-fold validation set, and the rest as an inner-fold training set. Also as in the first step, we used the inner-fold training set to estimate parameters of the prediction model with a particular set of hyperparameters from our grid. We tuned the hyperparameters of stacked models using grid search by selecting the models with the highest R2 on average across the inner-fold validation sets. This led to eight tuned stacked models.

      The third step aimed at testing the predictive performance of the 18 tuned prediction models from each of the set of features, built from the first step, and eight tuned stacked models, built from the second step. Unlike the first two steps, here we applied the already tuned models to the outer-fold test set. We started by applying the 18 tuned prediction models from each of the sets of features to each observation in the outer-fold test set, resulting in 18 predicted values. We then applied the tuned stacked models to these predicted values from separate sets of features, resulting in eight predicted values.

      To demonstrate the predictive performance, we assessed the similarity between the observed values and the predicted values of each model across outer-fold test sets, using Pearson’s r, coefficient of determination (R2) and mean absolute error (MAE). Note that for R2, we used the sum of squares definition (i.e., R2 = 1 – (sum of squares residuals/total sum of squares)) per a previous recommendation (Poldrack et al., 2020). We considered the predicted values from the outer-fold test sets of models predicting age or fluid cognition, as Brain Age and Brain Cognition, respectively.”

      Author response image 1.

      Diagram of the nested cross-validation used for creating predictions for models of each set of features as well as predictions for stacked models.

      Note some previous research, including ours (Tetereva et al., 2022), splits the observations in the outer-fold training set into layer 1 and layer 2 and applies the first and second steps to layers 1 and 2, respectively. Here we decided against this approach and used the same outer-fold training set for both first and second steps in order to avoid potential bias toward the stacked models. This is because, when the data are split into two layers, predictive models built for each separate set of features only use the data from layer 1, while the stacked models use the data from both layers 1 and 2. In practice with large enough data, these two approaches might not differ much, as we demonstrated previously (Tetereva et al., 2022).

      (2) I also do not feel the authors have fully addressed the concern I raised about stability of the regression coefficients over splits of the data. I wanted to see the regression coefficients, not the predictions. The predictions can be stable when the coefficients are not.

      The focus of this article is on the predictions. Still, as pointed out by reviewer 1, it is informative for readers to understand how stable the feature importance (i.e., Elastic Net coefficients) is. To demonstrate the stability of feature importance, we now examined the rank stability of feature importance using Spearman’s ρ (see Figure 4). Specifically, we correlated the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, we computed 10 Spearman’s ρ for each prediction model of the same features. We found Spearman’s ρ to be varied dramatically in both age-prediction (range=.31-.94) and fluid cognition-prediction (range=.16-.84) models. This means that some prediction models were much more stable in their feature importance than others. This is probably due to various factors such as a) the collinearity of features in the model, b) the number of features (e.g., 71,631 features in functional connectivity, which were further reduced to 75 PCAs, as compared to 19 features in subcortical volume based on the ASEG atlas), c) the penalisation of coefficients either with ‘Ridge’ or ‘Lasso’ methods, which resulted in reduction as a group of features or selection of a feature among correlated features, respectively, and d) the predictive performance of the models. Understanding the stability of feature importance is beyond the scope of the current article. As mentioned by Reviewer 1, “The predictions can be stable when the coefficients are not,” and we chose to focus on the prediction in the current article.

      Author response image 2.

      Stability of feature importance (i.e., Elastic Net Coefficients) of prediction models. Each dot represents rank stability (reflected by Spearman’s ρ) in the feature importance between two prediction models of the same features, used in two different outer-fold test sets. Given that there were five outer-fold test sets, there were 10 Spearman’s ρs for each prediction model. The numbers to the right of the plots indicate the mean of Spearman’s ρ for each prediction model.

      (3) I also must say that I agree with Reviewer 3 about the limitations of the brain-age and brain-cognition methods conceptually. In particular that the regression model used to predict fluid cognition will by construction explain more variance in cognition than a brain-age model that is trained to predict age. This suffers from the same problem the authors raise with brain-age and I agree that this would probably disappear if the authors had a separate measure of cognition against which to validate and were then to regress this out as they do for age correction. I am aware that these conceptual problems are more widespread than this paper alone (in fact throughout the brain-age literature), so I do not believe the authors should be penalised for that. However, I do think they can make these concerns more explicit and further tone down the comments they make about the utility of brain-cognition.

      Thank you so much for raising this point. Reviewer 2 (Public Review #1) and Reviewer 3 (Recommendations for the Authors #1) made a similar observation. We now made changes to the introduction and discussion to address this concern (see below).

      Briefly, we made it explicit that, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. That is, the relationship between Brain Cognition and fluid cognition indicates the upper limit of Brain Age’s capability in capturing fluid cognition. More importantly, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age. And this is the third goal of this present study.

      From Introduction:

      “Third and finally, certain variation in fluid cognition is related to brain MRI, but to what extent does Brain Age not capture this variation? To estimate the variation in fluid cognition that is related to the brain MRI, we could build prediction models that directly predict fluid cognition (i.e., as opposed to chronological age) from brain MRI data. Previous studies found reasonable predictive performances of these cognition-prediction models, built from certain MRI modalities (Dubois et al., 2018; Pat et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). Analogous to Brain Age, we called the predicted values from these cognition-prediction models, Brain Cognition. The strength of an out-of-sample relationship between Brain Cognition and fluid cognition reflects variation in fluid cognition that is related to the brain MRI and, therefore, indicates the upper limit of Brain Age’s capability in capturing fluid cognition. This is, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. Consequently, if we included Brain Cognition, Brain Age and chronological age in the same model to explain fluid cognition, we would be able to examine the unique effects of Brain Cognition that explain fluid cognition beyond Brain Age and chronological age. These unique effects of Brain Cognition, in turn, would indicate the amount of co-variation between brain MRI and fluid cognition that is missed by Brain Age.”

      From Discussion:

      “Third, by introducing Brain Cognition, we showed the extent to which Brain Age indices were not able to capture the variation in fluid cognition that is related to brain MRI. More specifically, using Brain Cognition allowed us to gauge the variation in fluid cognition that is related to the brain MRI, and thereby, to estimate the upper limit of what Brain Age can do. Moreover, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      From our results, Brain Cognition, especially from certain cognition-prediction models such as the stacked models, has relatively good predictive performance, consistent with previous studies (Dubois et al., 2018; Pat et al., 2022; Rasero et al., 2021; Sripada et al., 2020; Tetereva et al., 2022; for review, see Vieira et al., 2022). We then examined Brain Cognition using commonality analyses (Nimon et al., 2008) in multiple regression models having a Brain Age index, chronological age and Brain Cognition as regressors to explain fluid cognition. Similar to Brain Age indices, Brain Cognition exhibited large common effects with chronological age. But more importantly, unlike Brain Age indices, Brain Cognition showed large unique effects, up to around 11%. As explained above, the unique effects of Brain Cognition indicated the amount of co-variation between brain MRI and fluid cognition that was missed by a Brain Age index and chronological age. This missing amount was relatively high, considering that Brain Age and chronological age together explained around 32% of the total variation in fluid cognition. Accordingly, if a Brain Age index was used as a biomarker along with chronological age, we would have missed an opportunity to improve the performance of the model by around one-third of the variation explained.”

      Reviewer #3 (Recommendations For The Authors):

      Thank you to the authors for addressing so many of my concerns with this revision. There are a few points that I feel still need addressing/clarifying related to: 1) calculating brain cognition, 2) the inevitability of their results, and 3) their continued recommendation to use brain age metrics.

      (1) I understand your point here. I think the distinction is that it is fine to build predictive models, but then there is no need to go through this intermediate step of "brain-cognition". Just say that brain features can predict cognition XX well, and brain-age (or some related metric) can predict cognition YY well. It creates a confusing framework for the reader that can lead them to believe that "brain-cognition" is not just a predicted value of fluid cognition from a model using brain features to predict cognition. While you clearly state that that is in fact what it is in the text, which is a huge improvement, I do not see what is added by going through brain-cognition instead of simply just obtaining a change in R2 where the first model uses brain features alone to predict cognition, and the second adds on brain-age (or related metrics), or visa versa, depending on the question. Please do this analysis, and either compare and contrast it with going through "brain-cognition" in your paper, or switch to this analysis, as it more directly addresses the question of the incremental predictive utility of brain-age above and beyond brain features.

      Thank you so much for raising this point. Reviewer 1 (Public Review #2/Recommendations For The Authors #3) and Reviewer 2 (Public Review #1) made a similar observation. We now made changes to the introduction and discussion to address this concern (see our responses to Reviewer 1 Recommendations For The Authors #3 above).

      Briefly, as in our 2nd revision, we made it explicitly clear that we did not intend to compare Brain Age with Brain Cognition since, by design, the variation in fluid cognition explained by Brain Cognition should be higher or equal to that explained by Brain Age. And, by examining what was captured by Brain Cognition, over and above Brain Age and chronological age via the unique effects of Brain Cognition, we were able to quantify the amount of co-variation between brain MRI and fluid cognition that was missed by Brain Age.

      We have thought about changing the name Brain Cognition into something along the lines of “predicted values of prediction models predicting fluid cognition based on brain MRI.” However, this made the manuscript hard to follow, especially with the commonality analyses. For instance, the sentence, “Here, we tested Brain Cognition’s unique effects in multiple regression models with a Brain Age index, chronological age and Brain Cognition as regressors to explain fluid cognition” would become “Here, we tested predicted values of prediction models predicting fluid cognition based on brain MRI unique effects in multiple regression models with a Brain Age index, chronological age and predicted values of prediction models predicting fluid cognition based on brain MRI as regressors to explain fluid cognition.” We believe, given our additional explanation (see our responses to Reviewer 1 Recommendations For The Authors #3 above), readers should understand what Brain Cognition is, and that we did not intend to compare Brain Age and Brain Cognition directly.

      As for the suggested analysis, “obtaining a change in R2 where the first model uses brain features alone to predict cognition, and the second adds on brain-age (or related metrics), or visa versa,” we have already done this in the form of commonality analysis (Nimon et al., 2008) (see Figure 7 below). That is, to obtain unique and common effects of the regressors, we need to look at all of the possible changes in R2 when all possible subsets of regressors were excluded or included, see equations 12 and 13 below.

      From Methods:

      “Similar to the above multiple regression model, we had chronological age, each Brain Age index and Brain Cognition as the regressors for fluid cognition:

      Fluid Cognitioni = β0 + β1 Chronological Agei + β2 Brain Age Indexi,j + β3 Brain Cognitioni + εi, (12)

      Applying the commonality analysis here allowed us, first, to investigate the addictive, unique effects of Brain Cognition, over and above chronological age and Brain Age indices. More importantly, the commonality analysis also enabled us to test the common, shared effects that Brain Cognition had with chronological age and Brain Age indices in explaining fluid cognition. We calculated the commonality analysis as follows (Nimon et al., 2017):

      Unique Effectchronological age = ΔR2chronological age = R2chronological age, Brain Age index, Brain Cognition – R2 Brain Age index, Brain Cognition

      Unique EffectBrain Age index = ΔR2Brain Age index = R2chronological age, Brain Age index, Brain Cognition – R2 chronological age, Brain Cognition

      Unique EffectBrain Cognition = ΔR2Brain Cognition = R2chronological age, Brain Age index, Brain Cognition – R2 chronological age, Brain Age Index

      Common Effectchronological age, Brain Age index = R2chronological age, Brain Cognition + R2 Brain Age index, Brain Cognition – R2 Brain Cognition – R2chronological age, Brain Age index, Brain Cognition

      Common Effectchronological age, Brain Cognition = R2chronological age, Brain Age Index + R2 Brain Age index, Brain Cognition – R2 Brain Age Index – R2chronological age, Brain Age index, Brain Cognition

      Common Effect Brain Age index, Brain Cognition = R2chronological age, Brain Age Index + R2 chronological age, Brain Cognition – R2 chronological age – R2chronological age, Brain Age index, Brain Cognition

      Common Effect chronological age, Brain Age index, Brain Cognition = R2 chronological age + R2 Brain Age Index + R2 Brain Cognition – R2chronological age, Brain Age Index – R2 chronological age, Brain Cognition – R2 Brain Age Index, Brain Cognition – R2chronological age, Brain Age index, Brain Cognition , (13)”

      (2) I agree that the solution is not to exclude age as a covariate, and that there is a big difference between inevitable and obvious. I simply think a further discussion of the inevitability of the results would be clarifying for the readers. There is a big opportunity in the brain-age literature to be as direct as possible about why you are finding what you are finding. People need to know not only what you found, but why you found what you found.

      Thank you. We agreed that we need to make this point more explicit and direct. In the revised manuscript, we had the statements in both Introduction and Discussion (see below) about the tight relationship between Brain Age and chronological age by design, making the small unique effects of Brain Age inevitable.

      Introduction:

      “Accordingly, by design, Brain Age is tightly close to chronological age. Because chronological age usually has a strong relationship with fluid cognition, to begin with, it is unclear how much Brain Age adds to what is already captured by chronological age.“

      Discussion:

      “First, Brain Age itself did not add much more information to help us capture fluid cognition than what we had already known from a person’s chronological age. This can clearly be seen from the small unique effects of Brain Age indices in the multiple regression models having Brain Age and chronological age as the regressors. While the unique effects of some Brain Age indices from certain age-prediction models were statistically significant, there were all relatively small. Without Brain Age indices, chronological age by itself already explained around 32% of the variation in fluid cognition. Including Brain Age indices only added around 1.6% at best. We believe the small unique effects of Brain Age were inevitable because, by design, Brain Age is tightly close to chronological age. Therefore, chronological age and Brain Age captured mostly a similar variation in fluid cognition.

      Investigating the simple regression models and the commonality analysis between each Brain Age index and chronological age provided additional insights….”

      (3) I believe it is very important to critically examine the use of brain-age and related metrics. As part of this process, I think we should be asking ourselves the following questions (among others): Why go through age prediction? Wouldn't the predictions of cognition (or another variable) using the same set of brain features always be as good or better? You still have not justified the use of brain-age. As I said before, if you are going to continue to recommend the use of brain-age, you need a very strong argument for why you are recommending this. What does it truly add? Otherwise, temper your statements to indicate possible better paths forward.

      Thank you Reviewer 3 for making an argument against the use of Brain Age. We largely agree with you. However, our work only focuses on one phenotype, fluid cognition, and on the normative situation (i.e., not having a case vs control group). As Reviewer 2 pointed out, Brain Age might still have utility in other cases, not studied here. Still, future studies that focus on other phenotypes may consider using our approach as a template to test the utility of Brain Age in other situations. We added the conclusion statement to reflect this.

      From Discussion:

      “Altogether, we examined the utility of Brain Age as a biomarker for fluid cognition. Here are the three conclusions. First, Brain Age failed to add substantially more information over and above chronological age. Second, a higher ability to predict chronological age did not correspond to a higher utility to capture fluid cognition. Third, Brain Age missed up to around one-third of the variation in fluid cognition that could have been explained by brain MRI. Yet, given our focus on fluid cognition, future empirical research is needed to test the utility of Brain Age on other phenotypes, especially when Brain Age is used for anomaly detection in case-control studies (e.g., Bashyam et al., 2020; Rokicki et al., 2021). We hope that future studies may consider applying our approach (i.e., using the commonality analysis that includes predicted values from a model that directly predicts the phenotype of interest) to test the utility of Brain Age as a biomarker for other phenotypes.”

      References

      Bashyam, V. M., Erus, G., Doshi, J., Habes, M., Nasrallah, I. M., Truelove-Hill, M., Srinivasan, D., Mamourian, L., Pomponio, R., Fan, Y., Launer, L. J., Masters, C. L., Maruff, P., Zhuo, C., Völzke, H., Johnson, S. C., Fripp, J., Koutsouleris, N., Satterthwaite, T. D., … on behalf of the ISTAGING Consortium, the P. A. disease C., ADNI, and CARDIA studies. (2020). MRI signatures of brain age and disease over the lifespan based on a deep brain network and 14 468 individuals worldwide. Brain, 143(7), 2312–2324. https://doi.org/10.1093/brain/awaa160

      Butler, E. R., Chen, A., Ramadan, R., Le, T. T., Ruparel, K., Moore, T. M., Satterthwaite, T. D., Zhang, F., Shou, H., Gur, R. C., Nichols, T. E., & Shinohara, R. T. (2021). Pitfalls in brain age analyses. Human Brain Mapping, 42(13), 4092–4101. https://doi.org/10.1002/hbm.25533

      Cole, J. H. (2020). Multimodality neuroimaging brain-age in UK biobank: Relationship to biomedical, lifestyle, and cognitive factors. Neurobiology of Aging, 92, 34–42. https://doi.org/10.1016/j.neurobiolaging.2020.03.014

      Dubois, J., Galdi, P., Paul, L. K., & Adolphs, R. (2018). A distributed brain network predicts general intelligence from resting-state human neuroimaging data. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1756), 20170284. https://doi.org/10.1098/rstb.2017.0284

      Hahn, T., Fisch, L., Ernsting, J., Winter, N. R., Leenings, R., Sarink, K., Emden, D., Kircher, T., Berger, K., & Dannlowski, U. (2021). From ‘loose fitting’ to high-performance, uncertainty-aware brain-age modelling. Brain, 144(3), e31–e31. https://doi.org/10.1093/brain/awaa454

      Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., Sanislow, C., & Wang, P. (2010). Research Domain Criteria (RDoC): Toward a New Classification Framework for Research on Mental Disorders. American Journal of Psychiatry, 167(7), 748–751. https://doi.org/10.1176/appi.ajp.2010.09091379

      Jirsaraie, R. J., Kaufmann, T., Bashyam, V., Erus, G., Luby, J. L., Westlye, L. T., Davatzikos, C., Barch, D. M., & Sotiras, A. (2023). Benchmarking the generalizability of brain age models: Challenges posed by scanner variance and prediction bias. Human Brain Mapping, 44(3), 1118–1128. https://doi.org/10.1002/hbm.26144

      Marquand, A. F., Rezek, I., Buitelaar, J., & Beckmann, C. F. (2016). Understanding Heterogeneity in Clinical Cohorts Using Normative Models: Beyond Case-Control Studies. Biological Psychiatry, 80(7), 552–561. https://doi.org/10.1016/j.biopsych.2015.12.023

      Nimon, K., Lewis, M., Kane, R., & Haynes, R. M. (2008). An R package to compute commonality coefficients in the multiple regression case: An introduction to the package and a practical example. Behavior Research Methods, 40(2), 457–466. https://doi.org/10.3758/BRM.40.2.457

      Pat, N., Wang, Y., Anney, R., Riglin, L., Thapar, A., & Stringaris, A. (2022). Longitudinally stable, brain‐based predictive models mediate the relationships between childhood cognition and socio‐demographic, psychological and genetic factors. Human Brain Mapping, hbm.26027. https://doi.org/10.1002/hbm.26027

      Poldrack, R. A., Huckins, G., & Varoquaux, G. (2020). Establishment of Best Practices for Evidence for Prediction: A Review. JAMA Psychiatry, 77(5), 534–540. https://doi.org/10.1001/jamapsychiatry.2019.3671

      Rasero, J., Sentis, A. I., Yeh, F.-C., & Verstynen, T. (2021). Integrating across neuroimaging modalities boosts prediction accuracy of cognitive ability. PLOS Computational Biology, 17(3), e1008347. https://doi.org/10.1371/journal.pcbi.1008347

      Rokicki, J., Wolfers, T., Nordhøy, W., Tesli, N., Quintana, D. S., Alnæs, D., Richard, G., de Lange, A.-M. G., Lund, M. J., Norbom, L., Agartz, I., Melle, I., Nærland, T., Selbæk, G., Persson, K., Nordvik, J. E., Schwarz, E., Andreassen, O. A., Kaufmann, T., & Westlye, L. T. (2021). Multimodal imaging improves brain age prediction and reveals distinct abnormalities in patients with psychiatric and neurological disorders. Human Brain Mapping, 42(6), 1714–1726. https://doi.org/10.1002/hbm.25323

      Sripada, C., Angstadt, M., Rutherford, S., Taxali, A., & Shedden, K. (2020). Toward a “treadmill test” for cognition: Improved prediction of general cognitive ability from the task activated brain. Human Brain Mapping, 41(12), 3186–3197. https://doi.org/10.1002/hbm.25007

      Tetereva, A., Li, J., Deng, J. D., Stringaris, A., & Pat, N. (2022). Capturing brain‐cognition relationship: Integrating task‐based fMRI across tasks markedly boosts prediction and test‐retest reliability. NeuroImage, 263, 119588. https://doi.org/10.1016/j.neuroimage.2022.119588

      Vieira, B. H., Pamplona, G. S. P., Fachinello, K., Silva, A. K., Foss, M. P., & Salmon, C. E. G. (2022). On the prediction of human intelligence from neuroimaging: A systematic review of methods and reporting. Intelligence, 93, 101654. https://doi.org/10.1016/j.intell.2022.101654

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife assessment

      This study presents valuable data on the antigenic properties of neuraminidase proteins of human A/H3N2 influenza viruses sampled between 2009 and 2017. The antigenic properties are found to be generally concordant with genetic groups. Additional analysis have strengthened the revised manuscript, and the evidence supporting the claims is solid.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary

      The authors investigated the antigenic diversity of recent (2009-2017) A/H3N2 influenza neuraminidases (NAs), the second major antigenic protein after haemagglutinin. They used 27 viruses and 43 ferret sera and performed NA inhibition. This work was supported by a subset of mouse sera. Clustering analysis determined 4 antigenic clusters, mostly in concordance with the genetic groupings. Association analysis was used to estimate important amino acid positions, which were shown to be more likely close to the catalytic site. Antigenic distances were calculated and a random forest model used to determine potential important sites.

      This revision has addressed many of my concerns of inconsistencies in the methods, results and presentation. There are still some remaining weaknesses in the computational work.

      Strengths

      (1) The data cover recent NA evolution and a substantial number (43) of ferret (and mouse) sera were generated and titrated against 27 viruses. This is laborious experimental work and is the largest publicly available neuraminidase inhibition dataset that I am aware of. As such, it will prove a useful resource for the influenza community.

      (2) A variety of computational methods were used to analyse the data, which give a rounded picture of the antigenic and genetic relationships and link between sequence, structure and phenotype.

      (3) Issues raised in the previous review have been thoroughly addressed.

      Weaknesses

      (1). Some inconsistencies and missing data in experimental methods Two ferret sera were boosted with H1N2, while recombinant NA protein for the others. This, and the underlying reason, are clearly explained in the manuscript. The authors note that boosting with live virus did not increase titres. Additionally, one homologous serum (A/Kansas/14/2017) was not generated, although this would not necessarily have impacted the results.

      We agree with the reviewer and this point was addressed in the previous rebuttal.

      (2) Inconsistency in experimental results

      Clustering of the NA inhibition results identifies three viruses which do not cluster with their phylogenetic group. Again this is clearly pointed out in the paper and is consistent with the two replicate ferret sera. Additionally, A/Kansas/14/2017 is in a different cluster based on the antigenic cartography vs the clustering of the titres

      We agree with the reviewer and this point was addressed in the previous rebuttal.

      (3) Antigenic cartography plot would benefit from documentation of the parameters and supporting analyses

      a. The number of optimisations used

      We used 500 optimizations. This information is now included in the Methods section.

      b. The final stress and the difference between the stress of the lowest few (e.g. 5) optimisations, or alternatively a graph of the stress of all the optimisations. Information on the stress per titre and per point, and whether any of these were outliers

      The stress was obtained from 1, 5, 500, or even 5000 optimizations (resulting in stress values of respectively, 1366.47, 1366.47, 2908.60, and 3031.41). Besides limited variation or non-conversion of the stress values after optimization, the obtained maps were consistent in multiple runs. The map was obtained keeping the best optimization (stress value 1366.47, selected using the keepBestOptimization() function).

      Author response image 1.

      The stress per point is presented in the heat map below.

      The heat map indicates stress per serum (x-axis) and strain (y-axis) in blue to red scale.

      c. A measure of uncertainty in position (e.g. from bootstrapping)

      Bootstrap was performed using 1000 repeats and 100 optimizations per repeat. The uncertainty is represented in the blob plot below.

      Author response image 2.

      (4) Random forest

      The full dataset was used for the random forest model, including tuning the hyperparameters. It is more robust to have a training and test set to be able to evaluate overfitting (there are 25 features to classify 43 sera).

      Explicit cross validation is not necessary for random forests as the out of bag process with multiple trees implicitly covers cross validation. In the random forest function in R this is done by setting the mtry argument (number of variables randomly sampled as candidates at each split). R samples variables with replacement (the same variable can be sampled multiple times) of the candidates from the training set. RF will then automatically take the data that is not selected as candidates as test set. Overfit may happen when all data is used for training but the RF method implicitly does use a test set and does not use all data for training.

      Code:

      rf <- randomForest(X,y=Y,ntree=1500,mtry=25,keep.forest=TRUE,importance=TRUE)

      Reviewer #2 (Public Review):

      Summary:

      The authors characterized the antigenicity of N2 protein of 43 selected A(H3N2) influenza A viruses isolated from 2009-2017 using ferret and mice immune sera. Four antigenic groups were identified, which the authors claimed to be correlated with their respective phylogenic/ genetic groups. Among 102 amino acids differed by the 44 selected N2 proteins, the authors identified residues that differentiate the antigenicity of the four groups and constructed a machine-learning model that provides antigenic distance estimation. Three recent A(H3N2) vaccine strains were tested in the model but there was no experimental data to confirm the model prediction results.

      Strengths:

      This study used N2 protein of 44 selected A(H3N2) influenza A viruses isolated from 2009-2017 and generated corresponding panels of ferret and mouse sera to react with the selected strains. The amount of experimental data for N2 antigenicity characterization is large enough for model building.

      Weaknesses:

      The main weakness is that the strategy of selecting 43 A(H3N2) viruses from 2009-2017 was not explained. It is not clear if they represent the overall genetic diversity of human A(H3N2) viruses circulating during this time. In response to the reviewer's comment, the authors have provided a N2 phylogenetic tree using180 randomly selected N2 sequences from human A(H3N2) viruses from 2009-2017. While the 43 strains seems to scatter across the N2 tree, the four antigenic groups described by the author did not correlated with their respective phylogenic/ genetic groups as shown in Fig. 2. The authors should show the N2 phylogenic tree together with Fig. 2 and discuss the discrepancy observed.

      The discrepancies between the provided N2 phylogenetic tree using 180 selected N2 sequences was primarily due to visualization. In the tree presented in Figure 2 the phylogeny was ordered according to branch length in a decreasing way. Further, the tree represented in the rebuttal was built with PhyML 3.0 using JTT substitution model, while the tree in figure 2 was build in CLC Workbench 21.0.5 using Bishop-Friday substitution model. The tree below was built using the same methodology as Figure 2, including branch size ordering. No discrepancies are observed.

      Phylogenetic tree representing relatedness of N2 head domain. N2 NA sequences were ordered according to the branch length and phylogenetic clusters are colored as follows: G1: orange, G2: green, G3: blue, and G4: purple. NA sequences that were retained in the breadth panel are named according to the corresponding H3N2 influenza viruses. The other NA sequences are coded.

      Author response image 3.

      The second weakness is the use of double-immune ferret sera (post-infection plus immunization with recombinant NA protein) or mouse sera (immunized twice with recombinant NA protein) to characterize the antigenicity of the selected A(H3N2) viruses. Conventionally, NA antigenicity is characterized using ferret sera after a single infection. Repeated influenza exposure in ferrets has been shown to enhance antibody binding affinity and may affect the cross-reactivity to heterologous strains (PMID: 29672713). The increased cross-reactivity is supported by the NAI titers shown in Table S3, as many of the double immune ferret sera showed the highest reactivity not against its own homologous virus but to heterologous strains. In response to the reviewer's comment, the authors agreed the use of double-immune ferret sera may be a limitation of the study. It would be helpful if the authors can discuss the potential effect on the use of double-immune ferret sera in antigenicity characterization in the manuscript.

      Our study was designed to understand the breadth of the anti-NA response after the incorporation of NA as a vaccine antigens. Our data does not allow to conclude whether increased breadth of protection is merely due to increased antibody titers or whether an NA boost immunization was able to induce antibody responses against epitopes that were not previously recognized by primary response to infection. However, we now mention this possibility in the discussion and cite Kosikova et al. CID 2018, in this context.

      Another weakness is that the authors used the newly constructed a model to predict antigenic distance of three recent A(H3N2) viruses but there is no experimental data to validate their prediction (eg. if these viruses are indeed antigenically deviating from group 2 strains as concluded by the authors). In response to the comment, the authors have taken two strains out of the dataset and use them for validation. The results is shown as Fig. R7. However, it may be useful to include this in the main manuscript to support the validity of the model.

      The removal of 2 strains was performed to illustrate the predictive performance of the RF modeling. However, Random Forest does not require cross-validation. The reason is that RF modeling already uses an out-of-bag evaluation which, in short, consists of using only a fraction of the data for the creation of the decision trees (2/3 of the data), obviating the need for a set aside the test set:

      “…In each bootstrap training set, about one-third of the instances are left out. Therefore, the out-of-bag estimates are based on combining only about one- third as many classifiers as in the ongoing main combination. Since the error rate decreases as the number of combinations increases, the out-of-bag estimates will tend to overestimate the current error rate. To get unbiased out-of-bag estimates, it is necessary to run past the point where the test set error converges. But unlike cross-validation, where bias is present but its extent unknown, the out-of-bag estimates are unbiased…” from https://www.stat.berkeley.edu/%7Ebreiman/randomforest2001.pdf

      Reviewer #3 (Public Review):

      Summary:

      This paper by Portela Catani et al examines the antigenic relationships (measured using monotypic ferret and mouse sera) across a panel of N2 genes from the past 14 years, along with the underlying sequence differences and phylogenetic relationships. This is a highly significant topic given the recent increased appreciation of the importance of NA as a vaccine target, and the relative lack of information about NA antigenic evolution compared with what is known about HA. Thus, these data will be of interest to those studying the antigenic evolution of influenza viruses. The methods used are generally quite sound, though there are a few addressable concerns that limit the confidence with which conclusions can be drawn from the data/analyses.

      Strengths:

      • The significance of the work, and the (general) soundness of the methods. -Explicit comparison of results obtained with mouse and ferret sera

      Weaknesses:

      • Approach for assessing influence of individual polymorphisms on antigenicity does not account for potential effects of epistasis (this point is acknowledged by the authors).

      We agree with the reviewer and this point was addressed in the previous rebuttal.

      • Machine learning analyses neither experimentally validated nor shown to be better than simple, phylogenetic-based inference.

      We respectfully disagree with the reviewer. This point was addressed in the previous rebuttal as follows.

      This is a valid remark and indeed we have found a clear correlation between NAI cross reactivity and phylogenetic relatedness. However, besides achieving good prediction of the experimental data (as shown in Figure 5 and in FigureR7), machine Learning analysis has the potential to rank or indicate major antigenic divergences based on available sequences before it has consolidated as new clade. ML can also support the selection and design of broader reactive antigens. “

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      (1) Discuss the discrepancy between Fig. 2 and the newly constructed N2 phylogenetic tree with 180 randomly selected N2 sequences of A(H3N2) viruses from 2009-2017. Specifically please explain the antigenic vs. phylogenetic relationship observed in Fig. 2 was not observed in the large N2 phylogenetic tree.

      Discrepancies were due to different method and visualization. A new tree was provided.

      (2) Include a sentence to discuss the potential effect on the use of double-immune ferret sera in antigenic characterization.

      We prefer not to speculate on this.

      (3) Include the results of the exercise run (with the use of Swe17 and HK17) in the manuscript as a way to validate the model.

      The exercise was performed to illustrate predictive potential of the RF modeling to the reviewer. However, cross-validation is not a usual requirement for random forest, since it uses out-of-bag calculations. We prefer to not include the exercise runs within the main manuscript.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for The Authors):

      To hopefully contribute to more strongly support the conclusions of the manuscript, I am including a series of concerns regarding the experiments, as well as some recommendations that could be followed to address these issues:

      (1) The Q-nMT bundle is largely unaffected by the nocodazole treatment in most phases during its formation. However, cells were only treated with nocodazole for a very short period of time (15 min). Have the authors analyzed Q-nMT stability after longer nocodazole exposures? Is a similar treatment enough to depolymerize the mitotic spindle? This result could be further substantiated by treatment with other MT-depolymerizing agents. Furthermore, the dynamicity of the Q-nMT bundle could be ideally also assessed by other techniques, such as FRAP.

      The experiments suggested by the reviewer have been published in our previous paper (Laporte et al, JCB 2013). In this previous study, we presented data demonstrating the resistance of the Q-nMT bundle to several MT poisons: TBZ, benomyl, MBC (Sup Fig 2D) and to an increasing amount of nocodazole after a 90 min treatment (Sup Fig2E). These published figures are provided below.

      Author response image 1.

      The nMT array contains highly stable MTS. (A) Variation Of nuclear MT length in function Of time (second) in proliferating cells. Cells express GFP•Tubl (green) and Nup2•RFP (red). Bars, 2 pm. N = l, n is indicated. (B) Variation of the nMT array length in function of time measured for BirnlGFP—expressing cells In = 161, for 6-d•old Dad2GFP—expressing cells In = 171, for Stu2GFP—expressing cells (n = 17), and 6•d-old Nuf2• GFP—expressing cells (n = 17). Examples Of corresponding time lapse are shown. Time is in minutes experiments). Bar, 2 pm. (CJ Nuf2•GFP dots detected along nMT array (arrow) are immobile. Several time lapse images of cells are shown. Time is in minutes. gar, 2 pm _ MT organizations in proliferating cells and 4-d•old quiescent cells before and after a 90-min treatment With indicated drugs. Bar, 2 pm. (E) MT organizations in Sci-old quiescent cells before and after a 90min treatment With increasing concentrations Of nocodazole.

      In the same article, we showed that Q-nMT bundles resist a 3h nocodazole treatment, while all MT structures assembled in proliferating cells, including mitotic spindle, vanished (see Fig 2E below). In addition, in our previous article, FRAP experiments were provided in Fig 2D.

      Author response image 2.

      The nuclear array is composed of stable MTS. Variation of the length in function of time of (A) aMTs in proliferating cells, (B) nMT array in quiescent cells (7 d), and the two MT structures in early quiescent cells (4 d). White arrows point ot dynamic aMTs. In A—C, N = 2, n is indicated ID) FRAP on 7-d-old quiescent cells. White arrows point to bleach areas. Error bars are SEM. In A—D. time is in seconds. (E) nMT array is not affected by nocodazole treatment. Before and various times after carbon exhaustion (red dashed line), cells were incubated for 3 h with 22.5 pg/pL nocodozole and then imaged. The corresponding control experiment is shown in Fig I A. In all panels, cells expressing GFP-TtJbl (green) and Nup2-RFP (red) are shown; bars, 2 pm.

      This previous study was mentioned in the introduction and is now re-cited at the beginning of the results section (line 107-108).

      As expected from our previous study, when proliferating cells were treated with Noc (30 µg/ml) in the same conditions as in Fig1, most of the short and the long mitotic spindles vanished after a 15 min treatment as shown in the graph below.

      Author response image 3.

      Proliferating cells expressing NOf2=GFP and mTQZ-TUb1 (00—2) were treated or not With NOC (30vgfmI) for 15 min.% Of cells With detectable MT and representative cells are shown. Khi-teet values are indicated. Bar: 2 pm,

      (2) The graph in Figure 1B is somewhat confusing. Is the X-axis really displaying the length of the MTs as stated in the legend? If so, one would expect to see a displacement of the average MT length of the population as cells progress from phase II to phase III, as previously demonstrated in Figure 1A. Likewise, no data points would be anticipated for those phases in which the MT length is 0 or close to 0. Moreover, when the length of half pre-anaphase mitotic spindle was measured as a control, how can one get MT lengths that are equal or close to 0 in these cells? The length of the pre-anaphase spindle is between 2-4 um, so MT length values should range from 1 to 2 um if half the spindle is measured.

      The graph in Fig1B represents the fluorescence intensity (a proxy for the Q-nMT bundle thickness) along the Q-nMT bundle length.

      Fluorescence intensity is measured along a “virtual line” that starts 0,5 µm before the extremity of the QnMT bundle that is in contact with the SPB. In other words, we aligned all intensity measurements at the fluorescence increasing onset on the SPB side. We arbitrarily set the ‘zero’ at 0,5um before the fluorescence increased onset. That is why the fluorescence intensity is zero between 0 and 0,5 µm – The X-axis represents this virtual line, the 0 being set 0,5 µm before the Q-nMT bundle extremity on the SPB side. This virtual line allows us to standardize our “thickness” measurements for all Q-nMT bundles.

      Using this standardization, it is clear that the length of the Q-nMT bundles increased from phase II to III (see the red arrow). Yet, as in phase II, Q-nMT bundles are not yet stable, their lengths are shorter in phase II than in phase II after a Noc treatment (compare the end of the orange line and the end of the blue line in phase II).

      Author response image 4.

      This is now explained in details in the Material and Methods section (line 539-545).

      This is the same for the inset of Fig 1B and in Sup Fig 1A, in which we measured fluorescence intensity along the halfmitotic spindle just as we did for MT bundle. The X-axis represent a virtual line along the mitotic spindle, starting 0,5 µm before the SBP spindle extremity.

      Author response image 5.

      (3) Microtubules seem to locate next to or to extend beyond the nucleus in the control cells (DMSO) in Figure 1H. Since both nuclear MTs and cytoplasmic MTs emanate from the SPBs, it would have been desirable to display the morphology of the nucleus when possible. Moreover, since the nucleus is a tridimensional structure, it would also be advisable to image different Z-sections.

      Analysis demonstrating that Q-nMT bundles are located inside the nucleus have been provided in our previous paper (Laporte et al, JCB 2013). In this article most of the images are maximal projections of Z-stacks in which the nuclear envelope is visualized via Nup2-RFP (see Fig1 of Laporte et al, JCB 2013 as an example below).

      Author response image 6.

      MTsare organized as a nuclear array in quiescent cells. (A) MT reorganization upon quiescence entry. Cells expressing GFP-Tub1 (green) and Nup2RFP (red) are shown. Glucose exhaustion is indicated as a red dashed line. Quiescent cells dl expressing Tub I-RFP and either Spc72GFP,

      In Laporte et al, JCB 2013, we also provided EM analysis both in cryo and immune-gold (Fig 1E below).

      Author response image 7.

      (top) or coexpr;sse8 with Tub I-RFP (bottom). Arrows point dot along the nMT array. Bars: (A—C)) 2 pm. (E) AMT arroy visualized in WT cells by EMI Yellow arrows, MTS; red arrowheads, nuclear membrane; pink arrow, SPB. Insets: nMT cut transversally. Bar, 100 nm.

      (4) Movies depicting the process of Q-nMT bundle formation in live cells would have been really informative to more precisely evaluate the MT dynamics. Likewise, together with still images (Fig 1D and Supp. Fig. 1D), movies depicting the changes in the localization of Nuf2-GFP would have further facilitated the analysis of this process.

      In a new Sup Fig 1E, we now provide images of Q-nMT bundle formation initiation in phase I, in which it can be observed that Nuf2-GFP accompanies the growth of MT (mTQZ-TUB1) at the onset of Q-nMT bundle formation. Unfortunately, it is technically very challenging to follow the entire process of Q-nMT bundle formation in individual cells, as it takes > 48h. Indeed, for movies longer than 24h, on both microscope pads or specific microfluidic devices (Jacquel, et al, eLife 2021), phototoxicity and oxygen availability become problematic and affect cells’ viability.

      (5) Western blot images displaying the relative protein levels for mTQZ-Tub1 and of the ADH2 promoter-driven mRuby-Tub1 at the different time points should be included to more strongly support the conclusion that new tubulin molecules are introduced in the Q-nMT bundle only after phase I. It is worth noting, in this sense, that the percentage of cells with 2 colors Q-nMT bundle is analyzed only 1 hour after expression of mRuby-Tub1 was induced for phase I cells, but after 24 hours for phase II cells.<br /> We have modified Fig 1F and now provide images of cells after 3, 6 and 24h after glucose exhaustion and the corresponding percentage of cells displaying Q-nMT bundle with the two colors. We also now provide a western blot in Sup Fig 1H using specific antibodies against mTQZ (anti-GFP) and mRuby (anti-RFP).

      (6) In order to demonstrate that Q-nMT formation is an active process induced by a transient signal and that the Q-nMT bundle is required for cell survival, the authors treated cells with nocodazole for 24 h (Fig 1H and Supp Fig 1K). Both events, however, could be associated with the toxic effects of the extremely prolonged nocodazole treatment leading to cell death.

      We have treated 5 days old cells for 24h with 30 µg/ml Noc. We then washed the drug and transferred the cells into a glucose free medium. We then followed both cell survival, using methylene blue, and the cell’s capacity to form a colony after refeeding. In these conditions, we did not observe any toxic effect of the nocodazole. This result is now provided in Sup Fig 1L and discussed line 172-176.

      (7) The "Tub1-only" mutant displays shorter but stable Q-nMT bundles in phase II, although they are thinner than in wild-type cells. What happens in the "Tub3-only" mutant, which also has beta-tubulin levels similar to wild-type cells (Supp. Fig. 2B)?

      In order to measure Q-nMT bundle length and thickness, we used Tub1 fused to GFP. This cannot be done in a Tub3-only mutant. Yet, we have measured Q-nMT bundle length in Tub3-only cells using Bim1-3GFP as a MT marker (as in Laporte et al, JCB 2013). As shown in the figure below, Q-nMT bundles were shorter in Tub3-only cells than in WT cells whatever the phase.

      Author response image 8.

      We do not know if this effect is directly linked to the absence of Tub1 or if it is very indirect and for example due to the fact that Tub1 and Tub3 interact differently with Bim1 or other proteins that are involved in Q-nMT bundle stabilization. As we cannot give a clear interpretation for that result, we decided not to present those data in our manuscript.

      (8) Why were wild-type and ndc80-1 cells imaged after a 20 min nocodazole treatment to evaluate the role of KT-MT attachments in Q-nMT bundle formation (Fig 3A)? Importantly, this experiment is also missing a control in which Q-nMT length is analyzed in both wild-type and ndc80-1 cells at 25ºC instead of 37ºC.

      In this experiment, we used nocodazole to test both the formation and the stability of the Q-nMT bundle. Fig 3A shows MT length distribution in WT (grey) and ndc80-1 (violet) cells expressing mTQZTub1 (green) and Nuf2-GFP (red), shifted to 37 °C at the onset of glucose exhaustion and kept at this non-permissive temperature for 12 or 96 h then treated with Noc. The control experiment was provided in Sup Fig 3B. Indeed, this figure shows MT length in WT (grey) and ndc80-1 (violet) expressing mTQZ-Tub1 (green) and Nuf2-GFP (red) grown for 4 d (96h) at 25 °C, and treated or not with Noc. This is now indicated in the text line 216 and in the figure legend line 976

      Author response image 9.

      (9) As a general comment linked to the previous concern, it is striking that in many instances, Q-nMT bundle length is measured after nocodazole treatment without any evident reason to do this and without displaying the results in untreated cells as a control. If nocodazole is used, the authors should explicitly indicate it and state the reason for it.

      We provide control experiments without nocodazole for all of the figures. For the sake of figure clarity, for Fig.3A the control without the drug is in Sup. Fig. 3B, for Fig. 3B it is shown in Sup. Fig. 3D, for Fig. 4B, it is shown in Sup. Fig 4A. This is now stated in the text and in the figure legend: for Fig. 3A: line 216 and in the figure legend line 976; for Fig. 3B: line 222 and figure legend line 984; for Fig. 4B: line 280 and in the figure legend line 1017.

      The only figures where the untreated cells are not shown is for Fig 1D since the goal of the experiment is to make dynamic MTs shorten.

      In Fig. 5C and Sup. Fig. 5D to F, we used nocodazole to get rid of dynamic cytoplasmic MTs that form upon quiescence exit in order to facilitate Q-nMT bundle measurement. This was explained in our previous study (Laporte et al, JCB 2013). We now mention it in the figure legends, see for example Fig. 5 legend line 1054.

      (10) Ipl1 inactivation using the ipl1-1 thermosensitive allele impedes Q-nMT bundle formation. The inhibitor-sensitive ipl1-as1 allele could have been further used to show whether this depends on its kinase activity, also avoiding the need to increase the temperature, which affects MT dynamics. As suggested, we have used the ipl1-5as allele. We have thus modified Fig 3B and now show that is it indeed the Ipl1 kinase activity that is required for Q-nMT bundle formation initiation (line 222). In any case, it is surprising that deletion of SLI15 does not affect Q-nMT formation (in fact, MT length is even larger), despite the fact that Sli15, which localizes and activates Ipl1, is present at the Q-nMT (Fig 3C). Likewise, deletion of BIR1 has barely any effect on MT length after 4 days in quiescence (Fig 3D). Do the previous observations mean that Ipl1 role is CPC-independent? Does the lack of Sli15 or Bir1 aggravate the defect in Q-nMT formation of ipl1-1 cells at non-permissive or semi-permissive temperature?

      Thanks to the Reviewer’s comments, we have re-checked our sli15Δ strain and found that it was accumulating suppressors very rapidly. To circumvent this problem, we utilized the previously described sli15-3 strain (Kim et al, JCB 1999). We found that sli15-3 was synthetic lethal with both ipl1-1, ipl1-2 (as described in Kim et al, JCB 1999) and with ipl1-as5, preventing us from addressing the CPC dependence of the Ipl1 effect asked by the Reviewer. However, using the sli15-3 strain, we now show that inactivation of Sli15 upon glucose exhaustion does prevent Q-nMT bundle formation (See new Sup Fig 3F and the text line 226-227).

      (11) Lack of both Bir1 and Bim1 act in a synergistic way with regard to the defect in Q-nMT bundle formation. Although the absence of both Sli15 and Bim1 is proposed to lead to a similar defect, this is not sustained by the data provided, particularly in the absence of nocodazole treatment (Supp. Fig 3E).

      Deletion of bir1 alone has only a subtle effect on Q-nMT bundle length in the absence of Noc, yet in bir1Δ cells, Q-nMT bundles are sensitive to Noc. Deletion of BIM1 (bim1Δ) aggravates this phenotype (Fig. 3D). As mentioned above, Q-nMT bundle formation is impaired in sli15-3 cells. In our hands, and as expected from (Zimnaik et al, Cur Biol 2012), this allele is synthetic lethal with bim1Δ.

      On the other hand, the simultaneous lack of Bir1 and Bim1 drastically reduces the viability of cells in quiescence and this is proposed to be evidence supporting that KT-MT attachments are critical for QnMT bundle assembly (Supp Fig 3G). However, similarly to what was indicated previously for the 24 h nocodazole treatment, here again, the lack of viability could be originated by other reasons that are associated with the lack of Bir1 and Bim1 and not necessarily with problems in Q-nMT formation. In fact, the viability defect of cells lacking Bir1 and Bim1 is similar to that of cells only lacking Bir1 (Supp Fig 3G).

      We have previously shown that many mutants impaired for Q-nMT bundle formation (dyn1Δ, nip100Δ etc) have a reduced viability in quiescence (Laporte et al, JCB 2013). In the current study, a very strong phenotype is observed for other mutants impaired for Q-nMT bundle formation such as bim1Δ bir1Δ cells, but also for slk19Δ bim1Δ.

      Importantly, as shown in the new Sup Fig 1L, in WT cells treated with Noc upon entry into quiescence, a treatment that prevents Q-nMT formation, showed a reduced viability, while a Noc treatment that does not affect Q-nMT bundle formation, i.e. a treatment in late quiescence, has no effect on cell survival. This solid set of data point to a clear correlation between the ability of cells to assemble a Q-nMT bundle and their ability to survive in quiescence. Yet, of course, we cannot formally exclude that in all these mutants, the reduction of cell viability in quiescence is due to another reason.

      (12) Both Mam1 and Spo13 are, to my knowledge, meiosis-specific proteins. It is therefore surprising that mutants in these proteins have an effect on MT bundle formation (Fig 3G-H, Supp. Fig. 3G). Are Mam1 and Spo13 also expressed during quiescence? Transcription of MAM1 or SPO13 does not seem to be induced by glucose depletion in previously published microarray experiments, but if Mam1 are Spo13 are expressed in quiescent cells, the authors should show this together with their results.<br /> Indeed, it is interesting to notice that Mam1 and Spo13 are involved in both meiosis and Q-nMT bundle formation. As suggested by the Reviewer we have performed western blots in order to address the expression of those proteins in proliferation and quiescence (4d). We tagged Spo13 with either GFP, HA or Myc but none of the fusion proteins were functional. Yet, as shown in the new Sup Fig 3I, Mam1-GFP, Csm1-GFP and Lsr4-GFP were expressed both in proliferation and quiescence.

      (13) In the laser ablation experiments that demonstrate that KT-MT attachments are not needed in order to maintain Q-nMT bundles once formed, anaphase spindles of proliferating cells were cut as a control (Supp. Fig 3I). However, late anaphase cells have already segregated the chromosomes, which lie next to the SPBs (this can be evidenced by looking at Dad2-GFP localization in Supp. Fig 3I), so that only interpolar MTs are severed in these experiments. The authors should have instead used metaphase cells as a control, since chromosomes are maintained at the spindle midzone and the length and width of the metaphase spindle is more similar to that of the Q-nMT bundle.

      We have tried to “cut” short metaphase spindles, but as they are < 1 µm, after the laser pulse, it is difficult to verify that spindles are indeed cut and not solely “bleached”. Furthermore, after the cut, the remaining MT structure that is detectable is very short, and we are not confident in our length measurements. Yet, this type of experiment has been done in S. pombe (Khodjakov et al, Cur Biol 2004 and Zareiesfandabadi et al, Biophys. J. 2022). In these articles the authors have demonstrated that after a cut, metaphase spindles are unstable and rapidly shrink through the action of Kinesin14 and dynein. This is now mentioned in the text line 265.

      (14) In the experiment that shows that cycloheximide prevents Q-nMT disassembly after quiescence exit, and therefore that this process requires de novo protein synthesis (Fig. 5A), cells are indicated to express only Spc42-RFP and Nuf2-GFP. However, Stu2-GFP images are also shown next to the graph and, according to the figure legend, it was indeed Stu2-GFP that was used to measure individual QnMT bundles in cells treated with cycloheximide. In the graph, additionally, time t=0 represents the onset of MT bundle depolymerization, but Q-nMT bundle disassembly does not take place after cycloheximide treatment. The authors should clarify these aspects of the experiment.

      Following the Reviewer’s suggestion, to clarify these aspects we have split Fig. 5A into 2 panels.

      Finally, some minor issues are:

      (1) The text should be checked for proper spelling and grammar.

      We have done our best.

      (2) In some instances, there is no indication of how many cells were imaged and analyzed.

      We now provide all these details either in the figure itself or in the figure legend.

      (3) Besides the Q-nMT bundle, it is sometimes noticeable an additional strong cytoplasmic fluorescent signal in cells that express mTQZ-Tub1 and/or mRuby-Tub1 (e.g., Figs 1F, 1H and, particularly, Supp Fig 1H). What is the nature of these cytoplasmic MT structures?

      We did mention this observation in the material and methods section (see line 526-528). This signal is a background fluorescence signal detected with our long pass GFP filter. It is not GFP as it is “yellowish” when we view it via the microscope oculars. This background signal can also be observed in quiescent WT cells that do not express any GFP. We do not know what molecule could be at the origin of that signal but it may be derivative of an adenylic metabolite that accumulates in quiescence and could be fluorescent in the 550nm –ish wavelength, but this is pure speculation.

      (4) It is remarkable that a 20-30% decrease in tubulin levels had such a strong impact on the assembly of the Q-nMT bundle (Supp. Fig. 2). Can this phenotype be recovered by increasing the amount of tubulin in the mutants impaired for tubulin folding?

      Yes, this is astonishing, but we believe our data are very solid since we observed that with both tub3Δ and in all the tubulin folding mutants we have tested (See Sup. Fig. 2). To answer Reviewer’s question, we would need to increase the amount of properly folded tubulin, in a tubulin folding mutant. One way to try to do that would be to find suppressors of GIM mutations, but this is a lengthy process that we feel would not add much strength to this conclusion.

      (5) The graphs displaying the length of the Q-nMT bundle in several mutants in microtubule motors throughout a time course are presented in a different manner than in previous experiments, with data points for individual cells being only shown for the most extreme values (Fig 4C, 4H). It would be advisable, for the sake of comparison, to unify the way to represent the data.

      We have now unified the way we present our figures.

      (6) How was the exit from quiescence established in the experiments evaluating Q-nMT disassembly? How synchronous is quiescence exit in the whole population of cells once they are transferred to a rich medium?

      We set the “zero” time upon cell refeeding with new medium. In fact, quiescence exit is NOT synchronous. We have reported this in previous publications, with the best description of this phenomena being in Laporte et al, MIC 2017 . <br /> The figures below are the same data but on the left graph, the kinetic is aligned upon SPB separation onset, while on the right graph (Fig 5A), it is aligned on MT shrinking onset.

      Author response image 10.

      We can add this piece of data in a Sup Figure if the Reviewer believes it is important.

      Reviewer #2 (Recommendations For The Authors):

      General:

      • In general, more precise language that accurately describes the experiments would improve the text. <br /> We have tried to do our best to improve the text.

      • The authors should clearly define what they mean by an active process and provide context to support this statement regarding the Q-nMT.

      We have strived to clarify this point in the text (see paragraph form line 146 to 178).

      • It is reasonable to assume that structures composed of microtubules are dynamic during the assembly process. The authors should clarify what they mean by "stable by default i.e., intrinsically stable." Do they mean that when Q-nMT assembly starts, it will proceed to completion regardless of a change in condition?

      We mean that in phase I the Q-nMT bundle is stabilized as it grows and that stabilization is concomitant with polymerization. By contrast, MTs polymerized during phase II are not stabilized upon elongation beyond the phase I polymer, and get stabilized later, in a separate phase (i.e. in phase III). We hope to have clarified this point in the text (see line 108-110).

      • In lines 33-34, the authors claim that the Q-nMT bundle functions as a "sort of checkpoint for cell cycle resumption." This wording is imprecise, and more significantly the authors do not provide evidence supporting a direct role for Q-nMT in a quiescence checkpoint that inhibits re-entry into the cell cycle.

      We have softened and clarified the text in the abstract (see line 29-30)., in the introduction (line 101104), in the result section (line 331-332) and in the discussion (line 426-430).

      • Many statements are qualitative and subjective. Quantitative statements supported by the results should be used where possible, and if not possible restated or removed.

      We provide statistical data analysis for all the figures.

      • The number of hours after glucose exhaustion used for each phase varies between assays. This is likely a logistical issue but should be explained.

      This is indeed a logistical issue and when pertinent, it is explained in the text.

      • It would be interesting to address how this process occurs in diploids. Do they form a Q-nMT? How does this relate to the decision to enter meiosis?

      Diploid cells enter meiosis when they are starved for nitrogen. Upon glucose exhaustion diploids do form a Q-nMT bundle. This is shown and measured in the new Sup Fig1C. In fact, in diploids, Q-nMT bundles are thicker than in haploid cells.

      • It would be interesting to address how the timescale of this process compares to the types of nutrient stress yeast would be exposed to in the environment.

      We have transferred proliferating yeast cells to water, to try to mimic what could happen when yeast cells face rain in the wild. As shown below, they do form a Q-nMT bundle that becomes nocodazole resistant after 30h. This data is now provided in the new Sup Fig 1D.

      • It is recommended that the authors use FRAP experiments to directly measure the stability of the QnMT bundles.

      This experiment was published in (Laporte et al, 2013). Please see response to Reviewer #1.

      • In many cases, the description of the experimental methods lacks sufficient detail to evaluate the approach or for independent verification of results.

      We have strived to provide a more detailed material and methods section, as well as more detailed figure legends and statistical informations.

      Specific comments on figures:

      • In Figure 1 c), what do the polygons represent? They do not contain all the points of the associated colour.

      The polygon represented the area of distribution of 90% of the data points. As they did not significantly add to the data presentation they have been removed.

      • In Figure 2 a), is the use of two different sets of markers to control for the effect of the markers on microtubule dynamics?

      Yes, we are always concerned about the influence of GFP on our results, so very often we replicate our experiments with different fluorescent proteins or even with different proteins tagged with GFP. This is now mentioned in the text (line 184-186).

      • Is it accurate to say (line 201, figure 3 a)) that no Q-nMT bundles were detected in ndc80-1 cells shifted to 37 degrees, or are they just shorter?

      As shown in Fig 3A, in ndc80-1 cells, most of the MT structures that we measured are below 0,5um. This has been re-phrased in the text (line 214-215).

      • Lines 265-269, figure 4 b), how can the phenotype observed in cin8∆ cells be explained given the low abundance of Cin8 that is detected in quiescent cells?

      Faint fluorescence signal is not synonymous of an absence of function. As shown in Sup Fig 4B, we do detect Cin8-GFP in quiescent cells.

      • Quantification is needed in Figure 4 panels c) and h).

      Fig 4C and 4H have been changed and quantification are provided in the figure legend.

      Reviewer #3 (Recommendations For The Authors):

      A few points should be addressed for clarity:

      (1) Sup. Fig. 1K: are only viable cells used for the colony-forming assay? How were these selected? If not, the assay would just measure survival (as in the viability assay).

      Yes, only viable cells were selected for the colony forming assay. We used methylene blue to stain dead cells. Then, we used a micromanipulation instrument (Singer Spore Play) that is commonly used for tetrad dissection to select “non blue cells” and position them on a plate (as we do with spores). Each micromanipulated cell is then allowed to grow on the plate and we count colonies (see picture in Sup Fig 1L right panel). This was described in Laporte et al, JCB 2011. We have added that piece of information in the legend (line 1129-1130) and in the M&M section (line 580-586).

      (2) Could Tub3 have a role in phase I? It is not clear why the authors conclude involvement only in phase II.

      As it can be seen in Fig 2D, MT bundle length and thickness are quite similar in WT and Tub1-only cells in phase I, indicating that the absence of Tub3 as no effect in phase I. In Tub1-only cells, MT bundles are thinner in both phase II and phase III, yet, they get fully stabilized in phase III. Thus, the effect of Tub3 is largely specific to the nucleation/elongation of phase II MTs. We hope to have clarified that point in the text (line 203-207).

      (3) Quantifications, statistics: for all quantifications, the authors should clearly state the number of experiments (replicates), and number of cells used in each, and what number was used for statistics. For all quantifications in cells, it seems that the values from the total number of cells across different experiments were plotted and used for statistics. This is not very useful and results in extremely small p values. I assume that the values for individual cells were obtained from multiple, independent experiments. Unless there are technical limitations that allow only a very small sample size (not the case here for most experiments), for experiments involving treatments the authors should determine values for each experiment and show statistics for comparison between experiments rather than individual cells pooled from multiple experiments.

      All the experiments have been done at least in replicate. In the new Fig. 1A, we now display each independent experiment with a specific color code. For Fig 2B and 2C we now provide the data obtained for each separate experiment in Sup Fig 2C. Additional details about quantifications and statistics are provided in the M&M section or in the specific figure legends.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1:

      I am satisfied with all clarifications and additional analyses performed by the authors. 

      The only concern I have is about changes in running after [AM+VM] mismatches. 

      The authors reported that they "found no evidence of a change in running speed or pupil diameter following [AM + VM] mismatch (Figures S5A)" (line 197). 

      Nevertheless, it seems that there is a clear increase in running speed for the [AM+VM] condition (S5A). Could this be more specifically quantified? I am concerned that part of the [AM+VM] could stem from this change in running behavior. Could one factor out the running contribution? 

      Please excuse, this was unintentionally omitted. We have added the quantification to Table S1 and included the results of the significance test in (Fig S2A, Fig S4A and Fig S5A). The increase in running speed upon MM presentation (0.5 – 1 s), compared to the baseline running speed in the time window preceding MM presentation (-0.5 – 0 s), was not significant in any of the tested conditions.

      In the process of adding the statistics, we noticed an unfortunate inconsistency in our figures that relates to Figure S5A. The data shown in all other Figures is aligned to the onset of audiomotor mismatch. In Figure S5A, however, the data were aligned to the onset of the visuomotor mismatch. As there is a differential delay in the closed loop coupling of auditory and visual feedback of approximately 170 ms (as described in the methods), visuomotor mismatch onset is slightly before audiomotor mismatch onset. We have corrected this now in the manuscript but have done the statistical analysis for both old and new versions of the figure. In neither case do we find evidence of a running speed response.

      The authors thoroughly addressed the concerns raised. In my opinion, this has substantially strengthened the manuscript, enabling much clearer interpretation of the results reported. I commend the authors for the response to review. Overall, I find the experiments elegantly designed, and the results robust, providing compelling evidence for non-hierarchical interactions across neocortical areas and more specifically for the exchange of sensorimotor prediction error signals across modalities. 

      We are happy to hear!

      Reviewer #2:

      The incorporation of the analysis of the animal's running speed and the pupil size upon sound interruption improves the interpretation of the data. The authors can now conclude that responses to the mismatch are not due to behavioral effects. 

      The issue of the relationship between mismatch responses and offset responses remains uncommented. The auditory system is sensitive to transitions, also to silence. See the work of the Linden or the Barkat labs (including the work of the first author of this manuscript) on offset responses, and also that of the Mesgarani lab (Khalighinejad et al., 2019) on responses to transitions 'to clean' (Figure 1c) in human auditory cortex. Offset responses, as the first author knows well, are modulated by intensity and stimulus length (after adaptation?). That responses to the interruption of the sound are similar in quality, if not quantity, in the closed and open loop conditions suggest that offset response might modulate the mismatch response. A mismatch response that reflects a break in predictability would presumably be less modulated by the exact details of the sensory input than an offset response. Therefore, what is the relationship between the mismatch response and the mean sound amplitude prior to the sound interruption (for example during the preceding 1 second)? And between the mismatch response and the mean firing rate over the same period? 

      Finally, how do visual stimuli modulate sound responses in the absence of a mismatch? Is the multimodal response potentiation specific to a mismatch?

      There are probably two points important to clarify before answering the question – just to make sure there is no semantic misunderstanding. 

      (1) In the jargon of predictive processing, a prediction error is a deviation from a predictable relationship. This can be sensorimotor coupling (as in audio- and visuomotor mismatch), stimulus history (as in oddball, or sound offset responses), surround sensory input (as in endstopping response and center-surround effects in visual processing), etc. A sound offset perceived by an animal in an open loop condition is thus a negative prediction error based on stimulus history (this assumes the animal has no way to predict the time of offset – as is the case in our experiments). We are primarily interested in our work here in characterizing negative prediction errors that result from motor-related predictions – hence the comparison we use is unpredictable sound offset in closed-loop coupling vs. unpredictable sound offset in open-loop coupling. The first is a mixture of an audiomotor prediction error and a stimulus history prediction error. The second is just a stimulus history prediction error. Thus, we compare the two types of responses to isolate the component that can only be attributed to audiomotor prediction errors. 

      (2) Audiomotor mismatch responses can of course be explained in a large variety of ways. For example, one could consider a sound offset a sensory stimulus. One could further assume that locomotion increases sensory responses. If so, one could explain audiomotor mismatch responses as a locomotion related gain of a sensory offset response. However, we need to further postulate that this locomotion related gain is stimulus specific, as for sound onset responses there is no detectable difference between locomotion and sitting. Thus, we are left with a model that explains audiomotor mismatch responses as a “stimulus specific locomotion gain of sensory responses”. This is correct – it is just not very satisfying, has no computational basis, and makes no useful predictions (see e.g. https://pubmed.ncbi.nlm.nih.gov/36821437/ for an extended treatise of exactly this point for visuomotor mismatch responses).

      That responses to the interruption of the sound are similar in quality, if not quantity, in the closed and open loop conditions suggest that offset response might modulate the mismatch response.

      Conceptually both a “sound offset” and an “audiomotor mismatch” are negative prediction errors. Could one describe the effect we see as an audiomotor mismatch modulating a sound offset? Certainly. But if the reviewer means modulate in the sense of neuromodulatory – we are not aware of a neuromodulatory responses that would be fast enough (or be strong enough to have these effects – we have looked into ACh, NA, and Ser (unpublished – no MM response)). Alternatively, they could simply add linearly (as predictive processing would predict). Given that AM mismatch responses are likely computed in auditory cortex, we see no reason to speculate that anything more complicated is happening than a linear summation of different prediction error responses. 

      A mismatch response that reflects a break in predictability would presumably be less modulated by the exact details of the sensory input than an offset response. Therefore, what is the relationship between the mismatch response and the mean sound amplitude prior to the sound interruption (for example during the preceding 1 second)? And between the mismatch response and the mean firing rate over the same period? 

      The reviewer’s intuition here – that mismatch responses have a lower resolution than what one thinks of as sensory responses (or sound offset responses) – is probably not warranted. Experiments that quantify the resolution of mismatch responses are relatively data intense – and to the best of our knowledge this has only been done once in the visual system for visuomotor mismatch responses (Zmarz and Keller, 2016). Here we found that visuomotor mismatch responses exhibited matched spatial (in visual space) resolution to that of visual responses. 

      Regarding the suggested analyses: In a closed loop session, the sound amplitude preceding the mismatch is directly related to the running speed of the mouse. In visual cortex, the amplitude of visuomotor mismatch responses linearly scales with running speed (and consequently visual flow speed) prior to the mismatch – as predicted by predictive processing. See e.g. figure 4B in (Zmarz and Keller, 2016). We have tried this analysis for audiomotor mismatches in the previous round of reviews, but we fear we do not have sufficient data to address this question properly. If we look at how mismatch responses change as a function of locomotion speed (sound amplitude) across the entire population of neurons, we have no evidence of a systematic change (and the effects are highly variable as a function of speed bins we choose). However, just looking at the most audiomotor mismatch responsive neurons, we find a trend for increased responses with increasing running speed (Author response image 1). We analyzed the top 5% of cells that showed the strongest response to mismatch (MM) and divided the MM trials into three groups based on running speed: slow (10-20 cm/s), middle (20-30 cm/s), and fast (>30 cm/s). Given the fact that we have on average 14 mismatch events in total per neuron, the analysis when split by running speed is under-powered.  

      Author response image 1.

      The average response of strongest AM MM responders to AM mismatches as a function of running speed (data are from 51 cells, 11 fields of view, 6 mice).

      Regarding the relationship between mismatch response and firing rate prior to mismatch, we are not sure we understand the intuition. Does the reviewer mean, the average firing rate of the mismatch neuron? Or the population mean? The first is likely uninterpretable as it is bound to be confounded by regression to the mean type artefacts. But in either case, we would have no prediction of what to expect.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1:

      Comment:

      The authors quantified information in gesture and speech, and investigated the neural processing of speech and gestures in pMTG and LIFG, depending on their informational content, in 8 different time-windows, and using three different methods (EEG, HD-tDCS and TMS). They found that there is a time-sensitive and staged progression of neural engagement that is correlated with the informational content of the signal (speech/gesture).

      Strengths:

      A strength of the paper is that the authors attempted to combine three different methods to investigate speech-gesture processing.

      We sincerely appreciate the reviewer’s recognition of our efforts in employing a multi-method approach, which integrates three complementary experimental paradigms, each leveraging distinct neurophysiological techniques to provide converging evidence.

      In Experiment 1, we found that the degree of inhibition in the pMTG and LIFG was strongly associated with the overlap in gesture-speech representations, as quantified by mutual information. Experiment 2 revealed the time-sensitive dynamics of the pMTG-LIFG circuit in processing both unisensory (gesture or speech) and multisensory information. Experiment 3, utilizing high-temporal-resolution EEG, independently replicated the temporal dynamics of gesture-speech integration observed in Experiment 2, further validating our findings.

      The striking convergence across these methodologically independent approaches significantly bolsters the robustness and generalizability of our conclusions regarding the neural mechanisms underlying multisensory integration.

      Comment 1: I thank the authors for their careful responses to my comments. However, I remain not convinced by their argumentation regarding the specificity of their spatial targeting and the time-windows that they used.

      The authors write that since they included a sham TMS condition, that the TMS selectively disrupted the IFG-pMTG interaction during specific time windows of the task related to gesture-speech semantic congruency. This to me does not show anything about the specificity of the time-windows itself, nor the selectivity of targeting in the TMS condition.

      (1) Selection of brain regions (IFG/pMTG)

      We thank the reviewer for their thoughtful consideration. The choice of the left IFG and pMTG as regions of interest (ROIs) was informed by a meta-analysis of fMRI studies on gesture-speech integration, which consistently identified these regions as critical hubs (see Author response table 1 for detailed studies and coordinates).

      Author response table 1.

      Meta-analysis of previous studies on gesture-speech integration.

      Based on the meta-analysis of previous studies, we selected the IFG and pMTG as ROIs for gesture-speech integration. The rationale for selecting these brain regions is outlined in the introduction in Lines 63-66: “Empirical studies have investigated the semantic integration between gesture and speech by manipulating their semantic relationship[15-18] and revealed a mutual interaction between them19-21 as reflected by the N400 latency and amplitude14 as well as common neural underpinnings in the left inferior frontal gyrus (IFG) and posterior middle temporal gyrus (pMTG)[15,22,23].”

      And further described in Lines 77-78: “Experiment 1 employed high-definition transcranial direct current stimulation (HD-tDCS) to administer Anodal, Cathodal and Sham stimulation to either the IFG or the pMTG”. And Lines 85-88: ‘Given the differential involvement of the IFG and pMTG in gesture-speech integration, shaped by top-down gesture predictions and bottom-up speech processing [23], Experiment 2 was designed to assess whether the activity of these regions was associated with relevant informational matrices”.

      In the Methods section, we clarified the selection of coordinates in Lines 194-200: “Building on a meta-analysis of prior fMRI studies examining gesture-speech integration[22], we targeted Montreal Neurological Institute (MNI) coordinates for the left IFG at (-62, 16, 22) and the pMTG at (-50, -56, 10). In the stimulation protocol for HD-tDCS, the IFG was targeted using electrode F7 as the optimal cortical projection site[36], with four return electrodes placed at AF7, FC5, F9, and FT9. For the pMTG, TP7 was selected as the cortical projection site[36], with return electrodes positioned at C5, P5, T9, and P9.”

      The selection of IFG or pMTG as integration hubs for gesture and speech has also been validated in our previous studies. Specifically, Zhao et al. (2018, J. Neurosci) applied TMS to both areas. Results demonstrated that disrupting neural activity in the IFG or pMTG via TMS selectively impaired the semantic congruency effect (reaction time costs due to semantic incongruence), while leaving the gender congruency effect unaffected.

      These findings identified the IFG and pMTG as crucial hubs for gesture-speech integration, guiding the selection of brain regions for our subsequent studies.

      (2) Selection of time windows

      The five key time windows (TWs) analyzed in this study were derived from our previous TMS work (Zhao et al., 2021, J. Neurosci), where we segmented the gesture-speech integration period (0–320 ms post-speech onset) into eight 40-ms windows. This interval aligns with established literature on gesture-speech integration, particularly the 200–300 ms window noted by the reviewer. As detailed in Lines (776-779): “Procedure of Experiment 2. Eight time windows (TWs, duration = 40 ms) were segmented in relative to the speech IP. Among the eight TWs, five (TW1, TW2, TW3, TW6, and TW7) were chosen based on the significant results in our prior study[23]. Double-pulse TMS was delivered over each of the TW of either the pMTG or the IFG”.

      In our prior work (Zhao et al., 2021, J. Neurosci), we employed a carefully controlled experimental design incorporating two key factors: (1) gesture-speech semantic congruency (serving as our primary measure of integration) and (2) gesture-speech gender congruency (implemented as a matched control factor). Using a time-locked, double-pulse TMS protocol, we systematically targeted each of the eight predefined time windows (TWs) within the left IFG, left pMTG, or vertex (serving as a sham control condition). Our results demonstrated that a TW-selective disruption of gesture-speech integration, indexed by the semantic congruency effect (i.e., a cost of reaction time because of semantic conflict), when stimulating the left pMTG in TW1, TW2, and TW7 but when stimulating the left IFG in TW3 and TW6. Crucially, no significant effects were observed during either sham stimulation or the controlled gender congruency factor (Figure 3 from Zhao et al., 2021, J. Neurosci).

      This triple dissociation - showing effects only for semantic integration, only in active stimulation, and only at specific time points - provides compelling causal evidence that IFG-pMTG connectivity plays a temporally precise role in gesture-speech integration.

      Noted that this work has undergone rigorous peer review by two independent experts who both endorsed our methodological approach. Their original evaluations, provided below:

      Reviewer 1: “significance: Using chronometric TMS-stimulation the data of this experiment suggests a feedforward information flow from left pMTG to left IFG followed by an information flow from left IFG back to the left pMTG.  The study is the first to provide causal evidence for the temporal dynamics of the left pMTG and left IFG found during gesture-speech integration.”

      Reviewer 2: “Beyond the new results the manuscript provides regarding the chronometrical interaction of the left inferior frontal gyrus and middle temporal gyrus in gesture-speech interaction, the study more basically shows the possibility of unfolding temporal stages of cognitive processing within domain-specific cortical networks using short-time interval double-pulse TMS. Although this method also has its limitations, a careful study planning as shown here and an appropiate discussion of the results can provide unique insights into cognitive processing.”

      References:

      Willems, R.M., Ozyurek, A., and Hagoort, P. (2009). Differential roles for left inferior frontal and superior temporal cortex in multimodal integration of action and language. Neuroimage 47, 1992-2004. 10.1016/j.neuroimage.2009.05.066.

      Drijvers, L., Jensen, O., and Spaak, E. (2021). Rapid invisible frequency tagging reveals nonlinear integration of auditory and visual information. Human Brain Mapping 42, 1138-1152. 10.1002/hbm.25282.

      Drijvers, L., and Ozyurek, A. (2018). Native language status of the listener modulates the neural integration of speech and iconic gestures in clear and adverse listening conditions. Brain and Language 177, 7-17. 10.1016/j.bandl.2018.01.003.

      Drijvers, L., van der Plas, M., Ozyurek, A., and Jensen, O. (2019). Native and non-native listeners show similar yet distinct oscillatory dynamics when using gestures to access speech in noise. Neuroimage 194, 55-67. 10.1016/j.neuroimage.2019.03.032.

      Holle, H., and Gunter, T.C. (2007). The role of iconic gestures in speech disambiguation: ERP evidence. J Cognitive Neurosci 19, 1175-1192. 10.1162/jocn.2007.19.7.1175.

      Kita, S., and Ozyurek, A. (2003). What does cross-linguistic variation in semantic coordination of speech and gesture reveal?: Evidence for an interface representation of spatial thinking and speaking. J Mem Lang 48, 16-32. 10.1016/S0749-596x(02)00505-3.

      Bernardis, P., and Gentilucci, M. (2006). Speech and gesture share the same communication system. Neuropsychologia 44, 178-190. 10.1016/j.neuropsychologia.2005.05.007.

      Zhao, W.Y., Riggs, K., Schindler, I., and Holle, H. (2018). Transcranial magnetic stimulation over left inferior frontal and posterior temporal cortex disrupts gesture-speech integration. Journal of Neuroscience 38, 1891-1900. 10.1523/Jneurosci.1748-17.2017.

      Zhao, W., Li, Y., and Du, Y. (2021). TMS reveals dynamic interaction between inferior frontal gyrus and posterior middle temporal gyrus in gesture-speech semantic integration. The Journal of Neuroscience, 10356-10364. 10.1523/jneurosci.1355-21.2021.

      Hartwigsen, G., Bzdok, D., Klein, M., Wawrzyniak, M., Stockert, A., Wrede, K., Classen, J., and Saur, D. (2017). Rapid short-term reorganization in the language network. Elife 6. 10.7554/eLife.25964.

      Jackson, R.L., Hoffman, P., Pobric, G., and Ralph, M.A.L. (2016). The semantic network at work and rest: Differential connectivity of anterior temporal lobe subregions. Journal of Neuroscience 36, 1490-1501. 10.1523/JNEUROSCI.2999-15.2016.

      Humphreys, G. F., Lambon Ralph, M. A., & Simons, J. S. (2021). A Unifying Account of Angular Gyrus Contributions to Episodic and Semantic Cognition. Trends in neurosciences, 44(6), 452–463. https://doi.org/10.1016/j.tins.2021.01.006

      Bonner, M. F., & Price, A. R. (2013). Where is the anterior temporal lobe and what does it do?. The Journal of neuroscience : the official journal of the Society for Neuroscience, 33(10), 4213–4215. https://doi.org/10.1523/JNEUROSCI.0041-13.2013

      Comment 2: It could still equally well be the case that other regions or networks relevant for gesture-speech integration are targeted, and it can still be the case that these timewindows are not specific, and effects bleed into other time periods. There seems to be no experimental evidence here that this is not the case.

      The selection of IFG and pMTG as regions of interest was rigorously justified through multiple lines of evidence. First, a comprehensive meta-analysis of fMRI studies on gesture-speech integration consistently identified these regions as central nodes (see response to comment 1). Second, our own previous work (Zhao et al., 2018, JN; 2021, JN) provided direct empirical validation of their involvement. Third, by employing the same experimental paradigm, we minimized the likelihood of engaging alternative networks. Fourth, even if other regions connected to IFG or pMTG might be affected by TMS, the distinct engagement of specific time windows of IFG and pMTG minimizes the likelihood of consistent influence from other regions.

      Regarding temporal specificity, our 2021 study (Zhao et al., 2021, JN, see details in response to comment 1) systematically examined the entire 0-320ms integration window and found that only select time windows showed significant effects for gesture-speech semantic congruency, while remaining unaffected during gender congruency processing. This double dissociation (significant effects for semantic integration but not gender processing in specific windows) rules out broad temporal spillover.

      Comment 3: To be more specific, the authors write that double-pulse TMS has been widely used in previous studies (as found in their table). However, the studies cited in the table do not necessarily demonstrate the level of spatial and temporal specificity required to disentangle the contributions of tightly-coupled brain regions like the IFG and pMTG during the speech-gesture integration process. pMTG and IFG are located in very close proximity, and are known to be functionally and structurally interconnected, something that is not necessarily the case for the relatively large and/or anatomically distinct areas that the authors mention in their table.

      Our methodological approach is strongly supported by an established body of research employing double-pulse TMS (dpTMS) to investigate neural dynamics across both primary motor and higher-order cognitive regions. As documented in Author response table 1, multiple studies have successfully applied this technique to: (1) primary motor areas (tongue and lip representations in M1), and (2) semantic processing regions (including pMTG, PFC, and ATL). Particularly relevant precedents include:

      (1) Teige et al. (2018, Cortex): Demonstrated precise spatial and temporal specificity by applying 40ms-interval dpTMS to ATL, pMTG, and mid-MTG across multiple time windows (0-40ms, 125-165ms, 250-290ms, 450-490ms), revealing distinct functional contributions from ATL versus pMTG.

      (2) Vernet et al. (2015, Cortex): Successfully dissociated functional contributions of right IPS and DLPFC using 40ms-interval dpTMS, despite their anatomical proximity and functional connectivity.

      These studies confirm double-pulse TMS can discriminate interconnected nodes at short timescales. Our 2021 study further validated this for IFG-pMTG.

      Author response table 2.

      Double-pulse TMS studies on brain regions over 3-60 ms time interval

      References:

      Teige, C., Mollo, G., Millman, R., Savill, N., Smallwood, J., Cornelissen, P. L., & Jefferies, E. (2018). Dynamic semantic cognition: Characterising coherent and controlled conceptual retrieval through time using magnetoencephalography and chronometric transcranial magnetic stimulation. Cortex, 103, 329-349.

      Vernet, M., Brem, A. K., Farzan, F., & Pascual-Leone, A. (2015). Synchronous and opposite roles of the parietal and prefrontal cortices in bistable perception: a double-coil TMS–EEG study. Cortex, 64, 78-88.

      Comment 4: But also more in general: The mere fact that these methods have been used in other contexts does not necessarily mean they are appropriate or sufficient for investigating the current research question. Likewise, the cognitive processes involved in these studies are quite different from the complex, multimodal integration of gesture and speech. The authors have not provided a strong theoretical justification for why the temporal dynamics observed in these previous studies should generalize to the specific mechanisms of gesture-speech integration..

      The neurophysiological mechanisms underlying double-pulse TMS (dpTMS) are well-characterized. While it is established that single-pulse TMS can produce brief artifacts (typically within 0–10 ms) due to transient cortical depolarization (Romero et al., 2019, NC), the dynamics of double-pulse TMS (dpTMS) involve more intricate inhibitory interactions. Specifically, the first pulse increases membrane conductance via GABAergic shunting inhibition, effectively lowering membrane resistance and attenuating the excitatory impact of the second pulse. This results in a measurable reduction in cortical excitability at the paired-pulse interval, as evidenced by suppressed motor evoked potentials (MEPs) (Paulus & Rothwell, 2016, J Physiol). Importantly, this neurophysiological mechanism is independent of cognitive domain and has been robustly demonstrated across multiple functional paradigms.

      In our study, we did not rely on previously reported timing parameters but instead employed a dpTMS protocol using a 40-ms inter-pulse interval. Based on the inhibitory dynamics of this protocol, we designed a sliding temporal window sufficiently broad to encompass the integration period of interest. This approach enabled us to capture and localize the critical temporal window associated with ongoing integrative processing in the targeted brain region.

      We acknowledge that the previous phrasing may have been ambiguous, a clearer and more detailed description of the dpTMS protocol has now been provided in Lines 88-92: “To this end, we employed chronometric double-pulse transcranial magnetic stimulation, which is known to transiently reduce cortical excitability at the inter-pulse interval]27]. Within a temporal period broad enough to capture the full duration of gesture–speech integration[28], we targeted specific timepoints previously implicated in integrative processing within IFG and pMTG [23].”

      References:

      Romero, M.C., Davare, M., Armendariz, M. et al. Neural effects of transcranial magnetic stimulation at the single-cell level. Nat Commun 10, 2642 (2019). https://doi.org/10.1038/s41467-019-10638-7

      Paulus W, Rothwell JC. Membrane resistance and shunting inhibition: where biophysics meets state-dependent human neurophysiology. J Physiol. 2016 May 15;594(10):2719-28. doi: 10.1113/JP271452. PMID: 26940751; PMCID: PMC4865581.

      Obermeier, C., & Gunter, T. C. (2015). Multisensory Integration: The Case of a Time Window of Gesture-Speech Integration. Journal of Cognitive Neuroscience, 27(2), 292-307. https://doi.org/10.1162/jocn_a_00688

      Comment 5: Moreover, the studies cited in the table provided by the authors have used a wide range of interpulse intervals, from 20 ms to 100 ms, suggesting that the temporal precision required to capture the dynamics of gesture-speech integration (which is believed to occur within 200-300 ms; Obermeier & Gunter, 2015) may not even be achievable with their 40 ms time windows.

      Double-pulse TMS has been empirically validated across neurocognitive studies as an effective method for establishing causal temporal relationships in cortical networks, with demonstrated sensitivity at timescales spanning 3-60 m. Our selection of a 40-ms interpulse interval represents an optimal compromise between temporal precision and physiological feasibility, as evidenced by its successful application in dissociating functional contributions of interconnected regions including ATL/pMTG (Teige et al., 2018) and IPS/DLPFC (Vernet et al., 2015). This methodological approach combines established experimental rigor with demonstrated empirical validity for investigating the precisely timed IFG-pMTG dynamics underlying gesture-speech integration, as shown in our current findings and prior work (Zhao et al., 2021).

      Our experimental design comprehensively sampled the 0-320 ms post-stimulus period, fully encompassing the critical 200-300 ms window associated with gesture-speech integration, as raised by the reviewer. Notably, our results revealed temporally distinct causal dynamics within this period: the significantly reduced semantic congruency effect emerged at IFG at 200-240ms, followed by feedback projections from IFG to pMTG at 240-280ms. This precisely timed interaction provides direct neurophysiological evidence for the proposed architecture of gesture-speech integration, demonstrating how these interconnected regions sequentially contribute to multisensory semantic integration.

      Comment 6: I do appreciate the extra analyses that the authors mention. However, my 5th comment is still unanswered: why not use entropy scores as a continous measure?

      Analysis with MI and entropy as continuous variables were conducted employing Representational Similarity Analysis (RSA) (Popal et.al, 2019). This analysis aimed to build a model to predict neural responses based on these feature metrics.

      To capture dynamic temporal features indicative of different stages of multisensory integration, we segmented the EEG data into overlapping time windows (40 ms in duration with a 10 ms step size). The 40 ms window was chosen based on the TMS protocol used in Experiment 2, which also employed a 40 ms time window. The 10 ms step size (equivalent to 5 time points) was used to detect subtle shifts in neural responses that might not be captured by larger time windows, allowing for a more granular analysis of the temporal dynamics of neural activity.

      Following segmentation, the EEG data were reshaped into a four-dimensional matrix (42 channels × 20 time points × 97 time windows × 20 features). To construct a neural similarity matrix, we averaged the EEG data across time points within each channel and each time window. The resulting matrix was then processed using the pdist function to compute pairwise distances between adjacent data points. This allowed us to calculate correlations between the neural matrix and three feature similarity matrices, which were constructed in a similar manner. These three matrices corresponded to (1) gesture entropy, (2) speech entropy, and (3) mutual information (MI). This approach enabled us to quantify how well the neural responses corresponded to the semantic dimensions of gesture and speech stimuli at each time window.

      To determine the significance of the correlations between neural activity and feature matrices, we conducted 1000 permutation tests. In this procedure, we randomized the data or feature matrices and recalculated the correlations repeatedly, generating a null distribution against which the observed correlation values were compared. Statistical significance was determined if the observed correlation exceeded the null distribution threshold (p < 0.05). This permutation approach helps mitigate the risk of spurious correlations, ensuring that the relationships between the neural data and feature matrices are both robust and meaningful.

      Finally, significant correlations were subjected to clustering analysis, which grouped similar neural response patterns across time windows and channels. This clustering allowed us to identify temporal and spatial patterns in the neural data that consistently aligned with the semantic features of gesture and speech stimuli, thus revealing the dynamic integration of these multisensory modalities across time. Results are as follows:

      (1)  Two significant clusters were identified for gesture entropy (Figure 1 left). The first cluster was observed between 60-110 ms (channels F1 and F3), with correlation coefficients (r) ranging from 0.207 to 0.236 (p < 0.001). The second cluster was found between 210-280 ms (channel O1), with r-values ranging from 0.244 to 0.313 (p < 0.001).

      (2)  For speech entropy (Figure 1 middle), significant clusters were detected in both early and late time windows. In the early time windows, the largest significant cluster was found between 10-170 ms (channels F2, F4, F6, FC2, FC4, FC6, C4, C6, CP4, and CP6), with r-values ranging from 0.151 to 0.340 (p = 0.013), corresponding to the P1 component (0-100 ms). In the late time windows, the largest significant cluster was observed between 560-920 ms (across the whole brain, all channels), with r-values ranging from 0.152 to 0.619 (p = 0.013).

      (3)  For mutual information (MI) (Figure 1 right), a significant cluster was found between 270-380 ms (channels FC1, FC2, FC3, FC5, C1, C2, C3, C5, CP1, CP2, CP3, CP5, FCz, Cz, and CPz), with r-values ranging from 0.198 to 0.372 (p = 0.001).

      Author response image 1.

      Results of RSA analysis.

      These additional findings suggest that even using a different modeling approach, neural responses, as indexed by feature metrics of entropy and mutual information, are temporally aligned with distinct ERP components and ERP clusters, as reported in the current manuscript. This alignment serves to further consolidate the results, reinforcing the conclusion we draw. Considering the length of the manuscript, we did not include these results in the current manuscript.

      Reference:

      Popal, H., Wang, Y., & Olson, I. R. (2019). A guide to representational similarity analysis for social neuroscience. Social cognitive and affective neuroscience, 14(11), 1243-1253.

      Comment 7: In light of these concerns, I do not believe the authors have adequately demonstrated the spatial and temporal specificity required to disentangle the contributions of the IFG and pMTG during the gesture-speech integration process. While the authors have made a sincere effort to address the concerns raised by the reviewers, and have done so with a lot of new analyses, I remain doubtful that the current methodological approach is sufficient to draw conclusions about the causal roles of the IFG and pMTG in gesture-speech integration.

      To sum up:

      (1) Empirical validation from our prior work (Zhao et al., 2018,2021,JN): The selection of IFG and pMTG as target regions was informed by both: (1) a comprehensive meta-analysis of fMRI studies on gesture-speech integration, and (2) our own prior causal evidence from Zhao et al. (2018, J Neurosci), with detailed stereotactic coordinates provided in the attached Response to Editors and Reviewers letter. The temporal parameters were similarly grounded in empirical data from Zhao et al. (2021, J Neurosci), where we systematically examined eight consecutive 40-ms windows spanning the full integration period (0-320 ms). This study revealed a triple dissociation of effects - occurring exclusively during: (i)semantic integration (but not control tasks), (ii) active stimulation (but not sham), and (iii) specific time windows (but not all time windows)- providing robust causal evidence for the spatiotemporal specificity of IFG-pMTG interactions in gesture-speech processing. Notably, all reviewers recognized the methodological strength of this dpTMS approach in their evaluations (see attached JN assessment for details).

      (2) Convergent evidence from Experiment 3: Our study employed a multi-method approach incorporating three complementary experimental paradigms, each utilizing distinct neurophysiological techniques to provide converging evidence. Specifically, Experiment 3 implemented high-temporal-resolution EEG, which independently replicated the time-sensitive dynamics of gesture-speech integration observed in our double-pulse TMS experiments. The remarkable convergence between these methodologically independent approaches -demonstrating consistent temporal staging of IFG-pMTG interactions across both causal (TMS) and correlational (EEG) measures - significantly strengthens the validity and generalizability of our conclusions regarding the neural mechanisms underlying multisensory integration.

      (3) Established precedents in double-pulse TMS literature: The double-pulse TMS methodology employed in our study is firmly grounded in established neuroscience research. As documented in our detailed Response to Editors and Reviewers letter (citing 11 representative studies), dpTMS has been extensively validated for investigating causal temporal dynamics in cortical networks, with demonstrated sensitivity at timescales ranging from 3-60 ms. Particularly relevant precedents include: 1. Teige et al. (2018, Cortex) successfully dissociated functional contributions of anatomically proximal regions (ATL vs. pMTG vs.mid-MTG) using 40-ms-interval double-pulse TMS; 2. Vernet et al. (2015, Cortex) effectively distinguished neural processing in interconnected frontoparietal regions (right IPS vs. DLPFC) using 40-ms double-pulse TMS parameters. Both parameters are identical to those employed in our current study.

      (4) Neurophysiological Plausibility: The neurophysiological basis for the transient double-pulse TMS effects is well-established through mechanistic studies of TMS-induced cortical inhibition (Romero et al.,2019; Paulus & Rothwell, 2016).

      Taking together, we respectfully submit that our methodology provides robust support for our conclusions.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      The authors tried to identify the relationships between gut microbiota, lipid metabolites and the host in type 2 diabetes (T2DM) by using spontaneously developed T2DM in macaques, considered among the best human models.

      Strengths:

      The authors compared comprehensively the gut microbiota, plasma fatty acids between spontaneous T2DM and the control macaques, and tried verified the results with macaques in high-fat diet-fed mice model.

      Weaknesses:

      The observed multi-omics on macaques can be done on humans, which weakens the conclusion of the manuscript, unless the observation/data on macaques could cover during the onset of T2DM that would be difficult to obtain from humans.

      Regarding the metabolomic analysis on fatty acids, the authors did not include the results obtained form the macaque fecal samples which should be important considering the authors claimed the importance of gut microbiota in the pathogenesis of T2DM. Instead, the authors measured palmitic acid in the mouse model and tried to validate their conclusions with that.

      In murine experiments, palmitic acid-containing diet were fed to mice to induce diabetic condition, but this does not mimic spontaneous T2DM in macaques, since the authors did not measure in macaque feces (or at least did not show the data from macaque feces of) palmitic acid or other fatty acids; instead, they assumed from blood metabolome data that palmitic acid would be absorbed from the intestine to affect the host metabolism, and added palmitic acid in the diet in mouse experiments. Here involves the probable leap of logic to support their conclusions and title of the study.

      In addition, the authors measured omics data after, but not before, the onset of spontaneous T2DM of macaques. This can reveal microbiota dysbiosis driven purely by disease progression, but does not support the causative effect of gut microbiota on T2DM development that the authors claims.

      We are sorry for misunderstanding your point and failing to address your question regarding macaque fecal metabolomics in our previous response. Our study performed untargeted metabolomics on macaque feces and indeed detected the differential metabolite palmitic acid (PA) content, which showed an obvious decrease in T2DM macaques compared with the control (Table 1). However, the difference in PA level between the two groups was not significant (p = 0.17). It may be attributed to the limitation of untargeted metabolomics methodology in absolute quantitative analysis. In addition, we found many other long-chain fatty acids were down-regulated in the T2DM macaque feces (Table 1). Such results are consistent with our observation in murine experiments. We examined PA levels in the feces, ileum, and serum in mice and found that PA level was significantly decreased in fecal samples but increased in the ileum and serum. These findings demonstrated that without the transplantation of gut microbiota, the ileum could not absorb the PA effectively even at a high concentration of ingested PA. Only mice receiving fecal microbiota transplants from T2DM macaques and fed a high-PA diet showed a significant increase in the ileum and serum alongside a decrease in fecal PA concentration. Both the macaque metabolomics and mice experiment results suggest that gut microbiota mediated the absorption of excess PA in the ileum leading to the accumulation of PA in the serum. In the revised manuscript, we added the results of all differential metabolites in Table S2.

      Author response table 1.

      Table 1. Differential analysis of palmitic acid and other fatty acids from fecal untargeted metabolomics in macaques.

      Regarding the causative effect of gut microbiota on T2DM development, we agree with the reviewer that the omics data were obtained after, but not before, the onset of spontaneous T2DM macaques, the microbiota dysbiosis is probably driven by disease progression. For this reason, we have changed the title of our manuscript and some of our conclusions, which can be found in our response below.

      Reviewer #1 (Recommendations for the authors):

      As described above, the data presented does not support the notion that gut microbiota change in T2DM macaques promote the disease - rather it showed the outcome of the disease progression. In addition, the involvement of palmitic acid absorption was only shown in mice but not in macaques. Therefore, the authors should change their title and conclusions to more precisely reflect their observation.

      According to your suggestion, we changed the title and the conclusion to make them more precise and avoid emphasizing the causative effect of gut microbiota on T2DM. The new title is “Multi-omics investigation of spontaneous T2DM macaque emphasizes gut microbiota could up-regulate the absorption of excess palmitic acid in the T2DM progression”. We also revised the wording of the results and conclusions to acknowledge the limitation of our study, “We also revealed the specific structure of gut microbiota that promoted T2DM development by regulating the absorption of excess PA in mice, providing experimental evidence for the functional role of gut microbiota in T2DM pathogenesis.” (Lines 122-125), “In particular, concentrations of PA, palmitoleic acid, and oleic acid were significantly higher in the T2DM group than control group (p<0.05 and VIP>1). The concentration of PA in the plasma of T2DM macaques increased, while the concentration of palmitic acid in the feces decreased (Figures 3F and G, Table S2).” (Lines 228-233), and “Our study confirms the functional role of gut microbiota and PA in the T2DM progression. The microbiota composition, specifically higher abundance of R. gnavus (current name: M. gnavus) and Coprococcus sp., and lower abundance of Treponema, F. succinogenes, Christensenellaceae, and F16, promoted the absorption of excess PA which is important for the development of T2DM. However, in this study, such microbial alterations were detected in macaques after they had developed the disease of T2DM instead of before or onset of T2DM, the causative effect of gut microbiota and their action mechanism on the development of T2DM is worth further investigation.” (Lines 450-458).

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #2 (Public Review):

      The authors responded that they would lose statistical power by studying RTE subfamilies with limited microarray probes, which is a fair point. However, the suggested analysis could have been conducted using the RNA-seq data they explored in the second round of revision. Choosing not to leverage RNA-seq to increase the granularity of their analysis is a matter of choice. In my opinion, however, the authors could have acknowledged in the discussion that some smaller yet potentially influential RTE species may be masked by their global approach."

      We will add one sentence addressing this in the Version of Record.


      The following is the authors’ response to the original reviews.

      We thank Reviewer #1 for their constructive comments.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Tsai and Seymen et al. investigate associations between RTE expression and methylation and age and inflammation, using multiple public datasets. Compared to the previous round of review, the text of the manuscript has been polished and the phrasing of several findings has been made clearer and more precise. The authors also provided ample discussion to the prior reviewer comments in their rebuttal, including new analyses. All these changes are in the correct direction, however, I believe that part of the content of the rebuttal should be incorporated in the main text, for reasons that I will outline below. 

      Both reviewers found the reliance on microarray expression data to detract from the study. The authors argued that their choices are supported by existing publications which performed a similar quantification of TE expression using microarray data. It could still be argued that (as far as I can tell) Reichmann et al. used a substantially larger number of probes than this study, as a consequence of starting from different arrays, however, this is a minor point which the authors do not need to address. It is still undeniable that including the validation with RNA-seq data performed in the rebuttal would strengthen the manuscript. I especially believe that many readers would want to see this analysis be prominent in the manuscript, considering that both reviewers independently converged on the issue with microarray expression data. Personally, I would have included an RNA-seq dataset next to the microarray data in the main figures, however, I understand that this would require considerable restructuring and that placing RNAseq data besides array data might be misleading. Instead, I would ask that the authors include their rebuttal figures R1 and R2 as supplementary figures. 

      I would suggest introducing a new paragraph, between the section dedicated to expression data and the one dedicated to DNA methylation, mentioning the issues with microarray data (Some of which were mentioned by the reviewers and other which were mentioned by the authors in the discussion and introduction) to then introduce the validation with RNA-seq data. 

      We appreciate the reviewer’s understanding and detailed feedback. As suggested, Author response images 1 and 2 were added as supplementary figures to the manuscript, and one paragraph was added to the section investigating the correlation between RTE expression and chronological age. We have also added new descriptions to the introduction, discussion, and BAR analysis sections.

      Author response image 3 is also a good addition and should be expanded to include the GTP and MESA study and possibly mentioned in the paragraph titled "RTE expression positively correlates with BAR gene signature scores except for SINEs." 

      We have updated Author response image 3 (now Author response image 1) to include GTP and MESA cohorts in the analysis. As shown in Author response image 1, except IFN-I and senescence scores on the MESA cohort that positively correlate with chronological ageing, the rest of the gene signatures display no positive correlation with chronological ageing.  

      Author response image 1 was originally created to separate the effect of chronological age and RTE expression on BAR gene signature scores. As it was meant to discriminate between BAR and chronological age, it doesn't provide additional information regarding the positive correlation between RTE expression and BAR gene signature that was not already present in the manuscript. Therefore, we did not add it to the manuscript.

      Author response image 1.

      Generalized linear models (GLM) analysis (BAR gene signature scores ~ RTE expression +chronological age). For each RTE family, we separately performed GLM. Age (RTE family) indicates the chronological age when used in the design formula for that specific RTE family.

      "In this study, we did not compare MESA with GTP etc. We have analysed each dataset separately based on the available data for that dataset. Therefore, sacrificing one analysis because of the lack of information from the other does not make sense. We would do that if we were after comparing different datasets. Moreover, the datasets are not comparable because they were collected from different types of blood samples." 

      Indeed, the datasets are not compared directly, but the associations between age, BER and TE expression for each dataset are plotted and discussed right next to each other. It is therefore natural to wonder if the differences between datasets are due to differences in the type of blood sample or if they are a consequence of the different probe sets. Using a common set of probes would help answer that question.  

      We understand that the reviewer is proposing a method to eliminate the possible causes of differences across datasets. However, incorporating such change would compromise the statistic power of MESA and GARP cohorts and also change our analysis structurally and digress from our main focus. Hence, we disagree to use the identical set of probes for all three cohorts.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      We thank you for the time you took to review our work and for your feedback! 

      The major changes to the manuscript are:

      (1) We have added visual flow speed and locomotion velocity traces to Figure 5 as suggested.

      (2) We have rephrased the abstract to more clearly indicate that our statement regarding acetylcholine enabling faster switching of internal representations in layer 5 is speculative.

      (3) We have further clarified the positioning of our findings regarding the basal forebrain cholinergic signal in visual cortex in the introduction.

      (4) We have added a video (Video S1) to illustrate different mouse running speeds covered by our data.

      A detailed point-by-point response to all reviewer concerns is provided below.

      Reviewer #1 (Recommendations For The Authors):

      The authors have addressed most of the concerns raised in the initial review. While the paper has been improved, there are still some points of concern in the revised version. 

      Major comments

      (1) Page 1, Line 21: The authors claim, "Our results suggest that acetylcholine augments the responsiveness of layer 5 neurons to inputs from outside of the local network, enabling faster switching between internal representations during locomotion." However, it is not clear which specific data or results support the claim of "switching between internal representations." ... 

      Authors' response: "... That acetylcholine enables a faster switching between internal representations in layer 5 is a speculation. We have attempted to make this clearer in the discussion. ..." 

      In the revised version, there is no new data added to directly support the claim - "Our results suggest acetylcholine ..., enabling faster switching between internal representations during locomotion" (in the abstract). The authors themselves acknowledge that this statement is speculative. The present data only demonstrate that ACh reduces the response latency of L5 neurons to visual stimuli, but not that ACh facilitates quicker transitions in neuronal responses from one visual stimulus to another. To maintain scientific rigor and clarity, I recommend the authors amend this sentence to more accurately reflect the findings. 

      This might be a semantic disagreement? We would argue both a gray screen and a grating are visual stimuli. Hence, we are not sure we understand what the reviewer means by “but not that ACh facilitates quicker transitions in neuronal responses from one visual stimulus to another”. We concur, our data only address one of many possible transitions, but it is a switch between distinct visual stimuli that is sped up by ACh. Nevertheless, we have rephrased the sentence in question by changing “our data suggest” to “based on this we speculate” - but are not sure whether this addresses the reviewer’s concern.  

      (2) Page 4, Line 103: "..., a direct measurement of the activity of cholinergic projection from basal forebrain to the visual cortex during locomotion has not been made." This statement is incorrect. An earlier study by Reimer et al. indeed imaged cholinergic axons in the visual cortex of mice running on a wheel. 

      Authors' response: "We have clarified this as suggested. However, we disagree slightly with the reviewer here. The key question is whether the cholinergic axons imaged originate in basal forebrain. While Reimer et al. 2016 did set out to do this, we believe a number of methodological considerations prevent this conclusion: ... Collins et al. 2023 inject more laterally and thus characterize cholinergic input to S1 and A1, ..."

      The authors pointed out some methodological caveats in previous studies that measured the BF input in V1, and I agree with them on several points. Nonetheless, the statement that "a direct measurement of the activity of cholinergic projection from basal forebrain to visual cortex during locomotion has not been made. ... Prior measurements of the activity of cholinergic axons in visual cortex have all relied on data from a cross of ChAT-Cre mice with a reporter line ..." (Page 4, Line 103) seems to be an oversimplification. In fact, contrary to what the authors noted, Collins et al. (2023) conducted direct imaging of BF cholinergic axons in V1 (Fig. 1) - "Selected axon segments were chosen from putative retrosplenial, somatosensory, primary and secondary motor, and visual cortices". They used a viral approach to express GCaMP in BF axons to bypass the limitations associated with the use of a GCaMP reporter mouse line - "Viral injections were used for BF- ACh studies to avoid imaging axons or dendrites from cholinergic projections not arising from the BF (e.g. cortical cholinergic interneurons)." The authors should reconsider the text. 

      The reason we think that our statement here was – while simplified – accurate, is that Collins et al. do record from cholinergic axons in V1, but they don’t show these data (they only show pooled data across all recordings sites). By superimposing the recording locations of the Collins paper on the Allen mouse brain atlas (Figure R1), we estimate that of the approximately 50 recording sites, most are in somatosensory and somatomotor areas of cortex, and only 1 appears to be in V1, something that is often missed as it is not really highlighted in that paper. If this is indeed correct, we would argue that the data in the Collins et al. paper are not representative of cholinergic activity in visual cortex (we fear only the authors would know for sure). Nevertheless, we have rephrased again. 

      Author response image 1.

      Overlay of the Collins et al. imaging sites (red dots, black outline and dashed circle) on the Allen mouse brain atlas (green shading). Very few (we estimate that it was only 1) of the recording sites appear to be in V1 (the lightest green area), and maybe an additional 4 appear to be in secondary visual areas.  

      Minor comments

      (1) It is unclear which BF subregion(s) were targeted in this study. 

      Authors' response: Thanks for pointing this out. We targeted the entire basal forebrain (medial septum, vertical and horizontal limbs of the diagonal band, and nucleus basalis) with our viral injections. ... We have now added the labels for basal forebrain subregions targeted next to the injection coordinates in the manuscript. 

      The authors provided the coordinates for their virus injections targeting the BF subregions - "(AP, ML, DV (in mm): ... ; +0.6, +0.6, -4.9 (nucleus basalis) ..." Is this the right coordinates for the nucleus basalis? 

      Thank you for catching this - this was indeed incorrect. The coordinates were correct, but our annotation of brain region was not (as the reviewer correctly points out, these coordinates are in the horizontal limb of the diagonal band, not the nucleus basalis). We have corrected this.

      Reviewer #2 (Recommendations For The Authors):

      Thank you for addressing most of the points raised in my original review. I still some concerns relating to the analysis of the data. 

      (1) I appreciate the authors point that getting mice to reliably during head-fixed recordings can require training. Since mice in this study were not trained to run, their low speed of locomotion limits the interpretation of the results. I think this is an important potential caveat and I have retained it in the public review. 

      This might be a misunderstanding. The Jordan paper was a bit of an outlier in that we needed mice to run at very high rates due to fact that our recording times was only minutes. Mice were chosen such that they would more or less continuously run, to maximize the likelihood that they would run during the intracellular recordings. This was what we tried to convey in our previous response. The speed range covered by the analysis in this paper is 0 cm/s to 36 cm/s. 36 cm/s is not far away from the top speed mice can reach on this treadmill (30 cm/s is 1 revolution of the treadmill per second). In our data, the top speed we measured across all mice was 36 cm/s. In the Jordan paper, the peak running speed across the entire dataset was 44 cm/s. Based on the reviewer’s comment, we suspect that the reviewer may be under the impression that 30 cm/s is a relatively slow running speed. To illustrate what this looks like we have made added a video (Video S1) to illustrate different running speeds. 

      (2) The majority of the analyses in the revised manuscript focus on grand average responses, which may mask heterogeneity in the underlying neural populations. This could be addressed by analysing the magnitude and latency of responses for individual neurons. For example, if I understand correctly, the analyses include all neurons, whether or not they are activated, inhibited, or unaffected by visual stimulation and locomotion. For example, while on average layer 2/3 neurons are suppressed by the grating stimulus (Figure 4A), presumable a subset are activated. Evaluating the effects of optogenetic stimulation and locomotion without analyzing them at the level of individual neurons could result in misleading conclusions. This could be presented in the form of a scatter plot, depicting the magnitude of neuronal responses in locomotion vs stationary condition, and opto+ vs no opto conditions. 

      We might be misunderstanding. The first part of the comment is a bit too unspecific to address directly. In cases in which we find the variability is relevant to our conclusions, we do show this for individual cells (e.g.the latencies to running onset are shown as histograms for all cells and axons in Figure S1). It is also unclear to us what the reviewer means by “Evaluating the effects of optogenetic stimulation and locomotion without analyzing them at the level of individual neurons could result in misleading conclusions”. Our conclusions relate to the average responses in L2/3, consistent with the analysis shown. All data will be freely available for anyone to perform follow-up analysis of things we may have missed. E.g., the specific suggestion of presenting the data shown in Figure 4 as a scatter plot is shown below (Figure R2). This is something we had looked at but found not to be relevant to our conclusions. The problem with this analysis is that it is difficult to estimate how much the different sources of variability contribute to the total variability observed in the data, and no interesting pattern is clearly apparent. All relevant and clear conclusions are already captured by the mean differences shown in Figure 4. 

      Author response image 2.

      Optogenetic activation of cholinergic axons in visual cortex primarily enhances responses of layer 5, but not layer 2/3 neurons. Related to Figure 4. (A) Average calcium response of layer 2/3 neurons in visual cortex to full field drifting grating in the absence or presence of locomotion. Each dot is the average calcium activity of an individual neuron during the two conditions. (B) As in A, but for layer 5 neurons. (C) As in A, but comparing the average response while the mice were stationary, to that while cholinergic axons were optogenetically stimulated. (D) As in C, but for layer 5 neurons. (E) Average calcium response of layer 2/3 neurons in visual cortex to visuomotor mismatch, without and with optogenetic stimulation of cholinergic axons in visual cortex. (F) As in E, but for layer 5 neurons. (G) Average calcium response of layer 2/3 neurons in visual cortex to locomotion onset in closed loop, without and with optogenetic stimulation of cholinergic axons in visual cortex. (H) As in G, but for layer 5 neurons.

      (3) To help the reader understand the experimental conditions in open loop experiments, please include average visual flow speed traces for each condition in Figure 5. 

      We have added the locomotion velocity and visual flow speeds to the corresponding conditions in Figure

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1:

      Summary:

      The work by Combrisson and colleagues investigates the degree to which reward and punishment learning signals overlap in the human brain using intracranial EEG recordings. The authors used information theory approaches to show that local field potential signals in the anterior insula and the three sub regions of the prefrontal cortex encode both reward and punishment prediction errors, albeit to different degrees. Specifically, the authors found that all four regions have electrodes that can selectively encode either the reward or the punishment prediction errors. Additionally, the authors analyzed the neural dynamics across pairs of brain regions and found that the anterior insula to dorsolateral prefrontal cortex neural interactions were specific for punishment prediction errors whereas the ventromedial prefrontal cortex to lateral orbitofrontal cortex interactions were specific to reward prediction errors. This work contributes to the ongoing efforts in both systems neuroscience and learning theory by demonstrating how two differing behavioral signals can be differentiated to a greater extent by analyzing neural interactions between regions as opposed to studying neural signals within one region.

      Strengths:

      The experimental paradigm incorporates both a reward and punishment component that enables investigating both types of learning in the same group of subjects allowing direct comparisons.

      The use of intracranial EEG signals provides much needed insight into the timing of when reward and punishment prediction errors signals emerge in the studied brain regions.

      Information theory methods provide important insight into the interregional dynamics associated with reward and punishment learning and allows the authors to assess that reward versus punishment learning can be better dissociated based on interregional dynamics over local activity alone.

      We thank the reviewer for this accurate summary. Please find below our answers to the weaknesses raised by the reviewer.

      Weaknesses:

      The analysis presented in the manuscript focuses solely on gamma band activity. The presence and potential relevance of other frequency bands is not discussed. It is possible that slow oscillations, which are thought to be important for coordinating neural activity across brain regions could provide additional insight.

      We thank the reviewer for pointing us to this missing discussion in the first version of the manuscript. We now made this point clearer in the Methods sections entitled “iEEG data analysis” and “Estimate of single-trial gamma-band activity”:

      “Here, we focused solely on broadband gamma for three main reasons. First, it has been shown that the gamma band activity correlates with both spiking activity and the BOLD fMRI signals (Lachaux et al., 2007; Mukamel et al., 2004; Niessing et al., 2005; Nir et al., 2007), and it is commonly used in MEG and iEEG studies to map task-related brain regions (Brovelli et al., 2005; Crone et al., 2006; Vidal et al., 2006; Ball et al., 2008; Jerbi et al., 2009; Darvas et al., 2010; Lachaux et al., 2012; Cheyne and Ferrari, 2013; Ko et al., 2013). Therefore, focusing on the gamma band facilitates linking our results with the fMRI and spiking literatures on probabilistic learning. Second, single-trial and time-resolved high-gamma activity can be exploited for the analysis of cortico-cortical interactions in humans using MEG and iEEG techniques (Brovelli et al., 2015; 2017; Combrisson et al., 2022). Finally, while previous analyses of the current dataset (Gueguen et al., 2021) reported an encoding of PE signals at different frequency bands, the power in lower frequency bands were shown to carry redundant information compared to the gamma band power.”

      The data is averaged across all electrodes which could introduce biases if some subjects had many more electrodes than others. Controlling for this variation in electrode number across subjects would ensure that the results are not driven by a small subset of subjects with more electrodes.

      We thank the reviewer for raising this important issue. We would like to point out that the gamma activity was not averaged across bipolar recordings within an area, nor measures of connectivity. Instead, we used a statistical approach proposed in a previous paper that combines non-parametric permutations with measures of information (Combrisson et al., 2022). As we explain in the “Statistical analysis” section, mutual information (MI) is estimated between PE signals and single-trial modulations in gamma activity separately for each contact (or for each pair of contacts). Then, a one-sample t-test is computed across all of the recordings of all subjects to form the effect size at the group-level. We will address the point of the electrode number in our answer below.

      The potential variation in reward versus punishment learning across subjects is not included in the manuscript. While the time course of reward versus punishment prediction errors is symmetrical at the group level, it is possible that some subjects show faster learning for one versus the other type which can bias the group average. Subject level behavioral data along with subject level electrode numbers would provide more convincing evidence that the observed effects are not arising from these potential confounds.

      We thank the reviewer for the two points raised. We performed additional analyses at the single-participant level to address the issues raised by the reviewer. We should note, however, that these results are descriptive and cannot be generalized to account for population-level effects. As suggested by the reviewer, we prepared two new figures. The first supplementary figure summarizes the number of participants that had iEEG contacts per brain region and pair of brain regions (Fig. S1A in the Appendix). It can be seen that the number of participants sampled in different brain regions is relatively constant (left panel) and the number of participants with pairs of contacts across brain regions is relatively homogeneous, ranging from 7 to 11 (right panel). Fig. S1B shows the number of bipolar derivations per subject and per brain region.

      Author response image 1.

      Single subject anatomical repartition. (A) Number of unique subject per brain region and per pair of brain regions (B) Number of bipolar derivations per subject and per brain region

      The second supplementary figure describes the estimated prediction error for rewarding and punishing trials for each subject (Fig. S2). The single-subject error bars represent the 95th percentile confidence interval estimated using a bootstrap approach across the different pairs of stimuli presented during the three to six sessions. As the reviewer anticipated, there are indeed variations across subjects, but we observe that RPE and PPE are relatively symmetrical, even at the subject level, and tend toward zero around trial number 10. These results therefore corroborate the patterns observed at the group-level.

      Author response image 2.

      Single-subject estimation of predictions errors. Single-subject trial-wise reward PE (RPE - blue) and punishment PE (PPE - red), ± 95% confidence interval.

      Finally, to assess the variability of local encoding of prediction errors across participants, we quantified the proportion of subjects having at least one significant bipolar derivation encoding either the RPE or PPE (Fig. S4). As expected, we found various proportions of unique subjects with significant R/PPE encoding per region. The lowest proportion was achieved in the ventromedial prefrontal cortex (vmPFC) and lateral orbitofrontal cortex (lOFC) for encoding PPE and RPE, respectively, with approximately 30% of the subjects having the effect. Conversely, we found highly reproducible encodings in the anterior insula (aINS) and dorsolateral prefrontal cortex (dlPFC) with a maximum of 100% of the 9 subjects having at least one bipolar derivation encoding PPE in the dlPFC.

      Author response image 3.

      Taken together, we acknowledge a certain variability per region and per condition. Nevertheless, the results presented in the supplementary figures suggest that the main results do not arise from a minority of subjects.

      We would like to point out that in order to assess across-subject variability, a much larger number of participants would have been needed, given the low signal-to-noise ratios observed at the single-participant level. We thus prefer to add these results as supplementary material in the Appendix, rather than in the main text.

      It is unclear if the findings in Figures 3 and 4 truly reflect the differential interregional dynamics in reward versus punishment learning or if these results arise as a statistical byproduct of the reward vs punishment bias observed within each region. For instance, the authors show that information transfer from anterior insula to dorsolateral prefrontal cortex is specific to punishment prediction error. However, both anterior insula and dorsolateral prefrontal cortex have higher prevalence of punishment prediction error selective electrodes to begin with. Therefore the findings in Fig 3 may simply be reflecting the prevalence of punishment specificity in these two regions above and beyond a punishment specific neural interaction between the two regions. Either mathematical or analytical evidence that assesses if the interaction effect is simply reflecting the local dynamics would be important to make this result convincing.

      This is an important point that we partly addressed in the manuscript. More precisely, we investigated whether the synergistic effects observed between the dlPFC and vmPFC encoding global PEs (Fig. 5) could be explained by their respective local specificity. Indeed, since we reported larger proportions of recordings encoding the PPE in the dlPFC and the RPE in the vmPFC (Fig. 2B), we checked whether the synergy between dlPFC and vmPFC could be mainly due to complementary roles where the dlPFC brings information about the PPE only and the vmPFC brings information to the RPE only. To address this point, we selected PPE-specific bipolar derivations from the dlPFC and RPE-specific from the vmPFC and, as the reviewer predicted, we found synergistic II between the two regions probably mainly because of their respective specificity. In addition, we included the II estimated between non-selective bipolar derivations (i.e. recordings with significant encoding for both RPE and PPE) and we observed synergistic interactions (Fig. 5C and Fig. S9). Taken together, the local specificity certainly plays a role, but this is not the only factor in defining the type of interactions.

      Concerning the interaction information results (II, Fig. 3), several lines of evidence suggest that local specificity cannot account alone for the II effects. For example, the local specificity for PPE is observed across all four areas (Fig. 2A) and the percentage of bipolar derivations displaying an effect is large (equal or above 10%) for three brain regions (aINS, dlPLF and lOFC). If the local specificity were the main driving cause, we would have observed significant redundancy between all pairs of brain regions. On the other hand, the interaction between the aINS and lOFC displayed no significant redundant effect (Fig. 3B). Another example is the result observed in lOFC: approximately 30% of bipolar derivations display a selectivity for PPE (Fig. 2B, third panel from the left), but do not show clear signs of redundant encoding at the level of within-area interactions (Fig. 3A, bottom-left panel). Similarly, the local encoding for RPE is observed across all four brain regions (Fig. 2A) and the percentage of bipolar derivations displaying an effect is large (equal or above 10%) for three brain regions (aINS, dlPLF and vmPFC). Nevertheless, significant between-regions interactions have been observed only between the lOFC and vmPFC (Fig. 3B bottom right panel).

      To further support the reasoning, we performed a simulation to show that it is possible to observe synergistic interactions between two regions with the same specificity. As an example, we may consider one region locally encoding early trials of RPE and a second region encoding the late trials of the RPE. Combining the two with the II would lead to synergistic interactions, because each one of them carries information that is not carried by the other. To illustrate this point, we simulated the data of two regions (x and y). To simulate redundant interactions (first row), each region receives a copy of the prediction (one-to-all) and for the synergy (second row), x and y receive early and late PE trials, respectively (all-to-one). This toy example illustrates that the local specificity is not the only factor determining the type of their interactions. We added the following result to the Appendix.

      Author response image 4.

      Local specificity does not fully determine the type of interactions. Within-area local encoding of PE using the mutual information (MI, in bits) for regions X and Y and between-area interaction information (II, in bits) leading to (A) redundant interactions and (B) synergistic interactions about the PE

      Regarding the information transfer results (Fig. 4), similar arguments hold and suggest that the prevalence is not the main factor explaining the arising transfer entropy between the anterior insula (aINS) and dorsolateral prefrontal cortex (dlPFC). Indeed, the lOFC has a strong local specificity for PPE, but the transfer entropy between the lOFC and aINS (or dlPFC) is shown in Fig. S7 does not show significant differences in encoding between PPE and RPE.

      Indeed, such transfer can only be found when there is a delay between the gamma activity of the two regions. In this example, the transfer entropy quantifies the amount of information shared between the past activity of the aINS and the present activity of the dlPFC conditioned on the past activity of the dlPFC. The conditioning ensures that the present activity of the dlPFC is not only explained by its own past. Consequently, if both regions exhibit various prevalences toward reward and punishment but without delay (i.e. at the same timing), the transfer entropy would be null because of the conditioning. As a fact, between 10 to -20% of bipolar recordings show a selectivity to the reward PE (represented by a proportion of 40-60% of subjects, Fig.S4). However, the transfer entropy estimated from the aINS to the dlPFC across rewarding trials is flat and clearly non-significant. If the transfer entropy was a byproduct of the local specificity then we should observe an increase, which is not the case here.

      Reviewer #2:

      Summary:

      Reward and punishment learning have long been seen as emerging from separate networks of frontal and subcortical areas, often studied separately. Nevertheless, both systems are complimentary and distributed representations of rewards and punishments have been repeatedly observed within multiple areas. This raised the unsolved question of the possible mechanisms by which both systems might interact, which this manuscript went after. The authors skillfully leveraged intracranial recordings in epileptic patients performing a probabilistic learning task combined with model-based information theoretical analyses of gamma activities to reveal that information about reward and punishment was not only distributed across multiple prefrontal and insular regions, but that each system showed specific redundant interactions. The reward subsystem was characterized by redundant interactions between orbitofrontal and ventromedial prefrontal cortex, while the punishment subsystem relied on insular and dorsolateral redundant interactions. Finally, the authors revealed a way by which the two systems might interact, through synergistic interaction between ventromedial and dorsolateral prefrontal cortex.

      Strengths:

      Here, the authors performed an excellent reanalysis of a unique dataset using innovative approaches, pushing our understanding on the interaction at play between prefrontal and insular cortex regions during learning. Importantly, the description of the methods and results is truly made accessible, making it an excellent resource to the community.

      This manuscript goes beyond what is classically performed using intracranial EEG dataset, by not only reporting where a given information, like reward and punishment prediction errors, is represented but also by characterizing the functional interactions that might underlie such representations. The authors highlight the distributed nature of frontal cortex representations and propose new ways by which the information specifically flows between nodes. This work is well placed to unify our understanding of the complementarity and specificity of the reward and punishment learning systems.

      We thank the reviewer for the positive feedback. Please find below our answers to the weaknesses raised by the reviewer.

      Weaknesses:

      The conclusions of this paper are mostly supported by the data, but whether the findings are entirely generalizable would require further information/analyses.

      First, the authors found that prediction errors very quickly converge toward 0 (less than 10 trials) while subjects performed the task for sets of 96 trials. Considering all trials, and therefore having a non-uniform distribution of prediction errors, could potentially bias the various estimates the authors are extracting. Separating trials between learning (at the start of a set) and exploiting periods could prove that the observed functional interactions are specific to the learning stages, which would strengthen the results.

      We thank the reviewer for this question. We would like to note that the probabilistic nature of the learning task does not allow a strict distinction between the exploration and exploitation phases. Indeed, the probability of obtaining the less rewarding outcome was 25% (i.e., for 0€ gain in the reward learning condition and -1€ loss in the punishment learning condition). Thus, participants tended to explore even during the last set of trials in each session. This is evident from the average learning curves shown in Fig. 1B of (Gueguen et al., 2021). Learning curves show rates of correct choice (75% chance of 1€ gain) in the reward condition (blue curves) and incorrect choice (75% chance of 1€ loss) in the punishment condition (red curves).

      For what concerns the evolution of PEs, as reviewer #1 suggested, we added a new figure representing the single-subject estimates of the R/PPE (Fig S2). Here, the confidence interval is obtained across all pairs of stimuli presented during the different sessions. We retrieved the general trend of the R/PPE converging toward zero around 10 trials. Both average reward and punishment prediction errors converge toward zero in approximately 10 trials, single-participant curves display large variability, also at the end of each session. As a reminder, the 96 trials represent the total number of trials for one session for the four pairs and the number of trials for each stimulus was only 24.

      Author response image 5.

      Single-subject estimation of predictions errors. Single-subject trial-wise reward PE (RPE - blue) and punishment PE (PPE - red), ± 95% confidence interval

      However, the convergence of the R/PPE is due to the average across the pairs of stimuli. In the figure below, we superimposed the estimated R/PPE, per pair of stimuli, for each subject. It becomes very clear that high values of PE can be reached, even for late trials. Therefore, we believe that the split into early/late trials because of the convergence of PE is far from being trivial.

      Author response image 6.

      Single-subject estimation of predictions errors per pair of stimuli. Single-subject trial-wise reward PE (RPE - blue) and punishment PE (PPE - red)

      Consequently, nonzero PRE and PPE occur during the whole session and separating trials between learning (at the start of a set) and exploiting periods, as suggested by the reviewer, does not allow a strict dissociation between learning vs no-learning. Nevertheless, we tested the analysis proposed by the reviewer, at the local level. We splitted the 24 trials of each pair of stimuli into early, middle and late trials (8 trials each). We then reproduced Fig. 2 by computing the mutual information between the gamma activity and the R/PPE for subsets of trials: early (first row) and late trials (second row). We retrieved significant encoding of both R/PPE in the aINS, dlPFC and lOFC in both early and late trials. The vmPFC also showed significant encoding of both during early trials. The only difference emerges in the late trials of the vmPFC where we found a strong encoding of the RPE only. It should also be noted that here since we are sub-selecting the trials, the statistical analyses are only performed using a third of the trials.

      Taken together, the combination of high values of PE achieved even for late trials and the fact that most of the findings are reproduced even with a third of the trials does not justify the split into early and late trials here. Crucially, this latest analysis confirms that the neural correlates of learning that we observed reflect PE signals rather than early versus late trials in the session.

      Author response image 7.

      MI between gamma activity and R/PPE using early and late trials. Time courses of MI estimated between the gamma power and both RPE (blue) and PPE (red) using either early or late trials (first and second row, respectively). Horizontal thick lines represent significant clusters of information (p<0.05, cluster-based correction, non-parametric randomization across epochs).

      Importantly, it is unclear whether the results described are a common feature observed across subjects or the results of a minority of them. The authors should report and assess the reliability of each result across subjects. For example, the authors found RPE-specific interactions between vmPFC and lOFC, even though less than 10% of sites represent RPE or both RPE/PPE in lOFC. It is questionable whether such a low proportion of sites might come from different subjects, and therefore whether the interactions observed are truly observed in multiple subjects. The nature of the dataset obviously precludes from requiring all subjects to show all effects (given the known limits inherent to intracerebral recording in patients), but it should be proven that the effects were reproducibly seen across multiple subjects.

      We thank the reviewer for this remark that has also been raised by the first reviewer. This issue was raised by the first reviewer. Indeed, we added a supplementary figure describing the number of unique subjects per brain region and per pair of brain regions (Fig. S1A) such as the number of bipolar derivations per region and per subject (Fig. S1B).

      Author response image 8.

      Single subject anatomical repartition. (A) Number of unique subject per brain region and per pair of brain regions (B) Number of bipolar derivations per subject and per brain region

      Regarding the reproducibility of the results across subjects for the local analysis (Fig. 2), we also added the instantaneous proportion of subjects having at least one bipolar derivation showing a significant encoding of the RPE and PPE (Fig. S4). We found a minimum proportion of approximately 30% of unique subjects having the effect in the lOFC and vmPFC, respectively with the RPE and PPE. On the other hand, both the aINS and dlPFC showed between 50 to 100% of the subjects having the effect. Therefore, local encoding of RPE and PPE was never represented by a single subject.

      Author response image 9.

      Similarly, we performed statistical analysis on interaction information at the single-subject level and counted the proportion of unique subjects having at least one pair of recordings with significant redundant and synergistic interactions about the RPE and PPE (Fig. S5). Consistently with the results shown in Fig. 3, the proportions of significant redundant and synergistic interactions are negative and positive, respectively. For the within-regions interactions, approximately 60% of the subjects with redundant interactions are about R/PPE in the aINS and about the PPE in the dlPFC and 40% about the RPE in the vmPFC. For the across-regions interactions, 60% of the subjects have redundant interactions between the aINS-dlPFC and dlPFC-lOFC about the PPE, and 30% have redundant interactions between lOFC-vmPFC about the RPE. Globally, we reproduced the main results shown in Fig. 3.

      Author response image 10.

      Inter-subjects reproducibility of redundant interactions about PE signals. Time-courses of proportion of subjects having at least one pair of bipolar derivation with a significant interaction information (p<0.05, cluster-based correction, non-parametric randomization across epochs) about the RPE (blue) or PPE (red). Data are aligned to the outcome presentation (vertical line at 0 seconds). Proportion of subjects with redundant (solid) and synergistic (dashed) interactions are respectively going downward and upward.

      Finally, the timings of the observed interactions between areas preclude one of the authors' main conclusions. Specifically, the authors repeatedly concluded that the encoding of RPE/PPE signals are "emerging" from redundancy-dominated prefrontal-insular interactions. However, the between-region information and transfer entropy between vmPFC and lOFC for example is observed almost 500ms after the encoding of RPE/PPE in these regions, questioning how it could possibly lead to the encoding of RPE/PPE. It is also noteworthy that the two information measures, interaction information and transfer entropy, between these areas happened at non overlapping time windows, questioning the underlying mechanism of the communication at play (see Figures 3/4). As an aside, when assessing the direction of information flow, the authors also found delays between pairs of signals peaking at 176ms, far beyond what would be expected for direct communication between nodes. Discussing this aspect might also be of importance as it raises the possibility of third-party involvement.

      The local encoding of RPE in the vmPFC and lOFC is observed in a time interval ranging from approximately 0.2-0.4s to 1.2-1.4s after outcome presentation (blue bars in Fig. 2A). The encoding of RPE by interaction information covers a time interval from approximately 1.1s to 1.5s (blue bars in Fig. 3B, bottom right panel). Similarly, significant TE modulations between the vmPFC and lOFC specific for PPE occur mainly in the 0.7s-1.1s range. Thus, it seems that the local encoding of PPE precedes the effects observed at the level of the neural interactions (II and TE). On the other hand, the modulations in MI, II and TE related to PPE co-occur in a time window from 0.2s to 0.7s after outcome presentation. Thus, we agree with the reviewer that a generic conclusion about the potential mechanisms relating the three levels of analysis cannot be drawn. We thus replaced the term “emerge from” by “occur with” from the manuscript which may be misinterpreted as hinting at a potential mechanism. We nevertheless concluded that the three levels of analysis (and phenomena) co-occur in time, thus hinting at a potential across-scales interaction that needs further study. Indeed, our study suggests that further work, beyond the scope of the current study, is required to better understand the interaction between scales.

      Regarding the delay for the conditioning of the transfer entropy, the value of 176 ms reflects the delay at which we observed a maximum of transfer entropy. However, we did not use a single delay for conditioning, we used every possible delay between [116, 236] ms, as explained in the Method section. We would like to stress that transfer entropy is a directed metric of functional connectivity, and it can only be interpreted as quantifying statistical causality defined in terms of predictacìbility according to the Wiener-Granger principle, as detailed in the methods. Thus, it cannot be interpreted in Pearl’s causal terms and as indexing any type of direct communication between nodes. This is a known limitation of the method, which has been stressed in past literature and that we believe does not need to be addressed here.

      To account for this, we revised the discussion to make sure this issue is addressed in the following paragraph:

      “Here, we quantified directional relationships between regions using the transfer entropy (Schreiber, 2000), which is a functional connectivity measure based on the Granger-Wiener causality principle. Tract tracing studies in the macaque have revealed strong interconnections between the lOFC and vmPFC in the macaque (Carmichael and Price, 1996; Öngür and Price, 2000). In humans, cortico-cortical anatomical connections have mainly been investigated using diffusion magnetic resonance imaging (dMRI). Several studies found strong probabilities of structural connectivity between the anterior insula with the orbitofrontal cortex and dorsolateral part of the prefrontal cortex (Cloutman et al., 2012; Ghaziri et al., 2017), and between the lOFC and vmPFC (Heather Hsu et al., 2020). In addition, the statistical dependency (e.g. coherence) between the LFP of distant areas could be potentially explained by direct anatomical connections (Schneider et al., 2021; Vinck et al., 2023). Taken together, the existence of an information transfer might rely on both direct or indirect structural connectivity. However, here we also reported differences of TE between rewarding and punishing trials given the same backbone anatomical connectivity (Fig. 4). [...] “

      Reviewer #3:

      Summary:

      The authors investigated that learning processes relied on distinct reward or punishment outcomes in probabilistic instrumental learning tasks were involved in functional interactions of two different cortico-cortical gamma-band modulations, suggesting that learning signals like reward or punishment prediction errors can be processed by two dominated interactions, such as areas lOFC-vmPFC and areas aINS-dlPFC, and later on integrated together in support of switching conditions between reward and punishment learning. By performing the well-known analyses of mutual information, interaction information, and transfer entropy, the conclusion was accomplished by identifying directional task information flow between redundancy-dominated and synergy-dominated interactions. Also, this integral concept provided a unifying view to explain how functional distributed reward and/or punishment information were segregated and integrated across cortical areas.

      Strengths:

      The dataset used in this manuscript may come from previously published works (Gueguen et al., 2021) or from the same grant project due to the methods. Previous works have shown strong evidence about why gamma-band activities and those 4 areas are important. For further analyses, the current manuscript moved the ideas forward to examine how reward/punishment information transfer between recorded areas corresponding to the task conditions. The standard measurements such mutual information, interaction information, and transfer entropy showed time-series activities in the millisecond level and allowed us to learn the directional information flow during a certain window. In addition, the diagram in Figure 6 summarized the results and proposed an integral concept with functional heterogeneities in cortical areas. These findings in this manuscript will support the ideas from human fMRI studies and add a new insight to electrophysiological studies with the non-human primates.

      We thank the reviewer for the summary such as for highlighting the strengths. Please find below our answers regarding the weaknesses of the manuscript.

      Weaknesses:

      After reading through the manuscript, the term "non-selective" in the abstract confused me and I did not actually know what it meant and how it fits the conclusion. If I learned the methods correctly, the 4 areas were studied in this manuscript because of their selective responses to the RPE and PPE signals (Figure 2). The redundancy- and synergy-dominated subsystems indicated that two areas shared similar and complementary information, respectively, due to the negative and positive value of interaction information (Page 6). For me, it doesn't mean they are "non-selective", especially in redundancy-dominated subsystem. I may miss something about how you calculate the mutual information or interaction information. Could you elaborate this and explain what the "non-selective" means?

      In the study performed by Gueguen et al. in 2021, the authors used a general linear model (GLM) to link the gamma activity to both the reward and punishment prediction errors and they looked for differences between the two conditions. Here, we reproduced this analysis except that we used measures from the information theory (mutual information) that were able to capture linear and non-linear relationships (although monotonic) between the gamma activity and the prediction errors. The clusters we reported reflect significant encoding of either the RPE and/or the PPE. From Fig. 2, it can be seen that the four regions have a gamma activity that is modulated according to both reward and punishment PE. We used the term “non-selective”, because the regions did not encode either one or the other, but various proportions of bipolar derivations encoding either one or both of them.

      The directional information flows identified in this manuscript were evidenced by the recording contacts of iEEG with levels of concurrent neural activities to the task conditions. However, are the conclusions well supported by the anatomical connections? Is it possible that the information was transferred to the target via another area? These questions may remain to be elucidated by using other approaches or animal models. It would be great to point this out here for further investigation.

      We thank the reviewer for this interesting question. We added the following paragraph to the discussion to clarify the current limitations of the transfer entropy and the link with anatomical connections :

      “Here, we quantified directional relationships between regions using the transfer entropy (Schreiber, 2000), which is a functional connectivity measure based on the Granger-Wiener causality principle. Tract tracing studies in the macaque have revealed strong interconnections between the lOFC and vmPFC in the macaque (Carmichael and Price, 1996; Öngür and Price, 2000). In humans, cortico-cortical anatomical connections have mainly been investigated using diffusion magnetic resonance imaging (dMRI). Several studies found strong probabilities of structural connectivity between the anterior insula with the orbitofrontal cortex and dorsolateral part of the prefrontal cortex (Cloutman et al., 2012; Ghaziri et al., 2017), and between the lOFC and vmPFC (Heather Hsu et al., 2020). In addition, the statistical dependency (e.g. coherence) between the LFP of distant areas could be potentially explained by direct anatomical connections (Schneider et al., 2021). Taken together, the existence of an information transfer might rely on both direct or indirect structural connectivity. However, here we also reported differences of TE between rewarding and punishing trials given the same backbone anatomical connectivity (Fig. 4). Our results are further supported by a recent study involving drug-resistant epileptic patients with resected insula who showed poorer performance than healthy controls in case of risky loss compared to risky gains (Von Siebenthal et al., 2017).”

      References

      Carmichael ST, Price J. 1996. Connectional networks within the orbital and medial prefrontal cortex of macaque monkeys. J Comp Neurol 371:179–207.

      Cloutman LL, Binney RJ, Drakesmith M, Parker GJM, Lambon Ralph MA. 2012. The variation of function across the human insula mirrors its patterns of structural connectivity: Evidence from in vivo probabilistic tractography. NeuroImage 59:3514–3521. oi:10.1016/j.neuroimage.2011.11.016

      Combrisson E, Allegra M, Basanisi R, Ince RAA, Giordano BL, Bastin J, Brovelli A. 2022. Group-level inference of information-based measures for the analyses of cognitive brain networks from neurophysiological data. NeuroImage 258:119347. doi:10.1016/j.neuroimage.2022.119347

      Ghaziri J, Tucholka A, Girard G, Houde J-C, Boucher O, Gilbert G, Descoteaux M, Lippé S, Rainville P, Nguyen DK. 2017. The Corticocortical Structural Connectivity of the Human Insula. Cereb Cortex 27:1216–1228. doi:10.1093/cercor/bhv308

      Gueguen MCM, Lopez-Persem A, Billeke P, Lachaux J-P, Rheims S, Kahane P, Minotti L, David O, Pessiglione M, Bastin J. 2021. Anatomical dissociation of intracerebral signals for reward and punishment prediction errors in humans. Nat Commun 12:3344. doi:10.1038/s41467-021-23704-w

      Heather Hsu C-C, Rolls ET, Huang C-C, Chong ST, Zac Lo C-Y, Feng J, Lin C-P. 2020. Connections of the Human Orbitofrontal Cortex and Inferior Frontal Gyrus. Cereb Cortex 30:5830–5843. doi:10.1093/cercor/bhaa160

      Lachaux J-P, Fonlupt P, Kahane P, Minotti L, Hoffmann D, Bertrand O, Baciu M. 2007. Relationship between task-related gamma oscillations and BOLD signal: new insights from combined fMRI and intracranial EEG. Hum Brain Mapp 28:1368–1375. doi:10.1002/hbm.20352

      Mukamel R, Gelbard H, Arieli A, Hasson U, Fried I, Malach R. 2004. Coupling Between Neuronal Firing, Field Potentials, and fMRI in Human Auditory Cortex. Cereb Cortex 14:881.

      Niessing J, Ebisch B, Schmidt KE, Niessing M, Singer W, Galuske RA. 2005. Hemodynamic signals correlate tightly with synchronized gamma oscillations. science 309:948–951.

      Nir Y, Fisch L, Mukamel R, Gelbard-Sagiv H, Arieli A, Fried I, Malach R. 2007. Coupling between neuronal firing rate, gamma LFP, and BOLD fMRI is related to interneuronal correlations. Curr Biol 17:1275–1285.

      Öngür D, Price JL. 2000. The organization of networks within the orbital and medial prefrontal cortex of rats, monkeys and humans. Cereb Cortex 10:206–219.

      Schneider M, Broggini AC, Dann B, Tzanou A, Uran C, Sheshadri S, Scherberger H, Vinck M. 2021. A mechanism for inter-areal coherence through communication based on connectivity and oscillatory power. Neuron 109:4050-4067.e12. doi:10.1016/j.neuron.2021.09.037

      Schreiber T. 2000. Measuring information transfer. Phys Rev Lett 85:461.

      Von Siebenthal Z, Boucher O, Rouleau I, Lassonde M, Lepore F, Nguyen DK. 2017. Decision-making impairments following insular and medial temporal lobe resection for drug-resistant epilepsy. Soc Cogn Affect Neurosci 12:128–137. doi:10.1093/scan/nsw152

      Recommendations for the authors

      Reviewer #1

      (1) Overall, the writing of the manuscript is dense and makes it hard to follow the scientific logic and appreciate the key findings of the manuscript. I believe the manuscript would be accessible to a broader audience if the authors improved the writing and provided greater detail for their scientific questions, choice of analysis, and an explanation of their results in simpler terms.

      We extensively modified the introduction to better describe the rationale and research question.

      (2) In the introduction the authors state "we hypothesized that reward and punishment learning arise from complementary neural interactions between frontal cortex regions". This stated hypothesis arrives rather abruptly after a summary of the literature given that the literature summary does not directly inform their stated hypothesis. Put differently, the authors should explicitly state what the contradictions and/or gaps in the literature are, and what specific combinations of findings guide them to their hypothesis. When the authors state their hypothesis the reader is still left asking: why are the authors focusing on the frontal regions? What do the authors mean by complementary interactions? What specific evidence or contradiction in the literature led them to hypothesize that complementary interactions between frontal regions underlie reward and punishment learning?

      We extensively modified the introduction and provided a clearer description of the brain circuits involved and the rationale for searching redundant and synergistic interactions between areas.

      (3) Related to the above point: when the authors subsequently state "we tested whether redundancy- or synergy dominated interactions allow the emergence of collective brain networks differentially supporting reward and punishment learning", the Introduction (up to the point of this sentence) has not been written to explain the synergy vs. redundancy framework in the literature and how this framework comes into play to inform the authors' hypothesis on reward and punishment learning.

      We extensively modified the introduction and provided a clearer description of redundant and synergistic interactions between areas.

      (4) The explanation of redundancy vs synergy dominated brain networks itself is written densely and hard to follow. Furthermore, how this framework informs the question on the neural substrates of reward versus punishment learning is unclear. The authors should provide more precise statements on how and why redundancy vs. synergy comes into play in reward and punishment learning. Put differently, this redundancy vs. synergy framework is key for understanding the manuscript and the introduction is not written clearly enough to explain the framework and how it informs the authors' hypothesis and research questions on the neural substrates of reward vs. punishment learning.

      Same as above

      (5) While the choice of these four brain regions in context of reward and punishment learning does makes sense, the authors do not outline a clear scientific justification as to why these regions were selected in relation to their question.

      Same as above

      (6) Could the authors explain why they used gamma band power (as opposed to or in addition to the lower frequency bands) to investigate MI. Relatedly, when the authors introduce MI analysis, it would be helpful to briefly explain what this analysis measures and why it is relevant to address the question they are asking.

      Please see our answer to the first public comment. We added a paragraph to the discussion section to justify our choice of focusing on the gamma band only. We added the following sentence to the result section to justify our choice for using mutual-information:

      The MI allowed us to detect both linear and non-linear relationships between the gamma activity and the PE

      An extended explanation justifying our choice for the MI was already present in the method section.

      (7) The authors state that "all regions displayed a local "probabilistic" encoding of prediction errors with temporal dynamics peaking around 500 ms after outcome presentation". It would be helpful for the reader if the authors spelled out what they mean by probabilistic in this context as the term can be interpreted in many different ways.

      We agree with the reviewer that the term “probabilistic” can be interpreted in different ways. In the revised manuscript we changed “probabilistic” for “mixed”.

      (8) The authors should include a brief description of how they compute RPE and PPE in the beginning of the relevant results section.

      The explanation of how we estimated the PE is already present in the result section: “We estimated trial-wise prediction errors by fitting a Q-learning model to behavioral data. Fitting the model consisted in adjusting the constant parameters to maximize the likelihood of observed choices etc.”

      (9) It is unclear from the Methods whether the authors have taken any measures to address the likely difference in the number of electrodes across subjects. For example, it is likely that some subjects have 10 electrodes in vmPFC while others may have 20. In group analyses, if the data is simply averaged across all electrodes then each subject contributes a different number of data points to the analysis. Hence, a subject with more electrodes can bias the group average. A starting point would be to state the variation in number of electrodes across subjects per brain region. If this variation is rather small, then simple averaging across electrodes might be justified. If the variation is large then one idea would be to average data across electrodes within subjects prior to taking the group average or use a resampling approach where the minimum number of electrodes per brain area is subsampled.

      We addressed this point in our public answers. As a reminder, the new version of the manuscript contains a figure showing the number of unique patients per region, the PE at per participant level together with local-encoding at the single participant level.

      (10) One thing to consider is whether the reward and punishment in the task is symmetrical in valence. While 1$ increase and 1$ decrease is equivalent in magnitude, the psychological effect of the positive (vs. the negative) outcome may still be asymmetrical and the direction and magnitude of this asymmetry can vary across individuals. For instance, some subjects may be more sensitive to the reward (over punishment) while others are more sensitive to the punishment (over reward). In this scenario, it is possible that the differentiation observed in PPE versus RPE signals may arise from such psychological asymmetry rather than the intrinsic differences in how certain brain regions (and their interactions) may encode for reward vs punishment. Perhaps the authors can comment on this possibility, and/or conduct more in depth behavioral analysis to determine if certain subjects adjust their choice behavior faster in response to reward vs. punishment contexts.

      While it could be possible that individuals display different sensitivities vis-à-vis positive and negative prediction errors (and, indeed, a vast body of human reinforcement learning literature seems to point in this direction; Palminteri & Lebreton, 2022), it is unclear to us how such differences would explain into the recruitment of anatomically distinct areas reward and punishment prediction errors. It is important to note here that our design partially orthogonalized positive and reward vs. negative and punishment PEs, because the neutral outcome can generate both positive and negative prediction errors, as a function of the learning context (reward-seeking and punishment avoidance). Back to the main question, for instance, Lefebvre et al (2017) investigated with fMRI the neural correlates of reward prediction errors only and found that inter-individual differences in learning rates for positive and negative prediction errors correlated with differences in the degree of striatal activation and not with the recruitment of different areas. To sum up, while we acknowledge that individuals may display different sensitivity to prediction errors (and reward magnitudes), we believe that such differences should translated in difference in the degree of activation of a given system (the reward systems vs the punishment one) rather than difference in neural system recruitment

      (11) As summarized in Fig 6, the authors show that information transfer between aINS to dlPFC was PPE specific whereas the information transfer between vmPFC to lOFC was RPE specific. What is unclear is if these findings arise as an inevitable statistical byproduct of the fact that aINS has high PPE-specificity and that vmPFC has high RPE-specificity. In other words, it is possible that the analysis in Fig 3,4 are sensitive to fact that there is a larger proportion of electrodes with either PPE or RPE sensitivity in aINS and vmPFC respectively - and as such, the II analysis might reflect the dominant local encoding properties above and beyond reflecting the interactions between regions per se. Simply put, could the analysis in Fig 3B turn out in any other way given that there are more PPE specific electrodes in aINS and more RPE specific electrodes in vmPFC? Some options to address this question would be to limit the electrodes included in the analyses (in Fig 3B for example) so that each region has the same number of PPE and RPE specific electrodes included.

      Please see the simulation we added to the revised manuscript (Fig. S10) demonstrating that synergistic interactions can emerge between regions with the same specificity.

      Regarding the possibility that Fig. 3 and 4 are sensitive to the number of bipolar derivations being R/PPE specific, a counter-example is the vmPFC. The vmPFC has a few recordings specific to punishment (Fig. 2) in almost 30% of the subjects (Fig. S4). However, there is no II about the PPE between recordings of the vmPFC (Fig. 3). The same reasoning also holds for the lOFC. Therefore, the proportion of recordings being RPE or PPE-specific is not sufficient to determine the type of interactions.

      (12)  Related to the point above, what would the results presented in Fig 3A (and 3B) look like if the authors ran the analyses on RPE specific and PPE specific electrodes only. Is the vmPFC-vmPFC RPE effect in Fig 3A arising simply due to the high prevalence of RPE specific electrodes in vmPFC (as shown in Fig. 2)?

      Please see our answer above.

      Reviewer #2:

      Regarding Figure 2A, the authors argued that their findings "globally reproduced their previously published findings" (from Gueguen et al, 2021). It is worth noting though that in their original analysis, both aINS and lOFC show differential effects (aINS showing greater punishment compared to reward, and the opposite for lOFC) compared to the current analysis. Although I would be akin to believe that the nonlinear approach used here might explain part of the differences (as the authors discussed), I am very wary of the other argument advanced: "the removal of iEEG sites contaminated with pathological activity". This raised some red flags. Does that mean some of the conclusions observed in Gueguen et al (2021) are only the result of noise contamination, and therefore should be disregarded? The author might want to add a short supplementary figure using the same approach as in Gueguen (2021) but using the subset of contacts used here to comfort potential readers of the validity of their previous manuscript.

      We appreciate the reviewer's concerns and understand the request for additional information. However, we would like to point out that the figure suggested by the reviewer is already present in the supplementary files of Gueguen et al. 2021 (see Fig. S2). The results of this study should not be disregarded, as the supplementary figure reproduces the results of the main text after excluding sites with pathological activity. Including or excluding sites contaminated with epileptic activity does not have a significant impact on the results, as analyses are performed at each time-stamp and across trials, and epileptic spikes are never aligned in time across trials.

      That being said, there are some methodological differences between the two studies. To extract gamma power, Gueguen et al. filtered and averaged 10 Hz sub-bands, while we used multi-tapers. Additionally, they used a temporal smoothing of 250 ms, while we used less smoothing. However, as explained in the main text, we used information-theoretical approaches to capture the statistical dependencies between gamma power and PE. Despite divergent methodologies, we obtained almost identical results.

      The data and code supporting this manuscript should be made available. If raw data cannot be shared for ethical reasons, single-trial gamma activities should at least be provided. Regarding the code used to process the data, sharing it could increase the appeal (and use) of the methods applied.

      We thank the reviewer for this suggestion. We added a section entitled “Code and data availability” and gave links to the scripts, notebooks and preprocessed data.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      I appreciate the efforts the authors made to clarify and justify their statements and methodology, respectively. I additionally appreciate the efforts they made to provide me with detailed information - including figures - to aid my comprehension. However, there are two things I nevertheless recommend the authors to include in the main manuscript.

      (1) Statement about animal wellbeing: The authors state that they were constrained in their imaging session duration not because of a commonly reported technical limitation, such as photobleaching (which I honestly assumed), but rather the general wellbeing of the animals, who exhibited signs of distress after longer imaging periods. I find this to be a critical issue and perhaps the best argument against performing longer imaging experiments (which would have increased the number of trials, thus potentially boosting the performance of their model). To say that they put animal welfare above all other scientific and technical considerations speaks to a strong ethical adherence to animal welfare policy, and I believe this should be somehow incorporated into the methods.

      We have now included this at the top of page 26:

      “Mice fully recovered from the brief isoflurane anesthesia, showing a clear blinking reflex, whisking and sniffing behaviors and normal body posture and movements, immediately after head fixation. In our experimental conditions, mice were imaged in sessions of up to 25 min since beyond this time we started observing some signs of distress or discomfort. Thus, we avoided longer recording times at the expense of collecting larger trial numbers, in strong adherence of animal welfare and ethics policy. A pilot group of mice were habituated to the head fixed condition in daily 20 min sessions for 3 days, however we did not observe a marked contrast in the behavior of habituated versus unhabituated mice beyond our relatively short 25 min imaging sessions. In consequence imaging sessions never surpassed a maximum of 25 min, after which the mouse was returned to its home cage.”

      (2) Author response image 2: I sincerely thank the authors for providing us reviewers with this figure, which compares the performance of the naïve Bayesian classifier their ultimately use in the study with other commonly implemented models. Also here I falsely assumed that other models, which take correlated activity into account, did not generally perform better than their ultimate model of choice. Although dwelling on it would be distractive (and outside the primary scope of the study), I would encourage the authors to include it as a figure supplement (and simply mention these controls en passant when they justify their choice of the naïve Bayesian classifier).

      This figure was now included in the revised manuscript as supplemental figure 3.

      Page 10 now reads:

      “We performed cross-validated, multi-class classification of the single-trial population responses (decoding, Fig. 2A) using a naive Bayes classifier to evaluate the prediction errors as the absolute difference between the stimulus azimuth and the predicted azimuth (Fig. 2A). We chose this classification algorithm over others due to its generally good performance with limited available data. We visualized the cross-validated prediction error distribution in cumulative plots where the observed prediction errors were compared to the distribution of errors for random azimuth sampling (Fig. 2B). When decoding all simultaneously recorded units, the observed classifier output was not significantly better (shifted towards smaller prediction errors) than the chance level distribution (Fig. 2B). The classifier also failed to decode complete DCIC population responses recorded with neuropixels probes (Fig. 3A). Other classifiers performed similarly (Suppl. Fig. 3A).”

      The bottom paragraph in page 19 now reads:

      “To characterize how the observed positive noise correlations could affect the representation of stimulus azimuth by DCIC top ranked unit population responses, we compared the decoding performance obtained by classifying the single-trial response patterns from top ranked units in the modeled decorrelated datasets versus the acquired data (with noise correlations). With the intention to characterize this with a conservative approach that would be less likely to find a contribution of noise correlations as it assumes response independence, we relied on the naive Bayes classifier for decoding throughout the study. Using this classifier, we observed that the modeled decorrelated datasets produced stimulus azimuth prediction error distributions that were significantly shifted towards higher decoding errors (Fig. 6B, C) and, in our imaging datasets, were not significantly different from chance level (Fig. 6B). Altogether, these results suggest that the detected noise correlations in our simultaneously acquired datasets can help reduce the error of the IC population code for sound azimuth. We observed a similar, but not significant tendency with another classifier that does not assume response independence (KNN classifier), though overall producing larger decoding errors than the Bayes classifier (Suppl. Fig. 3B).”

      Reviewer #3 (Recommendations for the authors):

      I am generally happy with the response to the reviews.

      I find the Author response image 3 quite interesting. The neuropixel data looks somewhat like I expected (especially for mouse #3 and maybe mouse #4). I find the distribution of weights across units in the imaging dataset compared to in the pixel dataset intriguing (though it probably is just the dimensionality of the data being so much higher).

      I'm not too familiar with facial movements but is it the case that the DCIC would be more modulated by ipsilateral movement compared to contralateral movements? Are face movements in mice conjugate or do both sides of the face move more or less independently? If not it may be interesting in future work to record bilaterally and see if that provides more information about DCIC responses.

      We sincerely thank the editors and reviewers for their careful appraisal, commendation of our effort and helpful constructive feedback which greatly improved the presentation of our study. Below in green font is a point by point reply to the comments provided by the reviewers.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary: In this study, the authors address whether the dorsal nucleus of the inferior colliculus (DCIC) in mice encodes sound source location within the front horizontal plane (i.e., azimuth). They do this using volumetric two-photon Ca2+ imaging and high-density silicon probes (Neuropixels) to collect single-unit data. Such recordings are beneficial because they allow large populations of simultaneous neural data to be collected. Their main results and the claims about those results are the following:

      (1) DCIC single-unit responses have high trial-to-trial variability (i.e., neural noise);

      (2) approximately 32% to 40% of DCIC single units have responses that are sensitive to sound source azimuth;

      (3) single-trial population responses (i.e., the joint response across all sampled single units in an animal) encode sound source azimuth "effectively" (as stated in title) in that localization decoding error matches average mouse discrimination thresholds;

      (4) DCIC can encode sound source azimuth in a similar format to that in the central nucleus of the inferior colliculus (as stated in Abstract);

      (5) evidence of noise correlation between pairs of neurons exists;

      and (6) noise correlations between responses of neurons help reduce population decoding error.

      While simultaneous recordings are not necessary to demonstrate results #1, #2, and #4, they are necessary to demonstrate results #3, #5, and #6.

      Strengths:

      - Important research question to all researchers interested in sensory coding in the nervous system.

      - State-of-the-art data collection: volumetric two-photon Ca2+ imaging and extracellular recording using high-density probes. Large neuronal data sets.

      - Confirmation of imaging results (lower temporal resolution) with more traditional microelectrode results (higher temporal resolution).

      - Clear and appropriate explanation of surgical and electrophysiological methods. I cannot comment on the appropriateness of the imaging methods.

      Strength of evidence for claims of the study:

      (1) DCIC single-unit responses have high trial-to-trial variability - The authors' data clearly shows this.

      (2) Approximately 32% to 40% of DCIC single units have responses that are sensitive to sound source azimuth - The sensitivity of each neuron's response to sound source azimuth was tested with a Kruskal-Wallis test, which is appropriate since response distributions were not normal. Using this statistical test, only 8% of neurons (median for imaging data) were found to be sensitive to azimuth, and the authors noted this was not significantly different than the false positive rate. The Kruskal-Wallis test was not performed on electrophysiological data. The authors suggested that low numbers of azimuth-sensitive units resulting from the statistical analysis may be due to the combination of high neural noise and relatively low number of trials, which would reduce statistical power of the test. This may be true, but if single-unit responses were moderately or strongly sensitive to azimuth, one would expect them to pass the test even with relatively low statistical power. At best, if their statistical test missed some azimuthsensitive units, they were likely only weakly sensitive to azimuth. The authors went on to perform a second test of azimuth sensitivity-a chi-squared test-and found 32% (imaging) and 40% (e-phys) of single units to have statistically significant sensitivity. This feels a bit like fishing for a lower p-value. The Kruskal-Wallis test should have been left as the only analysis. Moreover, the use of a chi-squared test is questionable because it is meant to be used between two categorical variables, and neural response had to be binned before applying the test.

      The determination of what is a physiologically relevant “moderate or strong azimuth sensitivity” is not trivial, particularly when comparing tuning across different relays of the auditory pathway like the CNIC, auditory cortex, or in our case DCIC, where physiologically relevant azimuth sensitivities might be different. This is likely the reason why azimuth sensitivity has been defined in diverse ways across the bibliography (see Groh, Kelly & Underhill, 2003 for an early discussion of this issue). These diverse approaches include reaching a certain percentage of maximal response modulation, like used by Day et al. (2012, 2015, 2016) in CNIC, and ANOVA tests, like used by Panniello et al. (2018) and Groh, Kelly & Underhill (2003) in auditory cortex and IC respectively. Moreover, the influence of response variability and biases in response distribution estimation due to limited sampling has not been usually accounted for in the determination of azimuth sensitivity.

      As Reviewer #1 points out, in our study we used an appropriate ANOVA test (KruskalWallis) as a starting point to study response sensitivity to stimulus azimuth at DCIC. Please note that the alpha = 0.05 used for this test is not based on experimental evidence about physiologically relevant azimuth sensitivity but instead is an arbitrary p-value threshold. Using this test on the electrophysiological data, we found that ~ 21% of the simultaneously recorded single units reached significance (n = 4 mice). Nevertheless these percentages, in our small sample size (n = 4) were not significantly different from our false positive detection rate (p = 0.0625, Mann-Whitney, See Author response image 1).  In consequence, for both our imaging (Fig. 3C) and electrophysiological data, we could not ascertain if the percentage of neurons reaching significance in these ANOVA tests were indeed meaningfully sensitive to azimuth or this was due to chance.

      Author response image 1.

      Percentage of the neuropixels recorded DCIC single units across mice that showed significant median response tuning, compared to false positive detection rate (α = 0.05, chance level).

      We reasoned that the observed markedly variable responses from DCIC units, which frequently failed to respond in many trials (Fig. 3D, 4A), in combination with the limited number of trial repetitions we could collect, results in under-sampled response distribution estimations. This under-sampling can bias the determination of stochastic dominance across azimuth response samples in Kruskal-Wallis tests. We would like to highlight that we decided not to implement resampling strategies to artificially increase the azimuth response sample sizes with “virtual trials”, in order to avoid “fishing for a smaller p-value”, when our collected samples might not accurately reflect the actual response population variability.

      As an alternative to hypothesis testing based on ranking and determining stochastic dominance of one or more azimuth response samples (Kruskal-Wallis test), we evaluated the overall statistical dependency to stimulus azimuth of the collected responses.  To do this we implement the Chi-square test by binning neuronal responses into categories. Binning responses into categories can reduce the influence of response variability to some extent, which constitutes an advantage of the Chi-square approach, but we note the important consideration that these response categories are arbitrary.

      Altogether, we acknowledge that our Chi-square approach to define azimuth sensitivity is not free of limitations and despite enabling the interrogation of azimuth sensitivity at DCIC, its interpretability might not extend to other brain regions like CNIC or auditory cortex. Nevertheless we hope the aforementioned arguments justify why the Kruskal-Wallis test simply could not “have been left as the only analysis”.

      (3) Single-trial population responses encode sound source azimuth "effectively" in that localization decoding error matches average mouse discrimination thresholds - If only one neuron in a population had responses that were sensitive to azimuth, we would expect that decoding azimuth from observation of that one neuron's response would perform better than chance. By observing the responses of more than one neuron (if more than one were sensitive to azimuth), we would expect performance to increase. The authors found that decoding from the whole population response was no better than chance. They argue (reasonably) that this is because of overfitting of the decoder modeltoo few trials used to fit too many parameters-and provide evidence from decoding combined with principal components analysis which suggests that overfitting is occurring. What is troubling is the performance of the decoder when using only a handful of "topranked" neurons (in terms of azimuth sensitivity) (Fig. 4F and G). Decoder performance seems to increase when going from one to two neurons, then decreases when going from two to three neurons, and doesn't get much better for more neurons than for one neuron alone. It seems likely there is more information about azimuth in the population response, but decoder performance is not able to capture it because spike count distributions in the decoder model are not being accurately estimated due to too few stimulus trials (14, on average). In other words, it seems likely that decoder performance is underestimating the ability of the DCIC population to encode sound source azimuth.

      To get a sense of how effective a neural population is at coding a particular stimulus parameter, it is useful to compare population decoder performance to psychophysical performance. Unfortunately, mouse behavioral localization data do not exist. Therefore, the authors compare decoder error to mouse left-right discrimination thresholds published previously by a different lab. However, this comparison is inappropriate because the decoder and the mice were performing different perceptual tasks. The decoder is classifying sound sources to 1 of 13 locations from left to right, whereas the mice were discriminating between left or right sources centered around zero degrees. The errors in these two tasks represent different things. The two data sets may potentially be more accurately compared by extracting information from the confusion matrices of population decoder performance. For example, when the stimulus was at -30 deg, how often did the decoder classify the stimulus to a lefthand azimuth? Likewise, when the stimulus was +30 deg, how often did the decoder classify the stimulus to a righthand azimuth?

      The azimuth discrimination error reported by Lauer et al. (2011) comes from engaged and highly trained mice, which is a very different context to our experimental setting with untrained mice passively listening to stimuli from 13 random azimuths. Therefore we did not perform analyses or interpretations of our results based on the behavioral task from Lauer et al. (2011) and only made the qualitative observation that the errors match for discussion.

      We believe it is further important to clarify that Lauer et al. (2011) tested the ability of mice to discriminate between a positively conditioned stimulus (reference speaker at 0º center azimuth associated to a liquid reward) and a negatively conditioned stimulus (coming from one of five comparison speakers positioned at 20º, 30º, 50º, 70 and 90º azimuth, associated to an electrified lickport) in a conditioned avoidance task. In this task, mice are not precisely “discriminating between left or right sources centered around zero degrees”, making further analyses to compare the experimental design of Lauer et al (2011) and ours even more challenging for valid interpretation.

      (4) DCIC can encode sound source azimuth in a similar format to that in the central nucleus of the inferior colliculus - It is unclear what exactly the authors mean by this statement in the Abstract. There are major differences in the encoding of azimuth between the two neighboring brain areas: a large majority of neurons in the CNIC are sensitive to azimuth (and strongly so), whereas the present study shows a minority of azimuth-sensitive neurons in the DCIC. Furthermore, CNIC neurons fire reliably to sound stimuli (low neural noise), whereas the present study shows that DCIC neurons fire more erratically (high neural noise).

      Since sound source azimuth is reported to be encoded by population activity patterns at CNIC (Day and Delgutte, 2013), we refer to a population activity pattern code as the “similar format” in which this information is encoded at DCIC. Please note that this is a qualitative comparison and we do not claim this is the “same format”, due to the differences the reviewer precisely describes in the encoding of azimuth at CNIC where a much larger majority of neurons show stronger azimuth sensitivity and response reliability with respect to our observations at DCIC. By this qualitative similarity of encoding format we specifically mean the similar occurrence of activity patterns from azimuth sensitive subpopulations of neurons in both CNIC and DCIC, which carry sufficient information about the stimulus azimuth for a sufficiently accurate prediction with regard to the behavioral discrimination ability.

      (5) Evidence of noise correlation between pairs of neurons exists - The authors' data and analyses seem appropriate and sufficient to justify this claim.

      (6) Noise correlations between responses of neurons help reduce population decoding error - The authors show convincing analysis that performance of their decoder increased when simultaneously measured responses were tested (which include noise correlation) than when scrambled-trial responses were tested (eliminating noise correlation). This makes it seem likely that noise correlation in the responses improved decoder performance. The authors mention that the naïve Bayesian classifier was used as their decoder for computational efficiency, presumably because it assumes no noise correlation and, therefore, assumes responses of individual neurons are independent of each other across trials to the same stimulus. The use of decoder that assumes independence seems key here in testing the hypothesis that noise correlation contains information about sound source azimuth. The logic of using this decoder could be more clearly spelled out to the reader. For example, if the null hypothesis is that noise correlations do not carry azimuth information, then a decoder that assumes independence should perform the same whether population responses are simultaneous or scrambled. The authors' analysis showing a difference in performance between these two cases provides evidence against this null hypothesis.

      We sincerely thank the reviewer for this careful and detailed consideration of our analysis approach. Following the reviewer’s constructive suggestion, we justified the decoder choice in the results section at the last paragraph of page 18:

      “To characterize how the observed positive noise correlations could affect the representation of stimulus azimuth by DCIC top ranked unit population responses, we compared the decoding performance obtained by classifying the single-trial response patterns from top ranked units in the modeled decorrelated datasets versus the acquired data (with noise correlations). With the intention to characterize this with a conservative approach that would be less likely to find a contribution of noise correlations as it assumes response independence, we relied on the naive Bayes classifier for decoding throughout the study.

      Using this classifier, we observed that the modeled decorrelated datasets produced stimulus azimuth prediction error distributions that were significantly shifted towards higher decoding errors (Fig. 5B, C) and, in our imaging datasets, were not significantly different from chance level (Fig. 5B). Altogether, these results suggest that the detected noise correlations in our simultaneously acquired datasets can help reduce the error of the IC population code for sound azimuth.”

      Minor weakness:

      - Most studies of neural encoding of sound source azimuth are done in a noise-free environment, but the experimental setup in the present study had substantial background noise. This complicates comparison of the azimuth tuning results in this study to those of other studies. One is left wondering if azimuth sensitivity would have been greater in the absence of background noise, particularly for the imaging data where the signal was only about 12 dB above the noise. The description of the noise level and signal + noise level in the Methods should be made clearer. Mice hear from about 2.5 - 80 kHz, so it is important to know the noise level within this band as well as specifically within the band overlapping with the signal.

      We agree with the reviewer that this information is useful. In our study, the background R.M.S. SPL during imaging across the mouse hearing range (2.5-80kHz) was 44.53 dB and for neuropixels recordings 34.68 dB. We have added this information to the methods section of the revised manuscript.

      Reviewer #2 (Public Review):

      In the present study, Boffi et al. investigate the manner in which the dorsal cortex of the of the inferior colliculus (DCIC), an auditory midbrain area, encodes sound location azimuth in awake, passively listening mice. By employing volumetric calcium imaging (scanned temporal focusing or s-TeFo), complemented with high-density electrode electrophysiological recordings (neuropixels probes), they show that sound-evoked responses are exquisitely noisy, with only a small portion of neurons (units) exhibiting spatial sensitivity. Nevertheless, a naïve Bayesian classifier was able to predict the presented azimuth based on the responses from small populations of these spatially sensitive units. A portion of the spatial information was provided by correlated trial-to-trial response variability between individual units (noise correlations). The study presents a novel characterization of spatial auditory coding in a non-canonical structure, representing a noteworthy contribution specifically to the auditory field and generally to systems neuroscience, due to its implementation of state-of-the-art techniques in an experimentally challenging brain region. However, nuances in the calcium imaging dataset and the naïve Bayesian classifier warrant caution when interpreting some of the results.

      Strengths:

      The primary strength of the study lies in its methodological achievements, which allowed the authors to collect a comprehensive and novel dataset. While the DCIC is a dorsal structure, it extends up to a millimetre in depth, making it optically challenging to access in its entirety. It is also more highly myelinated and vascularised compared to e.g., the cerebral cortex, compounding the problem. The authors successfully overcame these challenges and present an impressive volumetric calcium imaging dataset. Furthermore, they corroborated this dataset with electrophysiological recordings, which produced overlapping results. This methodological combination ameliorates the natural concerns that arise from inferring neuronal activity from calcium signals alone, which are in essence an indirect measurement thereof.

      Another strength of the study is its interdisciplinary relevance. For the auditory field, it represents a significant contribution to the question of how auditory space is represented in the mammalian brain. "Space" per se is not mapped onto the basilar membrane of the cochlea and must be computed entirely within the brain. For azimuth, this requires the comparison between miniscule differences between the timing and intensity of sounds arriving at each ear. It is now generally thought that azimuth is initially encoded in two, opposing hemispheric channels, but the extent to which this initial arrangement is maintained throughout the auditory system remains an open question. The authors observe only a slight contralateral bias in their data, suggesting that sound source azimuth in the DCIC is encoded in a more nuanced manner compared to earlier processing stages of the auditory hindbrain. This is interesting, because it is also known to be an auditory structure to receive more descending inputs from the cortex.

      Systems neuroscience continues to strive for the perfection of imaging novel, less accessible brain regions. Volumetric calcium imaging is a promising emerging technique, allowing the simultaneous measurement of large populations of neurons in three dimensions. But this necessitates corroboration with other methods, such as electrophysiological recordings, which the authors achieve. The dataset moreover highlights the distinctive characteristics of neuronal auditory representations in the brain. Its signals can be exceptionally sparse and noisy, which provide an additional layer of complexity in the processing and analysis of such datasets. This will be undoubtedly useful for future studies of other less accessible structures with sparse responsiveness.

      Weaknesses:                                                                                               

      Although the primary finding that small populations of neurons carry enough spatial information for a naïve Bayesian classifier to reasonably decode the presented stimulus is not called into question, certain idiosyncrasies, in particular the calcium imaging dataset and model, complicate specific interpretations of the model output, and the readership is urged to interpret these aspects of the study's conclusions with caution.

      I remain in favour of volumetric calcium imaging as a suitable technique for the study, but the presently constrained spatial resolution is insufficient to unequivocally identify regions of interest as cell bodies (and are instead referred to as "units" akin to those of electrophysiological recordings). It remains possible that the imaging set is inadvertently influenced by non-somatic structures (including neuropil), which could report neuronal activity differently than cell bodies. Due to the lack of a comprehensive ground-truth comparison in this regard (which to my knowledge is impossible to achieve with current technology), it is difficult to imagine how many informative such units might have been missed because their signals were influenced by spurious, non-somatic signals, which could have subsequently misled the models. The authors reference the original Nature Methods article (Prevedel et al., 2016) throughout the manuscript, presumably in order to avoid having to repeat previously published experimental metrics. But the DCIC is neither the cortex nor hippocampus (for which the method was originally developed) and may not have the same light scattering properties (not to mention neuronal noise levels). Although the corroborative electrophysiology data largely eleviates these concerns for this particular study, the readership should be cognisant of such caveats, in particular those who are interested in implementing the technique for their own research.

      A related technical limitation of the calcium imaging dataset is the relatively low number of trials (14) given the inherently high level of noise (both neuronal and imaging). Volumetric calcium imaging, while offering a uniquely expansive field of view, requires relatively high average excitation laser power (in this case nearly 200 mW), a level of exposure the authors may have wanted to minimise by maintaining a low the number of repetitions, but I yield to them to explain.

      We assumed that the levels of heating by excitation light measured at the neocortex in Prevedel et al. (2016), were representative for DCIC also. Nevertheless, we recognize this approximation might not be very accurate, due to the differences in tissue architecture and vascularization from these two brain areas, just to name a few factors. The limiting factor preventing us from collecting more trials in our imaging sessions was that we observed signs of discomfort or slight distress in some mice after ~30 min of imaging in our custom setup, which we established as a humane end point to prevent distress. In consequence imaging sessions were kept to 25 min in duration, limiting the number of trials collected. However we cannot rule out that with more extensive habituation prior to experiments the imaging sessions could be prolonged without these signs of discomfort or if indeed influence from our custom setup like potential heating of the brain by illumination light might be the causing factor of the observed distress. Nevertheless, we note that previous work has shown that ~200mW average power is a safe regime for imaging in the cortex by keeping brain heating minimal (Prevedel et al., 2016), without producing the lasting damages observed by immunohistochemisty against apoptosis markers above 250mW (Podgorski and Ranganathan 2016, https://doi.org/10.1152/jn.00275.2016).

      Calcium imaging is also inherently slow, requiring relatively long inter-stimulus intervals (in this case 5 s). This unfortunately renders any model designed to predict a stimulus (in this case sound azimuth) from particularly noisy population neuronal data like these as highly prone to overfitting, to which the authors correctly admit after a model trained on the entire raw dataset failed to perform significantly above chance level. This prompted them to feed the model only with data from neurons with the highest spatial sensitivity. This ultimately produced reasonable performance (and was implemented throughout the rest of the study), but it remains possible that if the model was fed with more repetitions of imaging data, its performance would have been more stable across the number of units used to train it. (All models trained with imaging data eventually failed to converge.) However, I also see these limitations as an opportunity to improve the technology further, which I reiterate will be generally important for volume imaging of other sparse or noisy calcium signals in the brain.

      Transitioning to the naïve Bayesian classifier itself, I first openly ask the authors to justify their choice of this specific model. There are countless types of classifiers for these data, each with their own pros and cons. Did they actually try other models (such as support vector machines), which ultimately failed? If so, these negative results (even if mentioned en passant) would be extremely valuable to the community, in my view. I ask this specifically because different methods assume correspondingly different statistical properties of the input data, and to my knowledge naïve Bayesian classifiers assume that predictors (neuronal responses) are assumed to be independent within a class (azimuth). As the authors show that noise correlations are informative in predicting azimuth, I wonder why they chose a model that doesn't take advantage of these statistical regularities. It could be because of technical considerations (they mention computing efficiency), but I am left generally uncertain about the specific logic that was used to guide the authors through their analytical journey.

      One of the main reasons we chose the naïve Bayesian classifier is indeed because it assumes that the responses of the simultaneously recorded neurons are independent and therefore it does not assume a contribution of noise correlations to the estimation of the posterior probability of each azimuth. This model would represent the null hypothesis that noise correlations do not contribute to the encoding of stimulus azimuth, which would be verified by an equal decoding outcome from correlated or decorrelated datasets. Since we observed that this is not the case, the model supports the alternative hypothesis that noise correlations do indeed influence stimulus azimuth encoding. We wanted to test these hypotheses with the most conservative approach possible that would be least likely to find a contribution of noise correlations. Other relevant reasons that justify our choice of the naive Bayesian classifier are its robustness against the limited numbers of trials we could collect in comparison to other more “data hungry” classifiers like SVM, KNN, or artificial neuronal nets. We did perform preliminary tests with alternative classifiers but the obtained decoding errors were similar when decoding the whole population activity (Supplemental figure 3A). Dimensionality reduction following the approach described in the manuscript showed a tendency towards smaller decoding errors observed with an alternative classifier like KNN, but these errors were still larger than the ones observed with the naive Bayesian classifier (median error 45º). Nevertheless, we also observe a similar tendency for slightly larger decoding errors in the absence of noise correlations (decorrelated, Supplemental figure 3B). Sentences detailing the logic of classifier choice are now included in the results section at page 10 and at the last paragraph of page 18 (see responses to Reviewer 1).

      That aside, there remain other peculiarities in model performance that warrant further investigation. For example, what spurious features (or lack of informative features) in these additional units prevented the models of imaging data from converging?

      Considering the amount of variability observed throughout the neuronal responses both in imaging and neuropixels datasets, it is easy to suspect that the information about stimulus azimuth carried in different amounts by individual DCIC neurons can be mixed up with information about other factors (Stringer et al., 2019). In an attempt to study the origin of these features that could confound stimulus azimuth decoding we explored their relation to face movement (Supplemental Figure 2), finding a correlation to snout movements, in line with previous work by Stringer et al. (2019).

      In an orthogonal question, did the most spatially sensitive units share any detectable tuning features? A different model trained with electrophysiology data in contrast did not collapse in the range of top-ranked units plotted. Did this model collapse at some point after adding enough units, and how well did that correlate with the model for the imaging data?

      Our electrophysiology datasets were much smaller in size (number of simultaneously recorded neurons) compared to our volumetric calcium imaging datasets, resulting in a much smaller total number of top ranked units detected per dataset. This precluded the determination of a collapse of decoder performance due to overfitting beyond the range plotted in Fig 4G.

      How well did the form (and diversity) of the spatial tuning functions as recorded with electrophysiology resemble their calcium imaging counterparts? These fundamental questions could be addressed with more basic, but transparent analyses of the data (e.g., the diversity of spatial tuning functions of their recorded units across the population). Even if the model extracts features that are not obvious to the human eye in traditional visualisations, I would still find this interesting.

      The diversity of the azimuth tuning curves recorded with calcium imaging (Fig. 3B) was qualitatively larger than the ones recorded with electrophysiology (Fig. 4B), potentially due to the larger sampling obtained with volumetric imaging. We did not perform a detailed comparison of the form and a more quantitative comparison of the diversity of these functions because the signals compared are quite different, as calcium indicator signal is subject to non linearities due to Ca2+ binding cooperativity and low pass filtering due to binding kinetics. We feared this could lead to misleading interpretations about the similarities or differences between the azimuth tuning functions in imaged and electrophysiology datasets. Our model uses statistical response dependency to stimulus azimuth, which does not rely on features from a descriptive statistic like mean response tuning. In this context, visualizing the trial-to-trial responses as a function of azimuth shows “features that are not obvious to the human eye in traditional visualizations” (Fig. 3D, left inset).

      Finally, the readership is encouraged to interpret certain statements by the authors in the current version conservatively. How the brain ultimately extracts spatial neuronal data for perception is anyone's guess, but it is important to remember that this study only shows that a naïve Bayesian classifier could decode this information, and it remains entirely unclear whether the brain does this as well. For example, the model is able to achieve a prediction error that corresponds to the psychophysical threshold in mice performing a discrimination task (~30 {degree sign}). Although this is an interesting coincidental observation, it does not mean that the two metrics are necessarily related. The authors correctly do not explicitly claim this, but the manner in which the prose flows may lead a non-expert into drawing that conclusion.

      To avoid misleading the non-expert readers, we have clarified in the manuscript that the observed correspondence between decoding error and psychophysical threshold is explicitly coincidental.

      Page 13, end of middle paragraph:

      “If we consider the median of the prediction error distribution as an overall measure of decoding performance, the single-trial response patterns from subsamples of at least the 7 top ranked units produced median decoding errors that coincidentally matched the reported azimuth discrimination ability of mice (Fig 4G, minimum audible angle = 31º) (Lauer et al., 2011).”

      Page 14, bottom paragraph:

      “Decoding analysis (Fig. 4F) of the population response patterns from azimuth dependent top ranked units simultaneously recorded with neuropixels probes showed that the 4 top ranked units are the smallest subsample necessary to produce a significant decoding performance that coincidentally matches the discrimination ability of mice (31° (Lauer et al., 2011)) (Fig. 5F, G).”

      We also added to the Discussion sentences clarifying that a relationship between these two variables remains to be determined and it also remains to be determined if the DCIC indeed performs a bayesian decoding computation for sound localization.

      Page 20, bottom:

      “… Concretely, we show that sound location coding does indeed occur at DCIC on the single trial basis, and that this follows a comparable mechanism to the characterized population code at CNIC (Day and Delgutte, 2013). However, it remains to be determined if indeed the DCIC network is physiologically capable of Bayesian decoding computations. Interestingly, the small number of DCIC top ranked units necessary to effectively decode stimulus azimuth suggests that sound azimuth information is redundantly distributed across DCIC top ranked units, which points out that mechanisms beyond coding efficiency could be relevant for this population code.

      While the decoding error observed from our DCIC datasets obtained in passively listening, untrained mice coincidentally matches the discrimination ability of highly trained, motivated mice (Lauer et al., 2011), a relationship between decoding error and psychophysical performance remains to be determined. Interestingly, a primary sensory representations should theoretically be even more precise than the behavioral performance as reported in the visual system (Stringer et al., 2021).”

      Moreover, the concept of redundancy (of spatial information carried by units throughout the DCIC) is difficult for me to disentangle. One interpretation of this formulation could be that there are non-overlapping populations of neurons distributed across the DCIC that each could predict azimuth independently of each other, which is unlikely what the authors meant. If the authors meant generally that multiple neurons in the DCIC carry sufficient spatial information, then a single neuron would have been able to predict sound source azimuth, which was not the case. I have the feeling that they actually mean "complimentary", but I leave it to the authors to clarify my confusion, should they wish.

      We observed that the response patterns from relatively small fractions of the azimuth sensitive DCIC units (4-7 top ranked units) are sufficient to generate an effective code for sound azimuth, while 32-40% of all simultaneously recorded DCIC units are azimuth sensitive. In light of this observation, we interpreted that the azimuth information carried by the population should be redundantly distributed across the complete subpopulation of azimuth sensitive DCIC units.

      In summary, the present study represents a significant body of work that contributes substantially to the field of spatial auditory coding and systems neuroscience. However, limitations of the imaging dataset and model as applied in the study muddles concrete conclusions about how the DCIC precisely encodes sound source azimuth and even more so to sound localisation in a behaving animal. Nevertheless, it presents a novel and unique dataset, which, regardless of secondary interpretation, corroborates the general notion that auditory space is encoded in an extraordinarily complex manner in the mammalian brain.

      Reviewer #3 (Public Review):

      Summary: Boffi and colleagues sought to quantify the single-trial, azimuthal information in the dorsal cortex of the inferior colliculus (DCIC), a relatively understudied subnucleus of the auditory midbrain. They used two complementary recording methods while mice passively listened to sounds at different locations: a large volume but slow sampling calcium-imaging method, and a smaller volume but temporally precise electrophysiology method. They found that neurons in the DCIC were variable in their activity, unreliably responding to sound presentation and responding during inter-sound intervals. Boffi and colleagues used a naïve Bayesian decoder to determine if the DCIC population encoded sound location on a single trial. The decoder failed to classify sound location better than chance when using the raw single-trial population response but performed significantly better than chance when using intermediate principal components of the population response. In line with this, when the most azimuth dependent neurons were used to decode azimuthal position, the decoder performed equivalently to the azimuthal localization abilities of mice. The top azimuthal units were not clustered in the DCIC, possessed a contralateral bias in response, and were correlated in their variability (e.g., positive noise correlations). Interestingly, when these noise correlations were perturbed by inter-trial shuffling decoding performance decreased. Although Boffi and colleagues display that azimuthal information can be extracted from DCIC responses, it remains unclear to what degree this information is used and what role noise correlations play in azimuthal encoding.

      Strengths: The authors should be commended for collection of this dataset. When done in isolation (which is typical), calcium imaging and linear array recordings have intrinsic weaknesses. However, those weaknesses are alleviated when done in conjunction with one another - especially when the data largely recapitulates the findings of the other recording methodology. In addition to the video of the head during the calcium imaging, this data set is extremely rich and will be of use to those interested in the information available in the DCIC, an understudied but likely important subnucleus in the auditory midbrain.

      The DCIC neural responses are complex; the units unreliably respond to sound onset, and at the very least respond to some unknown input or internal state (e.g., large inter-sound interval responses). The authors do a decent job in wrangling these complex responses: using interpretable decoders to extract information available from population responses.

      Weaknesses:

      The authors observe that neurons with the most azimuthal sensitivity within the DCIC are positively correlated, but they use a Naïve Bayesian decoder which assume independence between units. Although this is a bit strange given their observation that some of the recorded units are correlated, it is unlikely to be a critical flaw. At one point the authors reduce the dimensionality of their data through PCA and use the loadings onto these components in their decoder. PCA incorporates the correlational structure when finding the principal components and constrains these components to be orthogonal and uncorrelated. This should alleviate some of the concern regarding the use of the naïve Bayesian decoder because the projections onto the different components are independent. Nevertheless, the decoding results are a bit strange, likely because there is not much linearly decodable azimuth information in the DCIC responses. Raw population responses failed to provide sufficient information concerning azimuth for the decoder to perform better than chance. Additionally, it only performed better than chance when certain principal components or top ranked units contributed to the decoder but not as more components or units were added. So, although there does appear to be some azimuthal information in the recoded DCIC populations - it is somewhat difficult to extract and likely not an 'effective' encoding of sound localization as their title suggests.

      As described in the responses to reviewers 1 and 2, we chose the naïve Bayes classifier as a decoder to determine the influence of noise correlations through the most conservative approach possible, as this classifier would be least likely to find a contribution of correlated noise. Also, we chose this decoder due to its robustness against limited numbers of trials collected, in comparison to “data hungry” non linear classifiers like KNN or artificial neuronal nets. Lastly, we observed that small populations of noisy, unreliable (do not respond in every trial) DCIC neurons can encode stimulus azimuth in passively listening mice matching the discrimination error of trained mice. Therefore, while this encoding is definitely not efficient, it can still be considered effective.

      Although this is quite a worthwhile dataset, the authors present relatively little about the characteristics of the units they've recorded. This may be due to the high variance in responses seen in their population. Nevertheless, the authors note that units do not respond on every trial but do not report what percent of trials that fail to evoke a response. Is it that neurons are noisy because they do not respond on every trial or is it also that when they do respond they have variable response distributions? It would be nice to gain some insight into the heterogeneity of the responses.

      The limited number of azimuth trial repetitions that we could collect precluded us from making any quantification of the unreliability (failures to respond) and variability in the response distributions from the units we recorded, as we feared they could be misleading. In qualitative terms, “due to the high variance in responses seen” in the recordings and the limited trial sampling, it is hard to make any generalization. In consequence we referred to the observed response variance altogether as neuronal noise. Considering these points, our datasets are publicly available for exploration of the response characteristics.

      Additionally, is there any clustering at all in response profiles or is each neuron they recorded in the DCIC unique?

      We attempted to qualitatively visualize response clustering using dimensionality reduction, observing different degrees of clustering or lack thereof across the azimuth classes in the datasets collected from different mice. It is likely that the limited number of azimuth trials we could collect and the high response variance contribute to an inconsistent response clustering across datasets.

      They also only report the noise correlations for their top ranked units, but it is possible that the noise correlations in the rest of the population are different.

      For this study, since our aim was to interrogate the influence of noise correlations on stimulus azimuth encoding by DCIC populations, we focused on the noise correlations from the top ranked unit subpopulation, which likely carry the bulk of the sound location information.  Noise correlations can be defined as correlation in the trial to trial response variation of neurons. In this respect, it is hard to ascertain if the rest of the population, that is not in the top rank unit percentage, are really responding and showing response variation to evaluate this correlation, or are simply not responding at all and show unrelated activity altogether. This makes observations about noise correlations from “the rest of the population” potentially hard to interpret.

      It would also be worth digging into the noise correlations more - are units positively correlated because they respond together (e.g., if unit x responds on trial 1 so does unit y) or are they also modulated around their mean rates on similar trials (e.g., unit x and y respond and both are responding more than their mean response rate). A large portion of trial with no response can occlude noise correlations. More transparency around the response properties of these populations would be welcome.

      Due to the limited number of azimuth trial repetitions collected, to evaluate noise correlations we used the non parametric Kendall tau correlation coefficient which is a measure of pairwise rank correlation or ordinal association in the responses to each azimuth. Positive rank correlation would represent neurons more likely responding together. Evaluating response modulation “around their mean rates on similar trials” would require assumptions about the response distributions, which we avoided due to the potential biases associated with limited sample sizes.

      It is largely unclear what the DCIC is encoding. Although the authors are interested in azimuth, sound location seems to be only a small part of DCIC responses. The authors report responses during inter-sound interval and unreliable sound-evoked responses. Although they have video of the head during recording, we only see a correlation to snout and ear movements (which are peculiar since in the example shown it seems the head movements predict the sound presentation). Additional correlates could be eye movements or pupil size. Eye movement are of particular interest due to their known interaction with IC responses - especially if the DCIC encodes sound location in relation to eye position instead of head position (though much of eye-position-IC work was done in primates and not rodent). Alternatively, much of the population may only encode sound location if an animal is engaged in a localization task. Ideally, the authors could perform more substantive analyses to determine if this population is truly noisy or if the DCIC is integrating un-analyzed signals.

      We unsuccessfully attempted eye tracking and pupillometry in our videos. We suspect that the reason behind this is a generally overly dilated pupil due to the low visible light illumination conditions we used which were necessary to protect the PMT of our custom scope.

      It is likely that DCIC population activity is integrating un-analyzed signals, like the signal associated with spontaneous behaviors including face movements (Stringer et al., 2019), which we observed at the level of spontaneous snout movements. However investigating if and how these signals are integrated to stimulus azimuth coding requires extensive behavioral testing and experimentation which is out of the scope of this study. For the purpose of our study, we referred to trial-to-trial response variation as neuronal noise. We note that this definition of neuronal noise can, and likely does, include an influence from un-analyzed signals like the ones from spontaneous behaviors.

      Although this critique is ubiquitous among decoding papers in the absence of behavioral or causal perturbations, it is unclear what - if any - role the decoded information may play in neuronal computations. The interpretation of the decoder means that there is some extractable information concerning sound azimuth - but not if it is functional. This information may just be epiphenomenal, leaking in from inputs, and not used in computation or relayed to downstream structures. This should be kept in mind when the authors suggest their findings implicate the DCIC functionally in sound localization.

      Our study builds upon previous reports by other independent groups relying on “causal and behavioral perturbations” and implicating DCIC in sound location learning induced experience dependent plasticity (Bajo et al., 2019, 2010; Bajo and King, 2012), which altogether argues in favor of DCIC functionality in sound localization.

      Nevertheless, we clarified in the discussion of the revised manuscript that a relationship between the observed decoding error and the psychophysical performance, or the ability of the DCIC network to perform Bayesian decoding computations, both remain to be determined (please see responses to Reviewer #2).

      It is unclear why positive noise correlations amongst similarly tuned neurons would improve decoding. A toy model exploring how positive noise correlations in conjunction with unreliable units that inconsistently respond may anchor these findings in an interpretable way. It seems plausible that inconsistent responses would benefit from strong noise correlations, simply by units responding together. This would predict that shuffling would impair performance because you would then be sampling from trials in which some units respond, and trials in which some units do not respond - and may predict a bimodal performance distribution in which some trials decode well (when the units respond) and poor performance (when the units do not respond).

      In samples with more that 2 dimensions, the relationship between signal and noise correlations is more complex than in two dimensional samples (Montijn et al., 2016) which makes constructing interpretable and simple toy models of this challenging. Montijn et al. (2016) provide a detailed characterization and model describing how the accuracy of a multidimensional population code can improve when including “positive noise correlations amongst similarly tuned neurons”. Unfortunately we could not successfully test their model based on Mahalanobis distances as we could not verify that the recorded DCIC population responses followed a multivariate gaussian distribution, due to the limited azimuth trial repetitions we could sample.

      Significance: Boffi and colleagues set out to parse the azimuthal information available in the DCIC on a single trial. They largely accomplish this goal and are able to extract this information when allowing the units that contain more information about sound location to contribute to their decoding (e.g., through PCA or decoding on top unit activity specifically). The dataset will be of value to those interested in the DCIC and also to anyone interested in the role of noise correlations in population coding. Although this work is first step into parsing the information available in the DCIC, it remains difficult to interpret if/how this azimuthal information is used in localization behaviors of engaged mice.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      Ma & Yang et al. report a new investigation aimed at elucidating one of the key nutrients S. Typhimurium (STM) utilizes with the nutrient-poor intracellular niche within macrophage, focusing on the amino acid beta-alanine. From these data, the authors report that beta-alanine plays important roles in mediating STM infection and virulence. The authors employ a multidisciplinary approach that includes some mouse studies, and ultimately propose a mechanism by which panD, involved in B-Ala synthesis, mediates regulation of zinc homeostatisis in Salmonella.

      Strengths and weaknesses:

      The results and model are adequately supported by the authors' data. Further work will need to be performed to learn whether the Zn2+ functions as proposed in their mechanism. By performing a small set of confirmatory experiments in S. Typhi, the authors provide some evidence of relevance to human infections.

      Impact:

      This work adds to the body of literature on the metabolic flexibility of Salmonella during infection that enable pathogenesis.

      Reviewer #1 (Recommendations for the authors):

      No further suggestions. The authors have adequately addressed my prior concerns through new data and revisions to the text.

      Thank you for considering this work. We appreciate your efforts in aiding us to improve our manuscript.

      Reviewer #3 (Public review):

      Summary:

      Salmonella is interesting due to its life within a compact compartment, which we call SCV or Salmonella containing vacuole in the field of Salmonella. SCV is a tight-fitting vacuole where the acquisition of nutrients is a key factor by Salmonella. The authors among many nutrients, focussed on beta-alanine. It is also known that Salmonella requires beta-alanine from many other studies. The authors have done in vitro RAW macrophage infection assays and In vivo mouse infection assays to see the life of Salmonella in the presence of beta-alanine. They concluded by comprehending that beta-alanine modulates the expression of many genes including zinc transporters which is required for pathogenesis.

      Strengths:

      Made a couple of knockouts in Salmonella and did transcriptomic to understand the global gene expression pattern

      Weaknesses:

      (1) Transport of Beta-alanine to SCV is not yet elucidated. Is it possible to determine whether the Zn transporter is involved in B-alanine transport?

      Thank you for the comment. Following your suggestion, we investigated the growth of Salmonella WT and the ∆znuA mutant cultured in N-minimal and M9 minimal medium, with β-alanine as the sole carbon source. We observed no significant difference in growth kinetics between the ∆znuA mutant and WT strain under either culture condition (please refer to Author response image 1). The results indicate that ZnuA is not involved in β-alanine transport in Salmonella.

      Author response image 1.

      (2) Beta-alanine can also be shuttled to form carnosine along with histidine. If beta-alanine is channelled to make more carnosine, then the virulence phenotypes may be very different.

      Our study reveals that β-alanine availability, whether obtained from the host or synthesized de novo via the panD-dependent pathway, is important for Salmonella pathogenesis. We have shown that β-alanine influences Salmonella intracellular replication and in vivo virulence partly by enhancing the expression of the zinc transporter genes.

      Although β-alanine can also be shuttled to form carnosine along with histidine in animals, the Salmonella genome lacks canonical carnosine synthase (CARNS) orthologs that catalyze the condensation of β-alanine and histidine into carnosine. Therefore, we believe that the carnosine biosynthetic pathway does not influence the virulence phenotypes of Salmonella.

      (3) Some amino acid transporters can be knocked out to see if beta-alanine uptake is perturbed. Like ArgT transport Arginine, and its mutation perturbs the uptake of beta-alanine. What is the beta-alanine concentration in the SCV? SCVS can be purified at different time points, and the Beta-alanine concentration can be measured

      Thank you for the comment. As suggested, we have investigated the role of other amino acid transporters in the uptake of β-alanine. In E. coli, GabP transports γ-aminobutyric acid (GABA), a structural analogue of β-alanine, and may also transport β-alanine (J Bacteriol. 2021, 203(4):e00642-20). Nevertheless, SalmonellagabP mutant displayed no growth defect in minimal medium with β-alanine as the sole carbon source (Figure 1_figure Supplement 7, Figure 1_figure Supplement 8), indicating that GabP is not involved in β-alanine uptake in Salmonella. Strikingly, the Δ_argT_ mutant—defective in arginine uptake—showed markedly decreased growth in the minimal medium with β-alanine as the sole carbon source (Figure 1F),suggesting that ArgT also transports β-alanine in Salmonella. We have added the results in the revised manuscript (lines 167-179).

      It has been reported that ArgT is essential for Salmonella replication within macrophages and full virulence in vivo (PloS one. 2010, 5(12):e15466). Given that ArgT is involved in both arginine and β-alanine uptake (as verified in this study), whether the attenuated virulence of the ∆argT mutant is due to a deficiency in β-alanine or arginine requires further investigation. We have also included a discussion on this issue (lines 409-415).

      In this work, to avoid delays and alterations in metabolite concentrations during the isolation of bacterial contents from macrophages, we directly assessed the combined metabolite concentrations within infected cells and Salmonella. It has been previously verified that these metabolites are primarily of host origin (Nat Commun. 2021, 12(1):879.). We noted a decrease in β-alanine levels in macrophages infected with Salmonella. The process of separating SCV is intricate and encompasses dissociation and sonication (Nat Commun. 2018, 9(1):2091). These steps may potentially result in alterations of metabolite concentrations during the separation procedure. Therefore, we did not measure the β-alanine concentration in the SCV.

      Reviewer #3 (Recommendations for the authors):

      The Authors have done meticulous experiments to address the questions asked by the reviewers. My one question of beta-alanine transport inside the SCV remains undone, though the authors have tried.

      Was Zinc transporter mutant checked? It is possible that the Zn transporter can take up Beta-alanine.

      Thank you for the comment. Following your suggestion, we investigated the growth of Salmonella WT and the ∆znuA mutant cultured in N-minimal and M9 minimal medium, with β-alanine as the sole carbon source. We observed no significant difference in growth kinetics between the ∆znuA mutant and WT strain under either culture condition (please refer to Author response image 1). The results indicate that ZnuA is not involved in β-alanine transport in Salmonella.

      Additionally, we have investigated the role of other amino acid transporters in the uptake of β-alanine and have ultimately identified that ArgT, the arginine transporter, is involved in the uptake of β-alanine in Salmonella (please refer to our previous response).

    1. Author Response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents potentially useful findings describing how activity in the corticotropin-releasing hormone neurons in the paraventricular nucleus of the hypothalamus modulates sevoflurane anesthesia, as well as a phenomenon the authors term a "general anesthetic stress response". The technical approaches are solid and the data presented are largely clear. However, the primary conclusion, that the PVHCRH neurons are a mechanism of sevoflurane anesthesia, is inadequately supported.

      We appreciate the editors and reviewers for their thorough assessment and constructive feedback. We have provided clarifications and updated the manuscripts to better interpret our results, please see below. As for the primary conclusion, we revised it as PVH CRH neurons potently modulate states of anaesthesia in sevoflurane general anesthesia, being a part of anaesthesia regulatory network of sevoflurane.

      Combined Public Review:

      This study describes a group of CRH-releasing neurons, located in the paraventricular nucleus of the hypothalamus, which, in mice, affects both the state of sevoflurane anesthesia and a grooming behavior observed after it. PVH-CRH neurons showed elevated calcium activity during the post-anesthesia period. Optogenetic activation of these PVH-CRH neurons during sevoflurane anesthesia shifts the EEG from burst-suppression to a seemingly activated state (an apparent arousal effect), although without a behavioral correlate. Chemogenetic activation of the PVH-CRH neurons delays sevoflurane-induced loss of righting reflex (another apparent arousal effect). On the other hand, chemogenetic inhibition of PVH-CRH neurons delays recovery of the righting reflex and decreases sevoflurane-induced stress (an apparent decrease in the arousal effect). The authors conclude that PVH-CRH neurons are a common substrate for sevoflurane-induced anesthesia and stress. The PVH-CRH neurons are related to behavioral stress responses, and the authors claim that these findings provide direct evidence for a relationship between sevoflurane anesthesia and sevoflurane-mediated stress that might exist even when there is no surgical trauma, such as an incision. In its current form, the article does not achieve its intended goal.

      Thank you for the detailed review. We have carefully considered your comments and have revised the manuscript to provide a clearer interpretation of our findings. Our findings indicate that PVH CRH neurons integrate the anesthetic effect and post-anesthesia stress response of sevoflurane (GA), providing new evidence for understanding the neuronal regulation of sevoflurane GA and identifying a potential brain target for further investigation into modulating the post-anesthesia stress response. However, we did not propose that there was a direct relationship between sevoflurane anesthesia and sevoflurane-mediated stress in the absence of incision. Our results mainly concluded that PVH CRH neurons integrate the anaesthetic effect and post-anaesthesia stress response of sevoflurane GA, which offers new evidence for the neuronal regulation of sevoflurane GA and provides an important but ignored potential cause of the post-anesthesia stress response.

      Strengths:

      The manuscript uses targeted manipulation of the PVH-CRH neurons, and is technically sound. Also, the number of experiments is substantial.

      Thank you.

      Weaknesses:

      The most significant weaknesses are a) the lack of consideration and measurement of GABAergic mechanisms of sevoflurane anesthesia, b) the failure to use another anesthetic as a control, c) a failure to document a compelling post-anesthesia stress response to sevoflurane in humans, d) limitations in the novelty of the findings. These weaknesses are related to the primary concerns described below:

      Concerns about the primary conclusion, that PVH-CRH neurons mediate "the anesthetic effects and post-anesthesia stress response of sevoflurane GA".

      Thanks for the advice. Our responses are as below:

      1) Just because the activity of a given neural cell type or neural circuit alters an anesthetic's response, this does not mean that those neurons play a role in how the anesthetic creates its anesthetic state. For example, sevoflurane is commonly used in children. Its primary mechanism of action is through enhancement of GABA-mediated inhibition. Children with ADHD on Ritalin (a dopamine reuptake inhibitor) who take it on the day of surgery can often require increased doses of sevoflurane to achieve the appropriate anesthetic state. The mesocortical pathway through which Ritalin acts is not part of the mechanism of action of sevoflurane. Through this pathway, Ritalin is simply increasing cortical excitability making it more challenging for the inhibitory effects of sevoflurane at GABAergic synapses to be effective. Similarly, here, altering the activity of the PVHCRH neurons and seeing a change in anesthetic response to sevoflurane does not mean that these neurons play a role in the fundamental mechanism of this anesthetic's action. With the current data set, the primary conclusions should be tempered.

      Thank you for your comments. Our results adequately uncover PVH CRH neurons that modulate the state of consciousness as well as the stress response in sevoflurane GA, but are insufficient to demonstrate that these neurons play a role in the underlying mechanism of sevoflurane anesthesia. We will revise our conclusions and make them concrete. The primary conclusion has been revised as PVH CRH neurons potently modulate states of anaesthesia in sevoflurane GA, being a part of the anaesthesia regulatory network of sevoflurane.

      2) It is important to compare the effects of sevoflurane with at least one other inhaled ether anesthetic. Isoflurane, desflurane, and enflurane are ether anesthetics that are very similar to each other, as well as being similar to sevoflurane. It is important to distinguish whether the effects of sevoflurane pertain to other anesthetics, or, alternatively, relate to unique idiosyncratic properties of this gas that may not be a part of its anesthetic properties.

      For example, one study cited by the authors (Marana et al.. 2013) concludes that there is weak evidence for differences in stress-related hormones between sevoflurane and desflurane, with lower levels of cortisol and ACTH observed during the desflurane intraoperative period. It is not clear that this difference in some stress-related hormones is modeled by post-sevoflurane excess grooming in the mice, but using desflurane as a control could help determine this.

      Thank you for your suggestions. We completely agree on the importance of determining whether the effects of sevoflurane apply to other anesthetics or arise from unique idiosyncratic attributes separate from its anesthetic properties. However, it is challenging to definitively conclude whether the effects of sevoflurane observed in our study extend to other inhaled anesthetics, even with desflurane as a control. While sevoflurane shares many common anesthetic properties with other inhalation agents, it also exhibits distinct characteristics and potential idiosyncrasies that set it apart from its counterparts. Regarding studies related to desflurane's impact on hormone levels or stress-like behaviors, one study involving 20 women scheduled for elective total abdominal hysterectomy demonstrated that there was no significant correlation between the intra-operative depth of anesthesia achieved with desflurane and the extent of the endocrine-metabolic stress response (as indicated by the concentrations of plasma cortisol, glucose, and lactate)1. Besides, a study conducted with mice suggested the abilities related to sensorimotor functions, anxiety and depression did not undergo significant changes after 7 days of anesthesia administered with 8.0% desflurane for 6 h2. Furthermore, a study involving 50 Caucasian women undergoing laparoscopic surgery for benign ovarian cysts demonstrated that in low stress surgery, desflurane, when compared to sevoflurane, exhibited superior control over the intraoperative cortisol and ACTH response 3. Based on these findings, we propose that the effect we observed in this study is likely attributed to the unique idiosyncratic properties of sevoflurane. We will conduct additional experiments to investigate this proposal with other commonly used anaesthetics in our future studies.

      Concerns about the clinical relevance of the experiments

      In anesthesiology practice, perioperative stress observed in patients is more commonly related to the trauma of the surgical intervention, with inadequate levels of antinociception or unconsciousness intraoperatively and/or poor post-operative pain control. The authors seem to be suggesting that the anesthetic itself is causing stress, but there is no evidence of this from human patients cited. We were not aware that this is a documented clinical phenomenon. It is important to know whether sevoflurane effectively produces behavioral stress in the recovery room in patients that could be related to the putative stress response (excess grooming) observed in mice. For example, in surgeries or procedures that required only a brief period of unconsciousness that could be achieved by administering sevoflurane alone (comparable to the 30 min administered to the mice), is there clinical evidence of post-operative stress?

      Thank you for your question. There is currently no direct evidence available. Studies on sevoflurane in humans primarily focus on its use during surgical interventions, making it difficult to find studies that solely administer sevoflurane, as was done in our study with mice. Generally, a short anesthesia time refers to procedures that last less than one hour, while a long anesthesia time could be considered for procedures lasting several hours or more4. A study published in eLife investigated the patterns of reemerging consciousness and cognitive function in 30 healthy adults who underwent GA for three hours 5. This finding suggests that the cognitive dysfunction observed immediately and persistently after GA in healthy animals may not necessarily apply anesthesia and postoperative neurocognitive disorders could be influenced by factors other than GA, such as surgery or patient comorbidity. Therefore, further studies are needed to verify the post-operative stress in sevoflurane-only short time anesthesia.

      Indeed, stress after surgeries can result from multiple factors aside from anesthesia, including pain, anxiety, inflammation, but what we want to illustrate in this study is that anesthesia could be one of these factors that we ignored in previous studies. In our current study, we did not propose that there was a direct relationship between sevoflurane anesthesia and sevoflurane-mediated stress without incision. We observed stress-related behavioural changes after exposure of sevoflurane GA in mouse model, indicating sevoflurane-mediated stress might exist without surgical trauma. Importantly, whether anesthetic administration alone will cause post-operative stress is worth studying in different species especially human.

      Patients who receive sevoflurane as the primary anesthetic do not wake up more stressed than if they had had one of the other GABAergic anesthetics. If there were signs of stress upon emergence (increased heart rate, blood pressure, thrashing movements) from general anesthesia, the anesthesiologist would treat this right away. The most likely cause of post-operative stress behaviors in humans is probably inadequate anti-nociception during the procedure, which translates into inadequate post-op analgesia and likely delirium. It is the case that children receiving sevoflurane do have a higher likelihood of post-operative delirium. Perhaps the authors' studies address a mechanism for delirium associated with sevoflurane, but this is not considered. Delirium seems likely to be the closest clinical phenomenon to what was studied.

      We agree with your idea. We aim to establish a connection between post-operative delirium in humans and stress-like behaviors observed in mice following sevoflurane anesthesia. Specifically, we have observed that the increased grooming behavior exhibited by mice after sevoflurane anesthesia resembles the fuzzy state of consciousness experienced during post-operative delirium6. In our discussion, we also emphasized the occurrence of sevoflurane-induced emergence agitation, a common phenomenon reported in clinical studies with an incidence of up to 80%. This state is characterized by hyperactivity, confusion, delirium, and emotional agitation 7,8. Meanwhile, in our experimental tests, namely the open field test (OFT) and elevated plus maze (EPM) test, we observed that mice exposed to sevoflurane inhalation displayed reduced movement distances during both the OFT and EPM tests (Figure 7G and I). These findings suggest a decline in behavioral activity similar to what is observed in cases of delirium.

      Concerns about the novelty of the findings

      CRH is associated with arousal in numerous studies. In fact, the authors' own work, published in eLife in 2021, showed that stimulating the hypothalamic CRH cells leads to arousal and their inhibition promotes hypersomnia. In both papers, the authors use fos expression in CRH cells during a specific event to implicate the cells, then manipulate them and measure EEG responses. In the previous work, the cells were active during wakefulness; here- they were active in the awake state that follows anesthesia (Figure 1). Thus, the findings in the current work are incremental.

      Thank you for acknowledging our previous work focusing on the changes in the sleep-wake state of mice when PVH CRH neurons are manipulated. In this study, our primary objective was to identify the neuronal mechanisms mediating the anesthetic effects and post-anesthetic stress response of sevoflurane GA. While our study claims that activation of PVH CRH neurons leads to arousal, it provides evidence that PVH CRH neurons may play a role in the regulation of conscious states in GA. Our current findings uncover that PVH CRH neurons modulate the state of consciousness as well as the stress response in sevoflurane GA, and that the modulation of PVH CRH neurons bidirectionally altered the induction and recovery of sevoflurane GA. This identifies a new brain region involved in sevoflurane GA that goes beyond the arousal-related regions.

      The activation of CRH cells in PVN has already been shown to result in grooming by Jaideep Bains (cited as reference 58). Thus, the involvement of these cells in this behavior is expected. The authors perform elaborate manipulations of CRH cells and numerous analyses of grooming and related behaviors. For example, they compare grooming and paw licking after anesthesia with those after other stressors such as forced swim, spraying mice with water, physical attack, and restraint. However, the relevance of these behaviors to humans and generalization to other types of anesthetics is not clear.

      The hyperactivity of PVH CRH neurons and behavior (e.g., excessive self-grooming) in mice may partially mirror the observed agitation and underlying mechanisms during emergence from sevoflurane GA in patients. As mentioned in the Discussion section (page 16, lines 371-374), sevoflurane-induced emergence agitation represents a prevalent manifestation of the post-anesthesia stress response. It is frequently observed, with an incidence of up to 80% in clinical reports, and is characterized by hyperactivity, confusion, delirium, and emotional agitation7,8. Our aim in this study is to distinguish the excessive stress responses of patients to sevoflurane GA from stress triggered by other factors. Other stimuli, such as forced swimming, can be considered sources of both physical and emotional stress, which are associated with depression and anxiety in humans.

      Regarding generalization to other types of anesthetics, we propose that the stress-related behavioral effects observed in this study might occur in cases of the administration of certain types of anesthetics. For example, one study showed that intravenous ketamine infusion (10 mg/kg, 2 hours) elevated plasma corticosterone and progesterone levels in rats, reducing locomotor activity (sedation) 9. The administration of intravenous anesthesia with propofol combined with sevoflurane caused greater postoperative stress than the single use of propofol10. However, desflurane, a common inhaled ether anesthetic, when compared to sevoflurane, was associated with better control of intraoperative cortisol and ACTH response in low-stress surgeries8. Thus, these behaviors observed after exposure to sevoflurane GA may be related to the post-anesthesia stress response in humans, which might also occur in cases of the administration of certain types of anesthetics.

      Recommendations for the authors:

      Reviewer 1

      1) The CRH-Cre mouse line should be validated. There are several lines of these mice, and their fidelity varies.

      The CRH-Cre mouse line we used in this study is from The Jackson Laboratory (https://www.jax.org/strain/012704) with the name B6(Cg)-Crhtm1(cre)Zjh/J (Strain #: 012704). These CRH-ires-CRE knock-in mice have Cre recombinase expression directed to CRH positive neurons by the endogenous promoter/enhancer elements of the corticotropin releasing hormone locus (Crh). We have done standard PCR to validate the mouse line following genotyping protocols provided by the Jackson Laboratory. The protocol primers were: 10574 (SEQUENCE 5' → 3': CTT ACA CAT TTC GTC CTA GCC); 10575 (SEQUENCE 5' → 3': CAC GAC CAG GCT GCG GCT AAC); 10576 (SEQUENCE 5' → 3': CAA TGT ATC TTA TCA TGT CTG GAT CC). The 468-bp CRH-specific PCR product was amplified in mutant (CRH-Cre+/+) mice; in heterozygote (CRH-Cre+/-) mice, both the 468-bp and the 676-bp PCR products were detected; in wild type (WT) mice, only the 676-bp WT allele-specific PCR product was amplified. An example of PCR results is presented below. The heterozygote and mutant mice were included in our study.

      Author response image 1.

      1. It would be very helpful to validate the CRH antibody. Using any antiserum at 1:800 suggests that it may not be potent or highly specific.

      As requested, we used the same CRH antibody at a concentration of 1:800, following the methods described in the Method section. The results are displayed below.

      Author response image 2.

      1. In Figure 1C, the control sections are out of focus, any cells are blurry, reducing confidence in the analyses (locus ceruleus cells appear confluent in the control?)

      Sorry for the confusing figure and we have revised the control section part of Figure 1C:

      Author response image 3.

      Reviewer 2

      1) In the Abstract, to say that "General anesthetics benefit patients undergoing surgeries without consciousness. ..." is a gross understatement of the essential role that general anesthesia plays today to make surgery not only tolerable but humane. This opening sentence should be rewritten. General anesthesia is a fundamental process required to undertake safely and humanely a high fraction of surgeries and invasive diagnostic procedures.

      As requested, we rewrote this opening sentence, please see the follows:

      GA is a fundamental process required to undertake surgeries and invasive diagnostic procedures safely and humanely. However, the undesired stress response associated with GA can lead to delayed recovery and even increased morbidity in clinical settings.

      2) In the Abstract, when discussing the response of the PVN-CRH neurons to chemogenetic inhibition, say exactly what the "opposite effect" is.

      Thanks for your insights. We have rewritten our abstract as follows:

      Chemogenetic activation of these neurons delayed the induction and accelerated emergence from sevoflurane GA, whereas chemogenetic inhibition of PVH CRH neurons promoted induction and prolonged emergence from sevoflurane GA.

      3) In all spectrograms the dynamic range is compressed between 0.5 and 1. Please make use of the full range, as some details might be missed because of this compression.

      We are sorry for the incorrect unit of the spectrograms. We have provided the correct one with full range, please see below:

      Author response image 4.

      Author response image 5.

      4) The spectrogram in Figure 2D has several frequency chirps that do not seem physiological.

      Thank you for your comments. The frequency chips of the spectrogram during the During and Post 1 phase were caused by recording noises. To avoid confusion, we have deleted the spectrogram in Figure 2D.

      5) The 3D plots in Figures 3G and H are not helpful. Thanks for the comment. We'd like to keep the 3D plots as they aid visual comparison of three different features of grooming, which complements other panels in Figure 3.

      6) The spectrograms in Figures 5A and B are too small, while the spectra in Figures 5C and D are too large. Please invert this relationship, as it is interesting and important to see the details in the spectrograms. The same happens in Figure 6.

      We adjusted the layout of the Figure 5 and Figure 6 as requested, please see below:

      Author response image 6.

      Author response image 7.

      7) In Figure 6H, the authors compute the burst-suppression ratio during a period that seemingly has no bursts or suppressions (Figure 6B).

      The burst-suppression ratio was computed from data with the minimum duration of burst and suppression periods set at 0.5 s. Sorry for the confusion. We added a new supplementary figure (Figure 6-figure supplement 8) displaying a 40-second EEG with a burst suppression period to better visualize the burst suppression.

      Author response image 8.

      8) The data analyses are done in terms of p-values. They should be reported as confidence intervals so that any effect the authors wish to establish is measured along with its uncertainty.

      Thank you for your valuable suggestions regarding our manuscript. We appreciate your thoughtful consideration of our work. We understand your concern but we would like to provide some justification for our choice of reporting p-values and explain why we believe they are appropriate for our study. First, the use of p-values for hypothesis testing and significance assessment is a common practice in our field. Many previous studies in our area of research also report results in terms of p-values. For example, Wei Xu11 published in 2020 suggested sevoflurane inhibits MPB neurons through postsynaptic GABAA-Rs and background potassium channels, Ao Y12 demonstrated that activation of the TH:LC-PVT projections is helpful in facilitating the transition from isoflurane anesthesia to an arousal state, using P-value as data analyses. By adhering to this convention, we ensure that our findings are consistent with the existing body of literature. This makes it easier for readers to compare and integrate our results with previous work. Secondly, while confidence intervals can provide a measure of effect size and uncertainty, p-values offer a concise way to communicate statistical significance. They help readers quickly assess whether an effect is statistically significant or not, which is often the primary concern when interpreting research findings. We hope that by providing these reasons for our choice of reporting p-values, we can address your concern while maintaining the integrity and consistency of our study. If you believe there are specific instances where reporting confidence intervals would be more informative, please feel free to highlight those, and we will consider your suggestion on a case-by-case basis. 

      References

      1. Baldini, G., Bagry, H. & Carli, F. Depth of anesthesia with desflurane does not influence the endocrine-metabolic response to pelvic surgery. Acta Anaesthesiol Scand 52, 99-105, doi:10.1111/j.1399-6576.2007.01470.x (2008).
      2. Niikura, R. et al. Exploratory analyses of postanesthetic effects of desflurane using behavioral test battery of mice. Behav Pharmacol 31, 597-609, doi:10.1097/fbp.0000000000000567 (2020).
      3. Marana, E. et al. Desflurane versus sevoflurane: a comparison on stress response. Minerva Anestesiol 79, 7-14 (2013).
      4. Vutskits, L. & Xie, Z. Lasting impact of general anaesthesia on the brain: mechanisms and relevance. Nat Rev Neurosci 17, 705-717, doi:10.1038/nrn.2016.128 (2016).
      5. Mashour, G. A. et al. Recovery of consciousness and cognition after general anesthesia in humans. Elife 10, doi:10.7554/eLife.59525 (2021).
      6. Mattison, M. L. P. Delirium. Ann Intern Med 173, Itc49-itc64, doi:10.7326/aitc202010060 (2020).
      7. Dahmani, S. et al. Pharmacological prevention of sevoflurane- and desflurane-related emergence agitation in children: a meta-analysis of published studies. Br J Anaesth 104, 216-223, doi:10.1093/bja/aep376 (2010).
      8. Lim, B. G. et al. Comparison of the incidence of emergence agitation and emergence times between desflurane and sevoflurane anesthesia in children: A systematic review and meta-analysis. Medicine (Baltimore) 95, e4927, doi:10.1097/MD.0000000000004927 (2016).
      9. Radford, K. D. et al. Association between intravenous ketamine-induced stress hormone levels and long-term fear memory renewal in Sprague-Dawley rats. Behav Brain Res 378, 112259, doi:10.1016/j.bbr.2019.112259 (2020).
      10. Yang, L., Chen, Z. & Xiang, D. Effects of intravenous anesthesia with sevoflurane combined with propofol on intraoperative hemodynamics, postoperative stress disorder and cognitive function in elderly patients undergoing laparoscopic surgery. Pak J Med Sci 38, 1938-1944, doi:10.12669/pjms.38.7.5763 (2022).
      11. Xu, W. et al. Sevoflurane depresses neurons in the medial parabrachial nucleus by potentiating postsynaptic GABA(A) receptors and background potassium channels. Neuropharmacology 181, 108249, doi:10.1016/j.neuropharm.2020.108249 (2020).
      12. Ao, Y. et al. Locus Coeruleus to Paraventricular Thalamus Projections Facilitate Emergence From Isoflurane Anesthesia in Mice. Front Pharmacol 12, 643172, doi:10.3389/fphar.2021.643172 (2021).
    1. Author response:

      The following is the authors’ response to the current reviews.

      eLife assessment:

      The manuscript establishes a sophisticated mouse model for acute retinal artery occlusion (RAO) by combining unilateral pterygopalatine ophthalmic artery occlusion (UPOAO) with a silicone wire embolus and carotid artery ligation, generating ischemia-reperfusion injury upon removal of the embolus. This clinically relevant model is useful for studying the cellular and molecular mechanisms of RAO. The data overall are solid, presenting a novel tool for screening pathogenic genes and promoting further therapeutic research in RAO.

      Thank you for your thorough evaluation. We are pleased that you find our mouse model for acute retinal artery occlusion to be sophisticated and clinically relevant. Your recognition of the model’s utility in studying the cellular and molecular mechanisms of RAO, as well as its potential for advancing therapeutic research, is highly encouraging and underscores the significance of our work. We are grateful for your supportive feedback.

      Public Reviews:

      Reviewer #1:

      Summary:

      Wang, Y. et al. used a silicone wire embolus to definitively and acutely clot the pterygopalatine ophthalmic artery in addition to carotid artery ligation to completely block blood supply to the mouse inner retina, which mimic clinical acute retinal artery occlusion. A detailed characterization of this mouse model determined the time course of inner retina degeneration and associated functional deficits, which closely mimic human patients. Whole retina transcriptome profiling and comparison revealed distinct features associated with ischemia, reperfusion, and different model mechanisms. Interestingly and importantly, this team found a sequential event including reperfusion-induced leukocyte infiltration from blood vessels, residual microglial activation, and neuroinflammation that may lead to neuronal cell death.

      Strengths:

      Clear demonstration of the surgery procedure with informative illustrations, images, and superb surgical videos.

      Two time points of ischemia and reperfusion were studied with convincing histological and in vivo data to demonstrate the time course of various changes in retinal neuronal cell survivals, ERG functions, and inner/outer retina thickness.

      The transcriptome comparison among different retinal artery occlusion models provides informative evidence to differentiate these models.

      The potential applications of the in vivo retinal ischemia-reperfusion model and relevant readouts demonstrated by this study will certainly inspire further investigation of the dynamic morphological and functional changes of retinal neurons and glial cell responses during disease progression and before and after treatments.

      We sincerely appreciate your detailed and positive feedback. These evaluations are invaluable in highlighting the significance and impact of our work. Thank you for your thoughtful and supportive review.

      Weaknesses:

      The revised manuscript has been significantly improved in clarity and readability. It has addressed all my questions convincingly.

      Thank you for your positive feedback. We are pleased to hear that the revisions have significantly improved the manuscript's clarity and readability, and that we have convincingly addressed all your questions. Your encouraging words are of great importance to us.

      Reviewer #2 (Public Review):

      Summary:

      The authors of this manuscript aim to develop a novel animal model to accurately simulate the retinal ischemic process in retinal artery occlusion (RAO). A unilateral pterygopalatine ophthalmic artery occlusion (UPOAO) mouse model was established using silicone wire embolization combined with carotid artery ligation. This manuscript provided data to show the changes of major classes of retinal neural cells and visual dysfunction following various durations of ischemia (30 minutes and 60 minutes) and reperfusion (3 days and 7 days) after UPOAO. Additionally, transcriptomics was utilized to investigate the transcriptional changes and elucidate changes in the pathophysiological process in the UPOAO model post-ischemia and reperfusion. Furthermore, the authors compared transcriptomic differences between the UPOAO model and other retinal ischemic-reperfusion models, including HIOP and UCCAO, and revealed unique pathological processes.

      Strengths:

      The UPOAO model represents a novel approach for studying retinal artery occlusion. The study is very comprehensive.

      Thank you for your positive feedback. We are delighted that you find the UPOAO model to be a novel and comprehensive approach to studying retinal artery occlusion. Your recognition of the depth and significance of our study is highly valuable and encourages us in our ongoing research.

      Weaknesses:

      Originally, some statements were incorrect and confusing. However, the authors have made clarifications in the revised manuscript to avoid confusion.

      We sincerely appreciate your meticulous review of the manuscript. We have thoroughly addressed the inaccuracies identified in the revised version. Additionally, we have polished the article to ensure improved readability. We apologize for any confusion caused by these inaccuracies and genuinely. We appreciate your careful attention to detail, and your patience and meticulous suggestions have significantly improved the clarity and readability of our manuscript.


      The following is the authors’ response to the original reviews.

      Recommendations for the authors:

      Reviewer #1:

      The revised manuscript has been significantly improved in clarity and readability. It has addressed all my questions convincingly.

      Thank you for your positive feedback. We are pleased to hear that the revisions have significantly improved the manuscript's clarity and readability, and that we have convincingly addressed all your questions. Your encouraging words are of great importance to us.

      Reviewer #2:

      The authors have revised the manuscript and/or provided answers to the majority of prior comments, which have helped to strengthen the work. However, addressing the following concerns is still necessary to further improve the manuscript.

      Thank you for acknowledging our revisions and the improvements made to the manuscript. We appreciate your continued feedback and will address the remaining concerns to further enhance the quality of our work.

      The quantification method of RGCs is described in detail in the response letter, but this detailed methodology was not included in the revised manuscript to clarify the quantification process.

      Thank you for your helpful recommendations. We have added detailed methodology in the revised manuscript to clarify the quantification process (line 180-188).

      The graphs in Fig. 3D b-wave and Fig. 3E-b wave are duplicated.

      We apologize for the error in our figures. We have corrected the mistake by replacing the duplicated image in Fig. 3E-b wave with the correct one (line 880). Your careful observation has been very helpful in improving our manuscript. Thank you for bringing this to our attention.

      The quantifications of the thickness of retinal layers in HE-stained sections in Figure 4 (IPL) and Response Figure 2 are incorrect. For mice retina, the thickness of the IPL is approximately 50 µm.

      Thank you for your meticulous review of the manuscript. We have rectified the inaccuracies in the quantification of retinal layer thickness in HE-stained sections in Figure 4, addressing the initial issue with the scale bar.

      We consulted with a microscope engineer and used a microscope microscale to calibrate the scale of the fluorescence microscope (BX63; Olympus, Tokyo, Japan) at the suggestion of the engineer.

      We recount the thickness of all layers of the HE-stained retinal section (line 902). The inner retina thickness in Figure 4 has been adjusted under a new scale bar, and the thickness of the outer retinal layers is now displayed in

      Author response image 1. However, the IPL thickness of the sham eye in the UPOAO model is still not aligned with the common thickness of 50 µm. Therefore we review the literature within our laboratory, focusing on C57BL/6 mice from the same source, revealed that the inner retina thickness (GCC+INL) in the HE-stained sections of the sham eye in the UPOAO model (around 80 µm) is consistent with previous findings (see Author response image 2) conducted by Kaibao Ji and published in Experimental Eye Research in 2021 [1].

      We captured and analyzed the average retinal thickness of each layer over a long range of 200-1100 μm from the optic nerve head (see Author response image 3, highlighted by the green line). The field region has been corrected in the revised manuscript (line 232). Considering the significant variation in retinal thickness from the optic nerve to the periphery, we consulted literature on multi-point measurements of HE-stained retinas. The average thickness of the GCC layer in the control group was approximately 57 µm at 600 µm from the optic nerve head and about 48 µm at 1200 µm from the optic nerve head in the literature [2] (see Author response image 4). The GCC layer thickness of the sham eye in the UPOAO model is around 50 µm, in alignment with existing literature. In future studies, we will pay more attention to the issue of thickness averaging.

      We appreciate your thorough review and valuable feedback, which has enabled us to correct errors and enhance the accuracy of our research.

      Author response image 1.

      Thickness of OPL, ONL, IS/OS+RPE in HE staining. n=3; ns: no significance (p>0.05).

      Author response image 2.

      Cited from Ji, K., et al., Resveratrol attenuates retinal ganglion cell loss in a mouse model of retinal ischemia reperfusion injury via multiple pathways. Experimental Eye Research, 2021. 209: p. 108683.

      Author response image 3.

      Schematic diagram illustrating the selection of regions. The figure was captured using a fluorescence microscope (BX63; Olympus, Tokyo, Japan) under a 4X objective. Scale bar=500 µm.

      Author response image 4.

      Cited from Feng, L., et al., Ripa-56 protects retinal ganglion cells in glutamate-induced retinal excitotoxic model of glaucoma. Sci Rep, 2024. 14(1): p. 3834.

      There are some typos in the summary table. For example: 'Amplitudes of a-wave (0.3, 2.0, and 10.0 cd.s/m²)' should be 'Amplitudes of a-wave (0.3, 3.0, and 10.0 cd.s/m²)'; and 'IINL thickness' in HE' should be 'INL thickness'.

      Thank you for pointing out the typos in the summary table (line 1073). We have corrected 'Amplitudes of a-wave (0.3, 2.0, and 10.0 cd.s/m²)' to 'Amplitudes of a-wave (0.3, 3.0, and 10.0 cd.s/m²)' and 'IINL thickness' to 'INL thickness'. Your attention to detail is greatly appreciated and has been very helpful in improving our manuscript.

      References

      (1) Ji, K., et al., Resveratrol attenuates retinal ganglion cell loss in a mouse model of retinal ischemia reperfusion injury via multiple pathways. Experimental Eye Research, 2021. 209: p. 108683.

      (2) Feng, L., et al., Ripa-56 protects retinal ganglion cells in glutamate-induced retinal excitotoxic model of glaucoma. Sci Rep, 2024. 14(1): p. 3834.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      (1) Authors' experimental designs have some caveats to definitely support their claims. Authors claimed that aged LT-HSCs have no myeloid-biased clone expansion using transplantation assays. In these experiments, authors used 10 HSCs and young mice as recipients. Given the huge expansion of old HSC by number and known heterogeneity in immunophenotypically defined HSC populations, it is questionable how 10 out of so many old HSCs (an average of 300,000 up to 500,000 cells per mouse; Mitchell et al., Nature Cell Biology, 2023) can faithfully represent old HSC population. The Hoxb5+ old HSC primary and secondary recipient mice data (Fig. 2C and D) support this concern. In addition, they only used young recipients. Considering the importance of inflammatory aged niche in the myeloid-biased lineage output, transplanting young vs old LT-HSCs into aged mice will complete the whole picture. 

      We sincerely appreciate your insightful comment regarding the existence of approximately 500,000 HSCs per mouse in older mice. To address this, we have conducted a statistical analysis to determine the appropriate sample size needed to estimate the characteristics of a population of 500,000 cells with a 95% confidence level and a ±5% margin of error. This calculation was performed using the finite population correction applied to Cochran’s formula.

      For our calculations, we used a proportion of 50% (p = 0.5), as it has been reported that approximately 50% of HSCs are myeloid-biased1,2. The formula used is as follows:

      N \= 500,000 (total population size)

      Z = 1.96 (Z-score for a 95% confidence level)

      p = 0.5 (expected proportion)

      e \= 0.05 (margin of error)

      Applying this formula, we determined that the required sample size is approximately 384 cells. This sample size ensures that the observed proportion in the sample will reflect the characteristics of the entire population. In our study, we have conducted functional experiments across Figures 2, 3, 5, 6, S3, and S6, with a total sample size of n = 126, which corresponds to over 1260 cells. While it would be ideal to analyze all 500,000 cells, this would necessitate the use of 50,000 recipient mice, which is not feasible. We believe that the number of cells analyzed is reasonable from a statistical standpoint. 

      References

      (1) Dykstra, Brad et al. “Clonal analysis reveals multiple functional defects of aged murine hematopoietic stem cells.” The Journal of experimental medicine vol. 208,13 (2011): 2691-703. doi:10.1084/jem.20111490

      (2) Beerman, Isabel et al. “Functionally distinct hematopoietic stem cells modulate hematopoietic lineage potential during aging by a mechanism of clonal expansion.” Proceedings of the National Academy of Sciences of the United States of America vol. 107,12 (2010): 5465-70. doi:10.1073/pnas.1000834107

      (2) Authors' molecular data analyses need more rigor with unbiased approaches. They claimed that neither aged LT-HSCs nor aged ST-HSCs exhibited myeloid or lymphoid gene set enrichment but aged bulk HSCs, which are just a sum of LTHSCs and ST-HSCs by their gating scheme (Fig. 4A), showed the "tendency" of enrichment of myeloid-related genes based on the selected gene set (Fig. 4D). Although the proportion of ST-HSCs is reduced in bulk HSCs upon aging, since STHSCs do not exhibit lymphoid gene set enrichment based on their data, it is hard to understand how aged bulk HSCs have more myeloid gene set enrichment compared to young bulk HSCs. This bulk HSC data rather suggest that there could be a trend toward certain lineage bias (although not significant) in aged LT-HSCs or ST-HSCs. Authors need to verify the molecular lineage priming of LT-HSCs and ST-HSCs using another comprehensive dataset. 

      Thank you for your thoughtful feedback regarding the lack of myeloid or lymphoid gene set enrichment in aged LT-HSCs and aged ST-HSCs, despite the observed tendency for myeloid-related gene enrichment in aged bulk HSCs.

      First, we acknowledge that the GSEA results vary among the different myeloid gene sets analyzed (Fig. 4, D–F; Fig. S4, C–D). Additionally, a comprehensive analysis of mouse HSC aging using multiple RNA-seq datasets reported that nearly 80% of differentially expressed genes show poor reproducibility across datasets[1]. These factors highlight the challenges of interpreting lineage bias in HSCs based solely on previously published transcriptomic data.

      Given these points, we believe that emphasizing functional experimental results is more critical than incorporating an additional dataset to support our claim. In this regard, we have confirmed that young and aged LT-HSCs have similar differentiation capacity (Figure 3), while myeloid-biased hematopoiesis is observed in aged bulk HSCs (Figure S3). These findings are further corroborated by independent functional experiments. We sincerely appreciate your insightful comments.

      Reference

      (1) Flohr Svendsen, Arthur et al. “A comprehensive transcriptome signature of murine hematopoietic stem cell aging.” Blood vol. 138,6 (2021): 439-451. doi:10.1182/blood.2020009729

      (3) Although authors could not find any molecular evidence for myeloid-biased hematopoiesis from old HSCs (either LT or ST), they argued that the ratio between LT-HSC and ST-HSC causes myeloid-biased hematopoiesis upon aging based on young HSC experiments (Fig. 6). However, old ST-HSC functional data showed that they barely contribute to blood production unlike young Hoxb5- HSCs (ST-HSC) in the transplantation setting (Fig. 2). Is there any evidence that in unperturbed native old hematopoiesis, old Hoxb5- HSCs (ST-HSC) still contribute to blood production?

      If so, what are their lineage potential/output? Without this information, it is hard to argue that the different ratio causes myeloid-biased hematopoiesis in aging context. 

      Thank you for the insightful and important question. The post-transplant chimerism of ST-HSCs was low in Fig. 2, indicating that transplantation induced a short-term loss of hematopoietic potential due to hematopoietic stress per cell. 

      To reduce this stress, we increased the number of HSCs in transplantation setting. In Fig. S6, old LT-HSCs and old ST-HSCs were transplanted in a 50:50 or 20:80 ratio, respectively. As shown in Fig. S6.D, the 20:80 group, which had a higher proportion of old ST-HSCs, exhibited a statistically significant increase in the lymphoid percentage in the peripheral blood post-transplantation. 

      These findings suggest that old ST-HSCs contribute to blood production following transplantation. 

      Reviewer #2 (Public review):

      While aspects of their work are fascinating and might have merit, several issues weaken the overall strength of the arguments and interpretation. Multiple experiments were done with a very low number of recipient mice, showed very large standard deviations, and had no statistically detectable difference between experimental groups. While the authors conclude that these experimental groups are not different, the displayed results seem too variable to conclude anything with certainty. The sensitivity of the performed experiments (e.g. Fig 3; Fig 6C, D) is too low to detect even reasonably strong differences between experimental groups and is thus inadequate to support the author's claims. This weakness of the study is not acknowledged in the text and is also not discussed. To support their conclusions the authors need to provide higher n-numbers and provide a detailed power analysis of the transplants in the methods section. 

      Response #2-1:

      Thank you for your important remarks. The power analysis for this experiment shows that power = 0.319, suggesting that more number may be needed. On the other hand, our method for determining the sample size in Figure 3 is as follows:

      (1) First, we checked whether myeloid biased change is detected in the bulk-HSC fraction (Figure S3). The results showed that the difference in myeloid output at 16 weeks after transplantation was statistically significant (young vs. aged = 7.2 ± 8.9 vs. 42.1 ± 35.5%, p = 0.01), even though n = 10.

      (2) Next, myeloid biased HSCs have been reported to be a fraction with high selfrenewal ability (2004, Blood). If myeloid biased HSCs increase with aging, the increase in myeloid biased HSCs in LT-HSC fraction would be detected with higher sensitivity than in the bulk-HSC fraction used in Figure S3.

      (3) However, there was no difference not only in p-values but also in the mean itself, young vs aged = 51.4±31.5% vs 47.4±39.0%, p = 0.82, even though n = 8 in Figure 3. Since there was no difference in the mean itself, it is highly likely that no difference will be detected even if n is further increased.

      Regarding Figure 6, we obtained a statistically significant difference and consider the sample size to be sufficient. In addition, we have performed various functional experiments (Figures 2, 5, 6 and S6), and have obtained consistent results that expansion of myeloid biased HSCs does not occur with aging in Hoxb5+HSCs fraction. Based on the above, we conclude that the LT-HSC fraction does not differ in myeloid differentiation potential with aging.

      As the authors attempt to challenge the current model of the age-associated expansion of myeloid-biased HSCs (which has been observed and reproduced by many different groups), ideally additional strong evidence in the form of single-cell transplants is provided. 

      Response #2-2:

      Thank you for the comments. As the reviewer pointed out, we hope we could reconfirm our results using single-cell level technology in the future.

      On the other hand, we have reported that the ratio of myeloid to lymphoid cells in the peripheral blood changes when the number of HSCs transplanted, or the number of supporting cells transplanted with HSCs, is varied[1-2]. Therefore, single-cell transplant data need to be interpreted very carefully to determine differentiation potential.

      From this viewpoint, future experiments will combine the Hoxb5 reporter system with a lineage tracing system that can track HSCs at the single-cell level over time. This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells. We have reflected this comment by adding the following sentences in the manuscript.

      [P19, L451] “In contrast, our findings should be considered in light of some limitations. In this report, we primarily performed ten to twenty cell transplantation assays. Therefore, the current theory should be revalidated using single-cell technology with lineage tracing system[3-4]. This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells.” 

      It is also unclear why the authors believe that the observed reduction of ST-HSCs relative to LT-HSCs explains the myeloid-biased phenotype observed in the peripheral blood. This point seems counterintuitive and requires further explanation. 

      Response #2-3:

      Thank you for your comment. We apologize for the insufficient explanation. Our data, as shown in Figures 3 and 4, demonstrate that the differentiation potential of LT-HSCs remains unchanged with age. Therefore, rather than suggesting that an increase in LT-HSCs with a consistent differentiation capacity leads to myeloidbiased hematopoiesis, it seems more accurate to highlight that the relative decrease in the proportion of ST-HSCs, which remain in peripheral blood as lymphocytes, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis.

      However, if we focus on the increase in the ratio of LT-HSCs, it is also plausible to explain that “with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis.”

      Based on my understanding of the presented data, the authors argue that myeloidbiased HSCs do not exist, as 

      a) they detect no difference between young/aged HSCs after transplant (mind low nnumbers and large std!!!); b) myeloid progenitors downstream of HSCs only show minor or no changes in frequency and c) aged LT-HSCs do not outperform young LT-HSC in myeloid output LT-HSCs in competitive transplants (mind low n-numbers and large std!!!). 

      However, given the low n-numbers and high variance of the results, the argument seems weak and the presented data does not support the claims sufficiently. That the number of downstream progenitors does not change could be explained by other mechanisms, for instance, the frequently reported differentiation short-cuts of HSCs and/or changes in the microenvironment. 

      Response #2-4:

      We appreciate the comments. As mentioned above, we will correct the manuscript regarding the sample size. Regarding the interpreting of the lack of increase in the percentage of myeloid progenitor cells in the bone marrow with age, it is instead possible that various confounding factors, such as differentiation shortcuts or changes in the microenvironment, are involved.

      However, even when aged LT-HSCs and young LT-HSCs are transplanted into the same recipient mice, the timing of the appearance of different cell fractions in peripheral blood is similar (Figure 3 of this paper). Therefore, we have not obtained data suggesting that clear shortcuts exist in the differentiation process of aged HSCs into neutrophils or monocytes. Additionally, it is currently consensually accepted that myeloid cells, including neutrophils and monocytes, differentiate from GMPs[1]. Since there is no changes in the proportion of GMPs in the bone marrow with age, we concluded that the differentiation potential into myeloid cells remains consistent with aging.

      "Then, we found that the myeloid lineage proportions from young and aged LT-HSCs were nearly comparable during the observation period after transplantation (Fig. 3, B and C)." 

      [Comment to the authors]: Given the large standard deviation and low n-numbers, the power of the analysis to detect differences between experimental groups is very low. Experimental groups with too large standard deviations (as displayed here) are difficult to interpret and might be inconclusive. The absence of clearly detectable differences between young and aged transplanted HSCs could thus simply be a false-negative result. The shown experimental results hence do not provide strong evidence for the author's interpretation of the data. The authors should add additional transplants and include a detailed power analysis to be able to detect differences between experimental groups with reasonable sensitivity. 

      Response #2-5:

      Thank you for providing these insights. Regarding the sample size, we have addressed this in Response #2-1.

      Line 293: "Based on these findings, we concluded that myeloid-biased hematopoiesis observed following transplantation of aged HSCs was caused by a relative decrease in ST-HSC in the bulk-HSC compartment in aged mice rather than the selective expansion of myeloid-biased HSC clones." 

      Couldn't that also be explained by an increase in myeloid-biased HSCs, as repeatedly reported and seen in the expansion of CD150+ HSCs? It is not intuitively clear why a reduction of ST-HSCs clones would lead to a myeloid bias. The author should try to explain more clearly where they believe the increased number of myeloid cells comes from. What is the source of myeloid cells if the authors believe they are not derived from the expanded population of myeloid-biased HSCs? t 

      Response #2-6:

      Thank you for pointing this out. We apologize for the insufficient explanation. We will explain using Figure 8 from the paper.

      First, our data show that LT-HSCs maintain their differentiation capacity with age, while ST-HSCs lose their self-renewal capacity earlier, so that only long-lived memory lymphocytes remain in the peripheral blood after the loss of selfrenewal capacity in ST-HSCs (Figure 8, upper panel). In mouse bone marrow, the proportion of LT-HSCs increases with age, while the proportion of ST-HSCs relatively decreases (Figure 8, lower panel and Figure S5). 

      Our data show that merely reproducing the ratio of LT-HSCs to ST-HSCs observed in aged mice using young LT-HSCs and ST-HSCs can replicate myeloidbiased hematopoiesis. This suggests that the increase in LT-HSC and the relative decrease in ST-HSC within the HSC compartment with aging are likely to contribute to myeloid-biased hematopoiesis.

      As mentioned earlier, since the differentiation capacity of LT-HSCs remain unchaged with age, it seems more accurate to describe that the relative decrease in the proportion of ST-HSCs, which retain long-lived memory lymphocytes in peripheral blood, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis.

      However, focusing on the increase in the proportion of LT-HSCs, it is also possible to explain that “with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis.”

      Recommendations for the authors: 

      Reviewer #2 (Recommendations for the authors):

      Summary: 

      Comment #2-1: While aspects of their work are fascinating and might have merit, several issues weaken the overall strength of the arguments and interpretation. Multiple experiments were done with a very low number of recipient mice, showed very large standard deviations, and had no statistically detectable difference between experimental groups. While the authors conclude that these experimental groups are not different, the displayed results seem too variable to conclude anything with certainty. The sensitivity of the performed experiments (e.g. Figure 3; Figure 6C, D) is too low to detect even reasonably strong differences between experimental groups and is thus inadequate to support the author's claims. This weakness of the study is not acknowledged in the text and is also not discussed. To support their conclusions the authors, need to provide higher n-numbers and provide a detailed power analysis of the transplants in the methods section. 

      Response #2-1

      Thank you for your important remarks. The power analysis for this experiment shows that power = 0.319, suggesting that more number may be needed. On the other hand, our method for determining the sample size in Figure 3 is as follows: 

      (1) First, we checked whether myeloid biased change is detected in the bulk-HSC fraction (Figure S3). The results showed that the difference in myeloid output at 16 weeks after transplantation was statistically significant (young vs. aged = 7.2 {plus minus} 8.9 vs. 42.1 {plus minus} 35.5%, p = 0.01), even though n = 10. 

      (2) Next, myeloid biased HSCs have been reported to be a fraction with high selfrenewal ability (2004, Blood). If myeloid biased HSCs increase with aging, the increase in myeloid biased HSCs in LT-HSC fraction would be detected with higher sensitivity than in the bulk-HSC fraction used in Figure S3. 

      (3) However, there was no difference not only in p-values but also in the mean itself, young vs aged = 51.4{plus minus}31.5% vs 47.4{plus minus}39.0%, p = 0.82, even though n = 8 in Figure 3. Since there was no difference in the mean itself, it is highly likely that no difference will be detected even if n is further increased. 

      Regarding Figure 6, we obtained a statistically significant difference and consider the sample size to be sufficient. In addition, we have performed various functional experiments (Figures 2, 5, 6 and S6), and have obtained consistent results that expansion of myeloid-biased HSCs does not occur with aging in Hoxb5+HSCs fraction. Based on the above, we conclude that the LT-HSC fraction does not differ in myeloid differentiation potential with aging. 

      [Comment for authors]  

      Paradigm-shifting extraordinary claims require extraordinary data. Unfortunately, the authors do not provide additional data to further support their claims. Instead, the authors argue the following: Because they were able to find significant differences between experimental groups in some experiments, the absence of significant differences in the results of other experiments must be correct, too. 

      This logic is in my view flawed. Any assay/experiment with highly variable data has a very low sensitivity to detect significant differences between groups. If, as in this case, the variance is as large as the entire dynamic range of the readout, it becomes impossible to be able to detect any difference. In these cases, it is not surprising and actually expected that the mean of the group is located close to the center of the dynamic range as is the case here (center of dynamic range: 50%). In other words, this means that the experiments are simply not reproducible. It is absolutely critical to remember that any experiment and its associated statistical analysis has 3 (!!!) instead of 2 possible outcomes: 

      (1) There is a statistically significant difference 

      (2) There is no statistically significant difference 

      (3) The results of the experiment are inconclusive because the replicates are too variable and the results are not reproducible.  

      While most of us are inclined to think about outcomes (1) or (2), outcome (3) cannot be neglected. While it might be painful to accept, the only way to address concerns about data reproducibility is to provide additional data, improve reproducibility, and lower the power of the analysis to an acceptable level (e.g. able to detect difference of 5-10% between groups). 

      Without going into the technical details, the example graph from the link below illustrates that with a power 0.319 as stated by the authors, approx. 25 transplants, instead of 8, would be required. 

      Typically, however, a power of 0.8 is a reasonable value for any power analysis (although it's not a very strong power either). Even if we are optimistic and assume that there might be a reasonably large difference between experimental groups (in the example above P2 = 0.6, which is actually not that large) we can estimate that we would need over 10 transplants per group to say with confidence that two experimental groups likely do not differ. With smaller differences, these numbers increase quickly to 20+ transplants per group as can be seen in the example graph using an Alpha of 0.1 above. 

      Further reading can be found here and in many textbooks or other online resources: https://power-analysis.com/effect_size.htm  https://tss.awf.poznan.pl/pdf-188978-110207? filename=Using%20power%20analysis%20to.pdf 

      Response:

      Thank you for your feedback. We fully agree with the reviewer that paradigmshifting claims must be supported by equally robust data. It has been welldocumented that the frequency of myeloid-biased HSCs increases with age, with reports indicating that over 50% of the HSC compartment in aged mice consists of myeloid-biased HSCs[1,2]. Based on this, we believe that if aged LT-HSCs were substantially myeloid-biased, the difference should be readily detectable.

      To further validate our findings, we showed the similar preliminary experiment. The resulting data are shown below (n = 8). 

      Author response image 1.

      (A) Experimental design for competitive co-transplantation assay. Ten CD45.2<sup>+</sup> young LT-HSCs and ten CD45.2<sup>+</sup> aged LT-HSCs were transplanted with 2 × 10<sup>5</sup> CD45.1<sup>+</sup>/CD45.2<sup>+</sup> supporting cells into lethally irradiated CD45.1<sup>+</sup> recipient mice (n \= 8). (B) Lineage output of young or aged LT-HSCs at 4, 8, 12, 16 weeks after transplantation. Each bar represents an individual mouse. *P < 0.05. **P < 0.01.

      While a slight increase in myeloid-biased hematopoiesis was observed in the aged LT-HSC fraction, the difference was not statistically significant. These new results are presented alongside the original Figure 3, which was generated using a larger sample size (n = 16).

      Author response image 2.

      (A) Experimental design for competitive co-transplantation assay. Ten CD45.2<sup>+</sup> young LT-HSCs and ten CD45.2<sup>+</sup> aged LT-HSCs were transplanted with 2 × 10<sup>5</sup> CD45.1<sup>+</sup>/CD45.2<sup>+</sup> supporting cells into lethally irradiated CD45.1<sup>+</sup> recipient mice (n \= 16). (B) Lineage output of young or aged LT-HSCs at 4, 8, 12, 16 weeks after transplantation. Each bar represents an individual mouse. 

      Consistent with the original data, aged LT-HSCs exhibited a lineage output that was nearly identical to that of young LT-HSCs. Nonetheless, as the reviewer rightly pointed out, we cannot completely exclude the possibility that subtle differences may exist but remain undetected. To address this, we have added the following sentence to the manuscript:  

      [P9, L200] “These findings unmistakably demonstrated that mixed/bulk-HSCs showed myeloid skewed hematopoiesis in PB with aging. In contrast, LT-HSCs maintained a consistent lineage output throughout life, although subtle differences between aged and young LT-HSCs may exist and cannot be entirely ruled out.”

      References

      (1) Dykstra, Brad et al. “Clonal analysis reveals multiple functional defects of aged murine hematopoietic stem cells.” The Journal of experimental medicine vol. 208,13 (2011): 2691-703. doi:10.1084/jem.20111490

      (2) Beerman, Isabel et al. “Functionally distinct hematopoietic stem cells modulate hematopoietic lineage potential during aging by a mechanism of clonal expansion.” Proceedings of the National Academy of Sciences of the United States of America vol. 107,12 (2010): 5465-70. doi:10.1073/pnas.1000834107

      Comment #2-3: It is also unclear why the authors believe that the observed reduction of STHSCs relative to LT-HSCs explains the myeloid-biased phenotype observed in the peripheral blood. This point seems counterintuitive and requires further explanation. 

      Response #2-3:  

      Thank you for your comment. We apologize for the insufficient explanation. Our data, as shown in Figures 3 and 4, demonstrate that the differentiation potential of LTHSCs remains unchanged with age. Therefore, rather than suggesting that an increase in LT-HSCs with a consistent differentiation capacity leads to myeloid biased hematopoiesis, it seems more accurate to highlight that the relative decrease in the proportion of ST-HSCs, which remain in peripheral blood as lymphocytes, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis. However, if we focus on the increase in the ratio of LT-HSCs, it is also plausible to explain that "with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis." 

      [Comment for authors] 

      While this interpretation of the data might make sense the shown data do not exclude alternative explanations. The authors do not exclude the possibility that LTHSCs expand with age and that this expansion in combination with an aging microenvironment drives myeloid bias. The authors should quantify the frequency [%] and absolute number of LT-HSCs and ST-HSCs in young vs. aged animals. Especially analyzing the abs. numbers of cells will be important to support their claims as % can be affected by changes in the frequency of other populations. 

      Thank you for your very important point. As this reviewer pointed out, we do not exclude the possibility that the combination of aged microenvironment drives myeloid bias. Additionally, we acknowledge that myeloid-biased hematopoiesis with age is a complex process likely influenced by multiple factors. We would like to discuss the mechanism mentioned as a future research direction. Thank you for the insightful feedback. Regarding the point about the absolute cell numbers mentioned in the latter half of the paragraph, we will address this in detail in our subsequent response (Response #2-4).

      Comment #2-4: Based on my understanding of the presented data, the authors argue that myeloid-biased HSCs do not exist, as a) they detect no difference between young/aged HSCs after transplant (mind low n-numbers and large std!); b) myeloid progenitors downstream of HSCs only show minor or no changes in frequency and c) aged LT-HSCs do not outperform young LT-HSCs in myeloid output LTHSCs in competitive transplants (mind low n-numbers and large std!). However, given the low n-numbers and high variance of the results, the argument seems weak and the presented data does not support the claims sufficiently. That the number of downstream progenitors does not change could be explained by other mechanisms, for instance, the frequently reported differentiation short-cuts of HSCs and/or changes in the microenvironment. 

      Response #2-4:  

      We appreciate the comments. As mentioned above, we will correct the manuscript regarding the sample size. Regarding the interpreting of the lack of increase in the percentage of myeloid progenitor cells in the bone marrow with age, it is instead possible that various confounding factors, such as differentiation shortcuts or changes in the microenviroment, are involved. However, even when aged LT-HSCs and young LT-HSCs are transplanted into the same recipient mice, the timing of the appearance of different cell fractions in peripheral blood is similar (Figure 3 of this paper). Therefore, we have not obtained data suggesting that clear shortcuts exist in the differentiation process of aged HSCs into neutrophils or monocytes. Additionally, it is currently consensually accepted that myeloid cells, including neutrophils and monocytes, differentiate from GMPs1. Since there are no changes in the proportion of GMPs in the bone marrow with age, we concluded that the differentiation potential into myeloid cells remains consistent with aging. 

      Reference 

      (1) Akashi K and others, 'A Clonogenic Common Myeloid Progenitor That Gives Rise to All Myeloid Lineages', Nature, 404.6774 (2000), 193-97. 

      [Comment for authors] 

      As the relative frequency of cell population can be misleading, the authors should compare the absolute numbers of progenitors in young vs. aged mice to strengthen their argument. It would also be helpful to quantify the absolute numbers and relative frequencies in WT mice to exclude the possibility the HoxB5-trimcherry mouse model suffers from unexpected aging phenotypes and the hematopoietic system differs from wild-type animals.

      Thank you for your valuable feedback. We understand the importance of comparing the absolute numbers of progenitors in young versus aged mice to provide a more accurate representation of the changes in cell populations.

      Therefore, we quantified the absolute cell count of hematopoietic cells in the bone marrow using flow cytometry data. 

      Author response image 3.

      As previously reported, we observed a 10-fold increase in the number of pHSCs in aged mice compared to young mice. Additionally, our analysis revealed a statistically significant decrease in the number of Flk2+ progenitors and CLPs in aged mice. On the other hand, there was no statistically significant change in the number of myeloid progenitors between the two age groups. We appreciate the suggestion and hope that this additional information strengthens our argument and addresses your concerns.

      Comment #2-5:  

      "Then, we found that the myeloid lineage proportions from young and aged LT-HSCs were nearly comparable during the observation period after transplantation (Figure 3, B and C)." Given the large standard deviation and low n-numbers, the power of the analysis to detect differences between experimental groups is very low. Experimental groups with too large standard deviations (as displayed here) are difficult to interpret and might be inconclusive. The absence of clearly detectable differences between young and aged transplanted HSCs could thus simply be a false-negative result. The shown experimental results hence do not provide strong evidence for the author's interpretation of the data. The authors should add additional transplants and include a detailed power analysis to be able to detect differences between experimental groups with reasonable sensitivity. 

      Response #2-5:  

      Thank you for providing these insights. Regarding the sample size, we have addressed this in Response #2-1. 

      [Comment for authors]  

      As explained in detail in the response to #2-1 the provided arguments are not convincing. As the authors pointed out, the power of these experiments is too low to make strong claims. If the author does not intend to provide new data, the language of the manuscript needs to be adjusted to reflect this weakness. A paragraph discussing the limitations of the study mentioning the limited power of the data should be included beyond the above-mentioned rather vague statement that the data should be validated (which is almost always necessary anyway). 

      Thank you for your valuable comment. We agree with the importance of discussing potential limitations in our experimental design. In response to the reviewer’s suggestion, we have revised the manuscript to include the following sentences:

      [P19, L434] "In the co-transplantation assay shown in Figure 3, the myeloid lineage output derived from young and aged LT-HSCs was comparable (Young LT-HSC: 51.4 ± 31.5% vs. Aged LT-HSC: 47.4 ± 39.0%, p = 0.82). Although no significant difference was detected, the small sample size (n = 8) may limit the sensitivity of the assay to detect subtle myeloid-biased phenotypes."

      This addition acknowledges the potential limitations of our analysis and highlights the need for further investigation with larger cohorts.

      Comment #2-6:

      Line 293: "Based on these findings, we concluded that myeloid biased hematopoiesis observed following transplantation of aged HSCs was caused by a relative decrease in ST-HSC in the bulk-HSC compartment in aged mice rather than the selective expansion of myeloid-biased HSC clones." Couldn't that also be explained by an increase in myeloid-biased HSCs, as repeatedly reported and seen in the expansion of CD150+ HSCs? It is not intuitively clear why a reduction of STHSCs clones would lead to a myeloid bias. The author should try to explain more clearly where they believe the increased number of myeloid cells comes from. What is the source of myeloid cells if the authors believe they are not derived from the expanded population of myeloid-biased HSCs?

      Response #2-6:

      Thank you for pointing this out. We apologize for the insufficient explanation. We will explain using attached Figure 8 from the paper. First, our data show that LT-HSCs maintain their differentiation capacity with age, while ST-HSCs lose their self-renewal capacity earlier, so that only long-lived memory lymphocytes remain in the peripheral blood after the loss of self-renewal capacity in ST-HSCs (Figure 8, upper panel). In mouse bone marrow, the proportion of LT-HSCs increases with age, while the proportion of STHSCs relatively decreases (Figure 8, lower panel and Figure S5).

      Our data show that merely reproducing the ratio of LT-HSCs to ST-HSCs observed in aged mice using young LT-HSCs and ST-HSCs can replicate myeloid-biased hematopoiesis. This suggests that the increase in LT-HSC and the relative decrease in ST-HSC within the HSC compartment with aging are likely to contribute to myeloid-biased hematopoiesis.

      As mentioned earlier, since the differentiation capacity of LT-HSCs remain unchanged with age, it seems more accurate to describe that the relative decrease in the proportion of STHSCs, which retain long-lived memory lymphocytes in peripheral blood, leading to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis. However, focusing on the increase in the proportion of LT-HSCs, it is also possible to explain that "with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells become relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid biased hematopoiesis."

      [Comment for authors]

      While I can follow the logic of the argument, my concerns about the interpretation remain as I see discrepancies in other findings in the published literature. For instance, what the authors call ST-HSCs, differs from the classical functional definition of ST-HSCs. It is thus difficult to relate the described observations to previous reports. ST-HSCs typically can contribute significantly to multiple lineages for several weeks (see for example PMID: 29625072). It is somewhat surprising that the ST-HSC in this study don't show this potential and loose their potential much quicker.

      The authors should thus provide a more comprehensive depth of immunophenotypic and molecular characterization to compare their LT-HSCs to ST-HSCs. For instance, are LT-HSCs CD41- HSCs? How do ST-HSCs differ in their surface marker expression from previously used definitions of ST-HSCs? A list of differentially expressed genes between young and old LT-HSCs and ST-HSCs should be done and will likely provide important insights into the molecular programs/markers (beyond the provided GO analysis, which seems superficial).

      Thank you for your valuable feedback. As the reviewer noted, there are indeed multiple definitions of ST-HSCs. We appreciate the opportunity to clarify our definitions of ST-HSCs. We define ST-HSCs functionally, rather than by surface antigens, which we believe is the most classical and widely accepted definition [1]. In our study, we define long-term hematopoietic stem cells (LT-HSCs) as those HSCs that continue to contribute to hematopoiesis after a second transplantation and possess long-term self-renewal potential. Conversely, we define short-term hematopoietic stem cells (ST-HSCs) as those HSCs that do not contribute to hematopoiesis after a second transplantation and only exhibit self-renewal potential in the short term. 

      Next, in the paper referenced by the reviewer[2], the chimerism of each fraction of ST-HSCs also peaked at 4 weeks and then decreased to approximately 0.1% after 12 weeks post-transplantation. Author response image 5 illustrates our ST-HSC donor chimerism in Figure 2. We believe that data in the paper referenced by the reviewer2 is consistent with our own observations of the hematopoietic pattern following ST-HSC transplantation, indicating a characteristic loss of hematopoietic potential 4 weeks after the transplantation. Furthermore, as shown in Figures 2D and 2F, the fraction of ST-HSCs does not exhibit hematopoietic activity after the second transplantation. Therefore, we consider this fraction to be ST-HSCs.

      Author response image 4.

      Additionally, the RNAseq data presented in Figures 4 and S4 revealed that the GSEA results vary among the different myeloid gene sets analyzed (Fig. 4, D–F; Fig. S4, C–D). Moreover, a comprehensive analysis of mouse HSC aging using multiple RNA-seq datasets reported that nearly 80% of differentially expressed genes show poor reproducibility across datasets[3]. From the above, while RNAseq data is indeed helpful, we believe that emphasizing functional experimental results is more critical than incorporating an additional dataset to support our claim. Thank you once again for your insightful feedback.

      References

      (1) Kiel, Mark J et al. “SLAM family receptors distinguish hematopoietic stem and progenitor cells and reveal endothelial niches for stem cells.” Cell vol. 121,7 (2005): 1109-21. doi:10.1016/j.cell.2005.05.026

      (2) Yamamoto, Ryo et al. “Large-Scale Clonal Analysis Resolves Aging of the Mouse Hematopoietic Stem Cell Compartment.” Cell stem cell vol. 22,4 (2018): 600-607.e4. doi:10.1016/j.stem.2018.03.013

      (3) Flohr Svendsen, Arthur et al. “A comprehensive transcriptome signature of murine hematopoietic stem cell aging.” Blood vol. 138,6 (2021): 439-451. doi:10.1182/blood.2020009729

      Reviewer #3 (Public review): 

      Although the topic is appropriate and the new model provides a new way to think about lineage-biased output observed in multiple hematopoietic contexts, some of the experimental design choices, as well as some of the conclusions drawn from the results could be substantially improved. Also, they do not propose any potential mechanism to explain this process, which reduces the potential impact and novelty of the study. 

      The authors have satisfactorily replied to some of my comments. However, there are multiple key aspects that still remain unresolved.

      Reviewer #3 (Recommendations for the authors): 

      Comment #3-1,2:  

      Although the additional details are much appreciated the core of my original comments remains unanswered. There are still no details about the irradiation dose for each particular experiment. Is any transplant performed using a 9.1 Gy dose? If yes, please indicate it in text or figure legend. If not, please remove this number from the corresponding method section. 

      Again, 9.5 Gy (split in two doses) is commonly reported as sublethal. The fact that the authors used a methodology that deviates from the "standard" for the field makes difficult to put these results in context with previous studies. It is not possible to know if the direct and indirect effects of this conditioning method in the hematopoietic system have any consequences in the presented results. 

      Thank you for your clarification. We confirm that none of the transplantation experiments described were performed using a 9.1 Gy irradiation dose. We have therefore removed the mention of "9.1 Gy" from the relevant section of the Materials and Methods. We appreciate helpful suggestion to improve the clarity of the manuscript.

      [P22, L493] “12-24 hours prior to transplantation, C57BL/6-Ly5.1 mice, or aged C57BL/6J recipient mice were lethally irradiated with single doses of 8.7 Gy.”

      Regarding the reviewer’s concern about the radiation dose used in our experiments, we will address this point in more detail in our subsequent response (see Response #3-4).

      Comment #3-4(Original): When representing the contribution to PB from transplanted cells, the authors show the % of each lineage within the donor-derived cells (Figures 3B-C, 5B, 6B-D, 7C-E, and S3 B-C). To have a better picture of total donor contribution, total PB and BM chimerism should be included for each transplantation assay. Also, for Figures 2C-D and Figures S2A-B, do the graphs represent 100% of the PB cells? Are there any radioresistant cells?

      Response #3-4 (Original): Thank you for highlighting this point. Indeed, donor contribution to total peripheral blood (PB) is important information. We have included the donor contribution data for each figure above mentioned.

      In Figure 2C-D and Figure S2A-B, the percentage of donor chimerism in PB was defined as the percentage of CD45.1-CD45.2+ cells among total CD45.1-CD45.2+ and CD45.1+CD45.2+ cells as described in method section.

      Comment for our #3-4 response:  

      Thanks for sharing these data. These graphs should be included in their corresponding figures along with donor contribution to BM. 

      Regarding Figure2 C-D, as currently shown, the graphs only account for CD45.1CD45.2+ (donor-derived) and CD45.1+CD45.2+ (supporting-derived). What is the percentage of CD45.1+CD45.2- (recipient-derived)? Since the irradiation regiment is atypical, including this information would help to know more about the effects of this conditioning method. 

      Thank you for your insightful comment regarding Figure 2C-D. To address the concern that the reviewer pointed out, we provide the kinetics of the percentage of CD45.1+CD45.2- (recipient-derived) in Author response image 7.

      Author response image 5.

      As the reviewer pointed out, we observed the persistence of recipient-derived cells, particularly in the secondary transplant. As noted, this suggests that our conditioning regimen may have been suboptimal. In response, we will include the donor chimerism analysis in the total cells and add the following statement in the study limitations section to acknowledge this point:

      [P19, L439] “Additionally, in this study, we purified LT-HSCs using the Hoxb5 reporter system and employed a moderate conditioning regimen (8.7 Gy). To have a better picture of total donor contribution, total PB chimerism are presented in Figure S7 and we cannot exclude the possibility that these factors may have influenced the results. Therefore, it would be ideal to validate our findings using alternative LT-HSC markers and different conditioning regimens.”

      Comment #3-5: For BM progenitor frequencies, the authors present the data as the frequency of cKit+ cells. This normalization might be misleading as changes in the proportion of cKit+ between the different experimental conditions could mask differences in these BM subpopulations. Representing this data as the frequency of BM single cells or as absolute numbers (e.g., per femur) would be valuable.

      Response #3-5:

      We appreciate the reviewer's comment on this point. 

      Firstly, as shown in Supplemental Figures S1B and S1C, we analyze the upstream (HSC, MPP, Flk2+) and downstream (CLP, MEP, CMP, GMP) fractions in different panels. Therefore, normalization is required to assess the differentiation of HSCs from upstream to downstream.

      Additionally, the reason for normalizing by c-Kit+ is that the bone marrow analysis was performed after enrichment using the Anti-c-Kit antibody for both upstream and downstream fractions. Based on this, we calculated the progenitor populations as a frequency within the c-Kit positive cells. Next, the results of normalizing the whole bone marrow cells (live cells) are shown below. 

      Author response image 6.

      Similar to the results of normalizing c-Kit+ cells, myeloid progenitors remained unchanged, including a statistically significant decrease in CMP in aged mice. Additionally, there were no significant differences in CLP. In conclusion, similar results were obtained between the normalization with c-Kit and the normalization with whole bone marrow cells (live cells).

      However, as the reviewer pointed out, it is necessary to explain the reason for normalization with c-Kit. Therefore, we will add the following description.

      [P21, L502] For the combined analysis of the upstream (HSC, MPP, Flk2+) and downstream (CLP, MEP, CMP, GMP) fractions in Figures 1B, we normalized by cKit+ cells because we performed a c-Kit enrichment for the bone marrow analysis.

      Comment for our #3-5 response:

      I understand that normalization is necessary to compare across different BM populations. However, the best way would be to normalize to single cells. As I mentioned in my original comment, normalizing to cKit+ cells could be misleading, as the proportion of cKit+ cells could be different across the experimental conditions. Further, enriching for cKit+ cells when analyzing BM subpopulation frequencies could introduce similar potential errors. The enrichment would depend on the level of expression of cKit for each of these population, what would alter the final quantification. Indeed, CLP are typically defined as cKit-med/low. Thus, cKit enrichment would not be a great method to analyze the frequency of these cells. 

      The graph in the authors' response to my comment, show similar trend to what is represented Figure 1B for some populations. However, there are multiple statistically significant changes that disappear in this new version. This supports my original concern and, in consequence, I would encourage to represent this data as the frequency of BM single cells or as absolute numbers (e.g., per femur). 

      Thank you for your thoughtful follow-up comment. In response to the reviewer’s suggestion, we will represent the data as the frequency among total BM single cells. These revised graphs have been incorporated into the updated Figure 7F and corresponding figure legend have been revised accordingly to accurately reflect these representations. We appreciate your valuable input, which has helped us improve the clarity and rigor of our data presentation.

      Comment #3-6: Regarding Figure 1B, the authors argue that if myeloid-biased HSC clones increase with age, they should see increased frequency of all components of the myeloid differentiation pathway (CMP, GMP, MEP). This would imply that their results (no changes or reduction in these myeloid subpopulations) suggest the absence of myeloid-biased HSC clones expansion with age. This reviewer believes that differentiation dynamics within the hematopoietic hierarchy can be more complex than a cascade of sequential and compartmentalized events (e.g., accelerated differentiation at the CMP level could cause exhaustion of this compartment and explain its reduction with age and why GMP and MEP are unchanged) and these conclusions should be considered more carefully.

      Response #3-6:

      We wish to thank the reviewer for this comment. We agree with that the differentiation pathway may not be a cascade of sequential events but could be influenced by various factors such as extrinsic factors.

      In Figure 1B, we hypothesized that there may be other mechanisms causing myeloid-biased hematopoiesis besides the age-related increase in myeloid-biased HSCs, given that the percentage of myeloid progenitor cells in the bone marrow did not change with age. However, we do not discuss the presence or absence of myeloid-biased HSCs based on the data in Figure 1B. 

      Our newly proposed theories—that the differentiation capacity of LT-HSCs remains unchanged with age and that age-related myeloid-biased hematopoiesis is due to changes in the ratio of LT-HSCs to ST-HSCs—are based on functional experiment results. As the reviewer pointed out, to discuss the presence or absence of myeloid-biased HSCs based on the data in Figure 1B, it is necessary to apply a system that can track HSC differentiation at single-cell level. The technology would clarify changes in the self-renewal capacity of individual HSCs and their differentiation into progenitor cells and peripheral blood cells. The authors believe that those single-cell technologies will be beneficial in understanding the differentiation of HSCs. Based on the above, the following statement has been added to the text.

      [P19, L440] In contrast, our findings should be considered in light of some limitations. In this report, we primarily performed ten to twenty cell transplantation assays. Therefore, the current theory should be revalidated using single-cell technology with lineage tracing system1-2. This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells. 

      Comment for our #3-6 response:

      Thanks for the response. My original comments referred to the statement "On the other hand, in contrast to what we anticipated, the frequency of GMP was stable, and the percentage of CMP actually decreased significantly with age, defying our prediction that the frequency of components of the myeloid differentiation pathway, such as CMP, GMP, and MEP would increase in aged mice if myeloid-biased HSC clones increase with age (Fig. 1 B)" (lines #129-133). Again, the absence of an increase in CMP, GMP and MEP with age does not mean the absence of and increase in myeloid-biased HSC clones. This statement should be considered more carefully. 

      Thank you for the insightful comment. We agree that the absence of an increase in CMP, GMP and MEP with age does not mean the absence of an increase in myeloid-biased HSC clones. In our revised manuscript, we have refined the statement to acknowledge this nuance more clearly. The updated text now reads as follows:

      P6, L129] On the other hand, in contrast to what we anticipated, the frequency of GMP was stable, and the percentage of CMP actually decreased significantly with age, defying our prediction that the frequency of components of the myeloid differentiation pathway, such as CMP, GMP, and MEP may increase in aged mice, if myeloid-biased HSC clones increase with age. 

      Comment #3-7: Within the few recipients showing good donor engraftment in Figure 2C, there is a big proportion of T cells that are "amplified" upon secondary transplantation (Figure 2D). Is this expected?

      Response #3-7:

      We wish to express our deep appreciation to the reviewer for insightful comment on this point. As the reviewers pointed out, in Figure 2D, a few recipients show a very high percentage of T cells. The authors had the same question and considered this phenomenon as follows:

      (1) One reason for the very high percentage of T cells is that we used 1 x 107 whole bone marrow cells in the secondary transplantation. Consequently, the donor cells in the secondary transplantation contained more T-cell progenitor cells, leading to a greater increase in T cells compared to the primary transplantation.

      (2) We also consider that this phenomenon may be influenced by the reduced selfrenewal capacity of aged LT-HSCs, resulting in decreased sustained production of myeloid cells in the secondary recipient mice. As a result, long-lived memorytype lymphocytes may preferentially remain in the peripheral blood, increasing the percentage of T cells in the secondary recipient mice.

      We have discussed our hypothesis regarding this interesting phenomenon. To further clarify the characteristics of the increased T-cell count in the secondary recipient mice, we will analyze TCR clonality and diversity in the future.

      Comment for our #3-7 response:

      Thanks for the potential explanations to my question. This fact is not commonly reported in previous transplantation studies using aged HSCs. Could Hoxb5 label fraction of HSCs that is lymphoid/T-cell biased upon secondary transplantation? The number of recipients with high frequency of lymphoid cells in the peripheral blood (even from young mice) is remarkable. 

      Response:

      Thank you for your insightful suggestion. Based on this comment, we calculated the percentage of lymphoid cells in the donor fraction at 16 weeks following the secondary transplantation, which was 56.1 ± 25.8% (L/M = 1.27). According to the Müller-Sieburg criteria, lymphoid-biased hematopoiesis is defined as having an L/M ratio greater than 10. 

      Given our findings, we concluded that the Hoxb5-labeled fraction does not specifically indicate lymphoid-biased hematopoiesis. We sincerely appreciate the valuable input, which helped us to further clarify the interpretation of our results.

      Comment #3-8: Do the authors have any explanation for the high level of variabilitywithin the recipients of Hoxb5+ cells in Figure 2C?

      Response #3-8:

      We appreciate the reviewer's comment on this point. As noted in our previous report, transplantation of a sufficient number of HSCs results in stable donor chimerism, whereas a small number of HSCs leads to increased variability in donor chimerism1. Additionally, other studies have observed high variability when fewer than 10 HSCs are transplanted2-3. Based on this evidence, we consider that the transplantation of a small number of cells (10 cells) is the primary cause of the high level of variability observed.

      Comment for our #3-8 response:

      I agree that transplanting low number of HSC increases the mouse-to-mouse variability. For that reason, a larger cohort of recipients for this kind of experiment would be ideal. 

      Response:

      Thank you for the insightful comment. We agree that a larger cohort of recipients would be ideal for this type of experiment. In Figure 2, the difference between Hoxb5<suup>+</sup> and Hoxb5⁻ cells are robust, allowing for a clear statistical distinction despite the cohort size. However, we also recognize that a larger cohort would be necessary to detect more subtle differences, particularly in Figure 3. In response, we have added the following statement to the main text to acknowledge this limitation.

      P9, L200] These findings unmistakably demonstrated that mixed/bulk-HSCs showed myeloid skewed hematopoiesis in PB with aging. In contrast, LT-HSCs maintained a consistent lineage output throughout life, although subtle differences between aged and young LT-HSCs may exist and cannot be entirely ruled out.

      Comment #3-10: Is Figure 2G considering all primary recipients or only the ones that were used for secondary transplants? The second option would be a fairer comparison.

      Response #3-10:

      We appreciate the reviewer's comment on this point. We considered all primary recipients in Figure 2G to ensure a fair comparison, given the influence of various factors such as the radiosensitivity of individual recipient mice[1]. Comparing only the primary recipients used in the secondary transplantation would result in n = 3 (primary recipient) vs. n = 12 (secondary recipient). Including all primary recipients yields n = 11 vs. n = 12, providing a more balanced comparison. Therefore, we analyzed all primary recipient mice to ensure the reliability of our results.

      Comment for our #3-10 response:

      I respectfully disagree. Secondary recipients are derived from only 3 of the primary recipients. Therefore, the BM composition is determined by the composition of their donors. Including primary recipients that are not transplanted into secondary recipients for is not the fairest comparison for this analysis. 

      Thank you for your comment and for highlighting this important issue. We acknowledge the concern that including primary recipients that are not transplanted into secondary recipients is not the fairest comparison for this analysis. In response, we have reanalyzed the data using only the primary recipients whose bone marrow was actually transplanted into secondary recipients. 

      Author response image 7.

      Importantly, the reanalysis confirmed that the kinetics of myeloid cell proportions in peripheral blood were consistent between primary and secondary transplant recipients. We sincerely appreciate your thoughtful feedback, which has helped us improve the clarity.

      Comment #3-11: When discussing the transcriptional profile of young and aged HSCs, the authors claim that genes linked to myeloid differentiation remain unchanged in the LT-HSC fraction while there are significant changes in the STHSCs. However, 2 out of the 4 genes shown in Figure S4B show ratios higher than 1 in LT-HSCs.

      Response #3-11:

      Thank you for highlighting this important point. As the reviewer pointed out, when we analyze the expression of myeloid-related genes, some genes are elevated in aged LT-HSCs compared to young LT-HSCs. However, the GSEA analysis using myeloid-related gene sets, which include several hundred genes, shows no significant difference between young and aged LT-HSCs (see Figure S4C in this paper). Furthermore, functional experiments using the co-transplantation system show no difference in differentiation capacity between young and aged LT-HSCs (see Figure 3 in this paper). Based on these results, we conclude that LT-HSCs do not exhibit any change in differentiation capacity with aging.

      Comment for our #3-11 response:

      The authors used the data in Figure S4 to claim that "myeloid genes were tended to be enriched in aged bulk-HSCs but not in aged LT-HSCs compared to their respective controls" (this is the title of the figure; line # 1326). This is based on an increase in gene expression of CD150, vWF, Selp, Itgb3 in aged cells compared to young cells (Figure S4B). However, an increase in Selp and Itgb3 is also observed for LT-HSCs (lower magnitude, but still and increase). 

      Also, regarding the GSEA, the only term showing statistical significance in bulk HSCs is "Myeloid gene set", which does not reach significance in LT-HSCs, but present a trend for enrichment (q = 0.077). None of the terms in shown in this panel present statistical significance in ST-HSCs. 

      Thank you for your valuable point. As the reviewer noted, the current title may cause confusion. Therefore, we propose changing it to the following:

      [P52, L1331] “Figure S4. Compared to their respective young controls, aged bulk-HSCs exhibit greater enrichment of myeloid gene expression than aged LT-HSCs”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      Overall, the manuscript is very well written, the approaches used are clever, and the data were thoroughly analyzed. The study conveyed important information for understanding the circuit mechanism that shapes grid cell activity. It is important not only for the field of MEC and grid cells, but also for broader fields of continuous attractor networks and neural circuits.

      We appreciate the positive comments.

      (1) The study largely relies on the fact that ramp-like wide-field optogenetic stimulation and focal optogenetic activation both drove asynchronous action potentials in SCs, and therefore, if a pair of PV+ INs exhibited correlated activity, they should receive common inputs. However, it is unclear what criteria/thresholds were used to determine the level of activity asynchronization, and under these criteria, what percentage of cells actually showed synchronized or less asynchronized activity. A notable percentage of synchronized or less asynchronized SCs could complicate the results, i.e., PV+ INs with correlated activity could receive inputs from different SCs (different inputs), which had synchronized activity. More detailed information/statistics about the asynchronization of SC activity is necessary for interpreting the results.

      The percentage of SCs that show synchronised activity during ramping optogenetic activation is zero. To make this clear we've added new quantification to the analyses of simultaneously activated SCs in Figure 2, Figure Supplement 1. This includes confidence intervals for the correlograms and statistical comparisons of the correlograms to shuffled data from each pair of neurons. We also validate our statistical analysis strategy by showing that it successfully identifies autocorrelation peaks for the same cells.

      Synchronisation during focal optogenetic activation is also expected to be zero. We did not commit resources to experiments to directly test this for focal stimulation because we had already tested the possibility with ramping stimuli discussed above, and because the established biophysics of local SC circuits is such that synchronised activity during selective activation of SCs is unlikely. In particular, because direct excitatory connections between SCs are either rare or absent (Fuchs et al. 2016; Couey et al. 2013; Pastoll et al. 2013; Winterer et al. 2017), and when detected have small amplitude (Winterer et al. 2017), no mechanism exists that could drive synchronisation. The absence of coordination in responses to ramping stimuli quantified above is consistent with this conclusion.

      (2) The hypothesis about the "direct excitatory-inhibitory" synaptic interactions is made based on the GABAzine experiments in Figure 4. In the Figure 8 diagram, the direct interaction is illustrated between PV+ INs and SCs. However, the evidence supporting this "direct interaction" between these two cell types is missing. Is it possible that pyramidal cells are also involved in this interaction? Some pieces of evidence or discussions are necessary to further support the "direction interaction".

      We were insufficiently clear in our previous attempts to ground these interpretations in the context of previous work. The hypothesis about "direct excitatory-inhibitory" interactions wasn't made solely on the basis of Figure 4, but from multiple previous studies that directly demonstrate these interactions (e.g. Fuchs et al. 2016; Couey et al. 2013; Pastoll et al. 2013). Similarly, the diagram in Figure 8 doesn't only reflect the conclusions of the present study but integrates work from these and other previous studies.

      A possible role for pyramidal cells in coordination would require that they can be driven to fire action potentials by input from SCs. However, SCs appear not to connect to pyramidal cells (0/126 tested connections in Winterer et al. 2017). Thus, this possibility is inconsistent with the previously published data.

      To make these points clearer we have added additional discussion and citations to the results (p 5), discussion (p 11) and legend to Figure 8.

      Reviewer #2 (Public Review):

      In this study, Huang et al. employed optogenetic stimulation alongside paired whole-cell recordings in genetically defined neuron populations of the medial entorhinal cortex to examine the spatial distribution of synaptic inputs and the functional-anatomical structure of the MEC. They specifically studied the spatial distribution of synaptic inputs from parvalbumin-expressing interneurons to pairs of excitatory stellate cells. Additionally, they explored the spatial distribution of synaptic inputs to pairs of PV INs. Their results indicate that both pairs of SCs and PV INs generally receive common input when their relative somata are within 200-300 ums of each other. The research is intriguing, with controlled and systematic methodologies. There are interesting takeaways based on the implications of this work to grid cell network organization in MEC.

      We appreciate the positive comments.

      (1) Results indicate that in brain slices, nearby cells typically share a higher degree of common input. However, some proximate cells lack this shared input. The authors interpret these findings as: "Many cells in close proximity don't seem to share common input, as illustrated in Figures 3, 5, and 7. This implies that these cells might belong to separate networks or exist in distinct regions of the connectivity space within the same network.".

      Every slice orientation could have potentially shared inputs from an orthogonal direction that are unavoidably eliminated. For instance, in a horizontal section, shared inputs to two SCs might be situated either dorsally or ventrally from the horizontal cut, and thus removed during slicing. Given the synaptic connection distributions observed within each intact orientation, and considering these distributions appear symmetrically in both horizontal and sagittal sections, the authors should be equipped to estimate the potential number of inputs absent due to sectioning in the orthogonal direction. How might this estimate influence the findings, especially those indicating that many close neurons don't have shared inputs?

      We appreciate the suggestion, however systematically generating estimates that account in full for the relative position of the postsynaptic neurons, for variation in the organisation of their dendritic fields and for unknowns such as the location and number of synaptic contacts made, quickly leads to a large potential parameter space, while not advancing our understanding beyond qualitative assessment of the raw data.

      Given this, we make the following comments:

      'We note that the absence of correlated inputs in one slice plane does not rule out the possibility that the same cell pair receives common inputs in a different plane, as these inputs would most likely not be activated if the cell bodies of the presynaptic neuron were removed by slicing.' (p10) and:

      'The incompleteness may in part result from loss of some inputs by tissue slicing. However, the fact that axons were well preserved and typically extended beyond the range of functional correlations, while many cell pairs that did not receive correlated input were relatively close to one another and had overlapping dendritic fields, argues against tissue slicing being a major contributor to incompleteness.' (p10).

      (2) The study examines correlations during various light-intensity phases of the ramp stimuli. One wonders if the spatial distribution of shared (or correlated) versus independent inputs differs when juxtaposing the initial light stimulation phase, which begins to trigger spiking, against subsequent phases. This differentiation might be particularly pertinent to the PV to SC measurements. Here, the initial phase of stimulation, as depicted in Figure 7, reveals a relatively sparse temporal frequency of IPSCs. This might not represent the physiological conditions under which high-firing INs function.

      While the authors seem to have addressed parts of this concern in their focal stim experiments by examining correlations during both high and low light intensities, they could potentially extract this metric from data acquired in their ramp conditions. This would be especially valuable for PV to SC measurements, given the absence of corresponding focal stimulation experiments.

      As the reviewer's comments recognise, the consistent results with focal stimulation already provide direct experimental validation to our ramp stimulation approach. We appreciate the suggestion for further analysis, but as we understand it this analysis would be hard to interpret. First, variation between pairs in the activity at different phases of the light ramp will be confounded by slice to slice differences in the level of ChR2 expression, e.g. in Figure 2, Figure Supplement 1 within slice variability is low, whereas between slice variation is relatively high. This is because in slices with relatively low expression spike onset is relatively late, while in slices with relatively high expression spike onset is early in the ramp and later in the ramp neurons experience depolarising block. Second, the onset of changes in cross-correlation coefficients and lag variation is typically abrupt. This makes it challenging to assign windows to onset phases or to interpret the resulting data.

      (3) Re results from Figure 2: Please fully describe the model in the methods section. Generally, I like using a modeling approach to explore the impact of convergent synaptic input to PVs from SCs that could effectively validate the experimental approach and enhance the interpretability of the experimental stim/recording outcomes. However, as currently detailed in the manuscript, the model description is inadequate for assessing the robustness of the simulation outcomes. If the IN model is simply integrate-and-fire with minimal biophysical attributes, then the findings in Fig 2F results shown in Fig 2F might be trivial. Conversely, if the model offers a more biophysically accurate representation (e.g., with conductance-based synaptic inputs, synapses appropriately dispersed across the model IN dendritic tree, and standard PV IN voltage-gated membrane conductances), then the model's results could serve as a meaningful method to both validate and interpret the experiments.

      We have expanded the description of the modelling given in the methods including clearer motivation and justification (p 15). Two points are helpful to consider:

      First, the goal of the model is to assess the feasibility of the correlation based approach given the synaptic current responses recorded at the soma. We now make this clearer by stating that:

      'The goal of our simulations was to assess if analysis of cross-correlations between currents recorded from pairs of neurons could be used to establish whether they receive shared input from the same pre-synaptic neuron. While this should be obvious if neurons exclusively receive shared input, we wanted to establish whether shared input is detectable when each neuron also receives independent inputs of similar frequency and amplitude to the shared input.' (p 15).

      The suggestion that the results in Figure 2F are trivial doesn't make sense to us. Indeed, it strikes us as non-trivial that with this approach shared input from a single common presynaptic neuron is not detectable, but input from two or more is.

      Second, because we are simulating a somatic voltage-clamp experiment the details of the neuronal time constants, voltage-gated channels or other integrative mechanisms that reviewer suggests may be important here are not actually relevant to the interpretation. To appreciate this consider the membrane equation:

      When the membrane is clamped at a fixed potential, there is no capacitance current , while voltage-dependent ionic currents and the resting ionic current are constant. In this case the only time varying current is the synaptic current . Thus, adding more details would not make the model more 'meaningful' as these details would be redundant and the results will be the same as simply considering convolution of the synaptic conductances. We have made this rationale clearer in the revised methods (p 15).

      Reviewer #3 (Public Review):

      These are technically demanding experiments, but the authors show quite convincing differences in the correlated response of cell pairs that are close to each other in contrast to an absence of correlation in other cell pairs at a range of relative distances. This supports their main point of demonstrating anatomical clusters of cells receiving shared inhibitory input.

      We appreciate the positive comments.

      The overall technique is complex and the presentation could be more clear about the techniques and analysis.

      Thanks. We've added additional explanation to the methods section to try to improve clarity (p 15-16).

      In addition, due to this being a slice preparation they cannot directly relate the inhibitory interactions to the functional properties of grid cells which was possible in the 2-photon in vivo imaging experiment by Heys and Dombeck, 2014.

      We agree the two approaches are complementary. The Heys and Dombeck study could only reveal correlations in functional activity, which could have many possible synaptic mechanisms, whereas our results address synaptic organisation but the representational roles of the specific neurons we recorded from are unclear. We have highlighted these current limitations and strategies to address them in the final paragraph of the discussion (p 11).

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Reviewer #3 comment

      1) One suggestion for improvement is to consider incorporating the results from Figure S9 into in the main Figure 6, which would enhance readers' comprehension.

      We appreciate your valuable feedback. Based on the reviewer’s suggestion, we have incorporated results from the Figure S9 into the main Figure 6, as shown below. Manuscripts and figure legends have also been modified accordingly.

      Author response image 1.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors aim to assess the effect of salt stress on root:shoot ratio, identify the underlying genetic mechanisms, and evaluate their contribution to salt tolerance. To this end, the authors systematically quantified natural variations in salt-induced changes in root:shoot ratio. This innovative approach considers the coordination of root and shoot growth rather than exploring biomass and the development of each organ separately. Using this approach, the authors identified a gene cluster encoding eight paralog genes with a domain-of-unknown-function 247 (DUF247), with the majority of SNPs clustering into SR3G (At3g50160). In the manuscript, the authors utilized an integrative approach that includes genomic, genetic, evolutionary, histological, and physiological assays to functionally assess the contribution of their genes of interest to salt tolerance and root development.

      Comments on revisions:

      As the authors correctly noted, variations across samples, genotypes, or experiments make achieving statistical significance challenging. Should the authors choose to emphasize trends across experiments to draw biological conclusions, careful revisions of the text, including titles and figure legends, will be necessary to address some of the inconsistencies between figures (see examples below). However, I would caution that this approach may dilute the overall impact of the work on SR3G function and regulation. Therefore, I strongly recommend pursuing additional experimental evidence wherever possible to strengthen the conclusions.

      (1) Given the phenotypic differences shown in Figures S17A-B, 10A-C, and 6A, the statement that "SR3G does not play a role in plant development under non-stress conditions" (lines 680-681) requires revision to better reflect the observed data.

      Thank you to the reviewer for the comment. We appreciate the acknowledgment that variations among experiments are inherent to biological studies. Figures 6A and S17 represent the same experiment, which initially indicated a phenotype for the sr3g mutant under salt stress. To ensure that growth changes were specifically normalized for stress conditions, we calculated the Stress Tolerance Index (Fig. 6B). In Figure 10, we repeated the experiment including all five genotypes, which supported our original observation that the sr3g mutant exhibited a trend toward reduced lateral root number under 75 mM NaCl compared to Col-0, although this difference was not significant (Fig. 10B). Additionally, we confirmed that the wrky75 mutant showed a significant reduction in main root growth under salt stress compared to Col-0, consistent with findings reported in The Plant Cell by Lu et al. 2023. For both main root length and lateral root number, we demonstrated that the double mutants of wrky75/sr3g displayed growth comparable to wild-type Col-0. This result suggests that the sr3g mutation compensates for the salt sensitivity of the wrky75 mutant.

      We completely agree with the reviewer that there is a variation in our results regarding the sr3g phenotype under control conditions, as presented in Fig. 6A/Fig. S17 and Fig. 10A-C. In Fig. 6A/Fig. S17, we did not observe any consistent trends in main root or lateral root length for the sr3g mutant compared to Col-0 under control conditions. However, in Fig. 10A-C, we observed a significant reduction in main root length, lateral root number, and lateral root length for the sr3g mutant under control conditions. We believe this may align with SR3G’s role as a negative regulator of salt stress responses. While loss of this gene benefits plants in coping with salt stress, it might negatively impact overall plant growth under non-stress conditions. This interpretation is further supported by our findings on the root suberization pattern in sr3g mutants under control conditions (Fig. 8B), where increased suberization in root sections 1 to 3, compared to Col-0, could inhibit root growth. While SR3G's role in overall plant fitness is intriguing, it is beyond the scope of this study. We cannot rule out the possibility that SR3G contributes positively to plant growth, particularly root growth. That said, we observed no differences in shoot growth between Col-0 and the sr3g mutant under control conditions (Fig. 7). Additionally, we calculated the Stress Tolerance Index for all aspects of root growth shown in Fig. 10 and presented it in Fig. S25.

      To address the reviewer request on rephrasing the lines 680-681 from"SR3G does not play a role in plant development under non-stress conditions" (lines 680-681) statement, this statement is found in lines 652-653 and corresponds to Fig. 7, where we evaluated rosette growth in the WT and sr3g mutant under both control and salt stress conditions. We did not observe any significant differences or even trends between the two genotypes under control conditions, confirming the accuracy of the statement. To clarify further, we have added “SR3G does not play a role in rosette growth and development under non-stress conditions”.

      (2) I agree with the authors that detecting expression differences in lowly expressed genes can be challenging. However, as demonstrated in the reference provided (Lu et al., 2023), a significant reduction in WRKY75 expression is observed in T-DNA insertion mutant alleles of WRKY75. In contrast, Fig. 9B in the current manuscript shows no reduction in WRKY75 expression in the two mutant alleles selected by the authors, which suggests that these alleles cannot be classified as loss-of-function mutants (line 745). Additionally, the authors note that the wrky75 mutant exhibits reduced main root length under salt stress, consistent with the phenotype reported by Lu et al. (2023). However, other phenotypic discrepancies exist between the two studies. For example, 1) Lu et al. (2023) report that w¬rky75 root length is comparable to WT under control conditions, whereas the current manuscript shows that wrky75 root growth is significantly lower than WT; 2) under salt stress, Lu et al. (2023) show that wrky75 accumulates higher levels of Na+, whereas the current study finds Na+ levels in wrky75 indistinguishable from WT. To confirm the loss of WRKY75 function in these T-DNA insertion alleles the authors should provide additional evidence (e.g., Western blot analysis).

      We sincerely appreciate the reviewer acknowledging the challenge of detecting expression differences in lowly expressed genes, such as transcription factors. Transcription factors are typically expressed at lower levels compared to structural or enzymatic proteins, as they function as regulators where small quantities can have substantial effects on downstream gene expression.

      That said, we respectfully disagree with the reviewer’s interpretation that there is no reduction in WRKY75 expression in the two mutant lines tested in Fig. 9C. Among the two independent alleles examined, wrky75-3 showed a clear reduction in expression compared to WT Col-0 under both control and salt stress conditions. Using the Tukey test to compare all groups, we observed distinct changes in the assigned significance letters for each case:

      Col/root/control (cd) vs wrky75-3/root/control (cd): Although the same significance letter was assigned, we still observed a clear reduction in WRKY75 transcript abundance. More importantly, the variation in expression is notably lower compared to Col-0.

      Col/shoot/control (bcd) vs wrky75-3/shoot/control (a): This is significant reduction compared to Col

      Col/root/salt (cd) vs wrky75-3/root/salt (bcd): Once again, the reduction in WRKY75 transcript levels corresponds to changes in the assigned significance letters.

      Col/shoot/salt (bc) vs wrky75-3/shoot/salt (ab): Once again, the reduction in WRKY75 transcript levels corresponds to changes in the assigned significance letters.

      To address the reviewer’s comment regarding the significant reduction in WRKY75 expression observed in T-DNA insertion mutant alleles of WRKY75 in the reference by Lu et al., 2023, we would like to draw the reviewer’s attention to the following points:

      a) Different alleles: The authors in The Plant Cell used different alleles than those used in our study, with one of their alleles targeting regions upstream of the WRKY75 gene. While we identified one of their described alleles (WRKY75-1, SALK_101367) on the T-DNA express website, which targets upstream of WRKY75, the other allele (wrky75-25) appears to have been generated through a different mechanism (possibly an RNAi line) that is not defined in the Plant Cell paper and does not appear on the T-DNA express website. The authors mentioned they have received these seeds as gifts from other labs in the acknowledgement ”We thank Prof. Hongwei Guo (Southern University of Science and Technology, China) and Prof. Diqiu Yu (Yunnan University, China) for kindly providing the WRKY75<sub>pro</sub>:GUS, 35S<sub>pro</sub>:WRKY75-GFP, wrky75-1, and wrky75-25 seeds. We thank Man-cang Zhang (Electrophysiology platform, Henan University) for performing the NMT experiment”.

      However, in our study, we selected two different T-DNAs that target the coding regions. While this may explain slight differences in the observed responses, both studies independently link WRKY75 to salt stress, regardless of the alleles used. For your reference, we have included a screenshot of the different alleles used.

      Author response image 1.

      b) Different developmental stages: They measured WRKY75 expression in 5-day-old seedlings. In our experiment, we used seedlings grown on 1/2x MS for 4 days, followed by transfer to treatment plates with or without 75 mM NaCl for one week. As a result, we analyzed older plants (12 days old) for gene expression analysis. Despite the difference in developmental stage, we were still able to observe a reduction in gene expression.

      c) Different tissues: The authors of The Plant Cell used whole seedlings for gene expression analysis, whereas we separated the roots and shoots and measured gene expression in each tissue type individually. This approach is logical, as WRKY75 is a root cell-specific transcription factor with higher expression in the roots compared to the shoots, as demonstrated in our analysis (Fig. 9C).

      Based on the reasoning above, we did work with loss-of-function mutants of WRKY75, particularly wrky75-3. To more accurately reflect the nature of the mutation, we have changed the term "loss-of-function" to "knock-down" in line 717.

      The reviewer mentioned phenotypic discrepancies between the two studies. We agree that there are some differences, particularly in the magnitude of responses or expression levels. However, despite variations in the alleles used, developmental stages, and tissue types, both studies reached the same conclusion: WRKY75 is involved in the salt stress response and acts as a positive regulator. We have discussed the differences between our study and The Plant Cell in the section above, summarizing them into three main points: different alleles, different developmental stages, and different tissue types.

      To address the reviewer’s comment regarding "Lu et al. (2023) report that wrky75 root length is comparable to WT under control conditions, whereas the current manuscript shows that wrky75 root growth is significantly lower than WT": We evaluated root growth differently than The Plant Cell study. In The Plant Cell (Fig. 5, H-J), root elongation was measured in 10-day-old plants with a single time point measurement. They transferred five-day-old wild-type, wrky75-1, wrky75-25, and WRKY75-OE plants to 1/2× MS medium supplemented with 0 mM or 125 mM NaCl for further growth and photographed them 5 days after transfer. In contrast, our study used 4-day-old seedlings, which were transferred to 1/2 MS with or without 0, 75, or 125 mM salt for additional growth (9 days). Rather than measuring root growth only at the end, we scanned the roots every other day, up to five times, to assess root growth rates. Essentially, the precision of our method is higher as we captured growth changes throughout the developmental process, compared to the approach used in The Plant Cell. We do not underestimate the significance of the work conducted by other colleagues in the field, but we also recognize that each laboratory has its own approach and specific practices. This variation in experimental setup is intrinsic to biology, and we believe it is important to study biological phenomena in different ways. Especially as the common or contrasting conclusions reached by different studies, performed by different labs and using different experimental setups are shedding more light on reproducibility and gene contribution across different conditions, which is intrinsic to phenotypic plasticity, and GxE interactions.

      The Plant Cell used a very high salt concentration, starting at 125 mM, while we were more cautious in our approach, as such a high concentration can inhibit and obscure more subtle phenotypic changes.

      To address the reviewer’s comment on "Lu et al. (2023) show that wrky75 accumulates higher levels of Na+, whereas the current study finds Na+ levels in wrky75 indistinguishable from WT," we would like to highlight the differences in the methodologies used in both studies. The Plant Cell measured Na+ accumulation in the wrky75 mutant using xylem sap (Supplemental Figure S10), which appears to be a convenient and practical approach in their laboratory. In their experiment, wild-type and wrky75 mutant plants were grown in soil for 3 weeks, watered with either a mock solution or 100 mM NaCl solution for 1 day, and then xylem sap was collected for Na+ content analysis. In contrast, our study employed a different method to measure Na+ and K+ ion content, using Inductively Coupled Plasma Atomic Emission Spectroscopy (ICP-AES) for root and shoot Na+ and K+ measurements. Additionally, we collected samples after two weeks on treatment plates and focused on the Na+/K+ ratio, which we consider more relevant than net Na+ or K+ levels, as the ratio of these ions is a critical determinant of plant salt tolerance. With this in mind, we observed a considerable non-significant increase in the Na+/K+ ratio in the shoots of the wrky75-3 mutant (assigned Tukey’s letter c) compared to the Col-0 WT (assigned Tukey’s letters abc) under 125 mM salt, suggesting that this mutant is salt-sensitive. Importantly, the Na+/K+ ratio in the double wrky75/sr3g mutants was reduced to the WT level under the same salt conditions, further indicating that the salt sensitivity of wrky75 is mitigated by the sr3g mutation.

      Based on the reasons mentioned above, we believe that conducting additional experiments, such as Western blot analysis, is unnecessary and would not contribute new insights or alter the context of our findings.

      Reviewer #2 (Public review):

      Summary:

      Salt stress is a significant and growing concern for agriculture in some parts of the world. While the effects of sodium excess have been studied in Arabidopsis and (many) crop species, most studies have focused on Na uptake, toxicity and overall effects on yield, rather than on developmental responses to excess Na, per se. The work by Ishka and colleagues aims to fill this gap.

      Working from an existing dataset that exposed a diverse panel of A. thaliana accessions to control, moderate, and severe salt stress, the authors identify candidate loci associated with altering the root:shoot ratio under salt stress. Following a series of molecular assays, they characterize a DUF247 protein which they dub SR3G, which appears to be a negative regulator of root growth under salt stress.

      Overall, this is a well-executed study which demonstrates the functional role played by a single gene in plant response to salt stress in Arabidopsis.

      Review of revised manuscript:

      The authors have addressed my point-by-point comments to my satisfaction. In the cases where they have changed their manuscript language, clarified figures, or added analyses I have no further comment. In some cases, there is a fruitful back-and-forth discussion of methodology which I think will be of interest to readers.

      I have nothing to add during this round of review. I think that the paper and associated discussion will make a nice contribution to the field.

      We sincerely appreciate the reviewer’s recognition of the significance of our work to the field.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Lines 518-519: The statement that other DUF247s exhibit similar expression patterns to SR3G, suggesting their responsiveness to salt stress, is not fully supported by Fig. S14. Please clarify the specific similarities (and differences) in the expression patterns of the DUF247s shown in Fig. S14, as their expression appears to be spatially and temporally diverse. Additionally, the scale is missing in Fig. S14.

      We thank the reviewer. We fixed the text and added expression scales to Figure S14.

      Line 684, Fig. 6A should be 7A.

      Thanks. It is fixed.

      Line 686, Fig. 7A should be 7B.

      Thanks. It is fixed.

      Lines 721-723: The signal quantification in Fig. 8B does not support the claim that "in section one,..., sr3g-5 showed more suberization compared to Col-0." Given the variability and noise often associated with histological dyes such as Fluorol Yellow staining, conclusions should be cautiously grounded in robust signal quantification. Additionally, please specify the number of biological replicates used in both Fig. 8B and C.

      We thank the reviewer for their comments. We believe the statement in the text accurately reflects our results presented in Figure 8B, where we stated “non-significant, but substantially higher levels of root suberization in sr3g-5 compared to Col-0 in sections one to three of the root under control condition (Fig. 8B).” Therefore, we kept the statement and have included the number of biological replicates in the figure legend.

      Lines 731-732: Please provide a more detailed explanation of how the significant changes in suberin monomer levels align with the Fluorol Yellow staining results, and clarify how these findings support the proposed negative role of SR3G in root suberization.

      Fluorol Yellow is a lipophilic dye widely used to label suberin in plant tissues, specifically in roots in this study. Given the inherent variability in histological assays, we confirmed the increase in suberization using an alternative method, Gas Chromatography–Mass Spectrometry (GC-MS). Both approaches revealed elevated suberin levels in the sr3g mutant compared to Col-0. Since the overall suberin content was higher in the mutant under both control and salt stress conditions, we proposed that SR3G acts as a negative regulator of root suberization.

      Lines 686-688 and Figure S24: The authors calculated water mass as FW-DW. A more standard approach for calculating water content is (FW-DW)/FW x 100. Please update the text or adjust the calculation accordingly. Additionally, if the goal is to test differences between WT and the mutant within each condition, a t-test would be a more appropriate statistical method.

      We thank the reviewer. We added water content % to the figure S24. We kept the statistical test as it is as we wanted to be able to observe changes across conditions and genotypes.

      Lines 633-635 states that "No significant difference was observed between sr3g-4 and Col-0 (Fig. S18), except for the Stress Tolerance Index (STI) calculated using growth rates of lateral root length and number." However, based on the Figure S18 legend and statistical analysis (i.e., ns), it appears that the sr3g-4 mutant shows no alterations in root system architecture compared to Col-0. Please revise the text to accurately reflect the results of the statistical analysis.

      We thank the reviewer. We now fixed the text to reflect the result.

      Lines 698-707: The statistical analysis does not support the reported differences in the Na+/K+ ratio for the single and double mutants of sr3g-5 and wrky75-3 (Fig. 10D, where levels connected by the same letters indicate they are not significantly different). Furthermore, the conclusion that "the SR3G mutation indeed compensated for the increased Na+ accumulation observed in the wrky75 mutant under salt stress" is also based on non-significant differences (Fig. S25B). Please revise the text to accurately reflect the results of the statistical analysis. Additionally, since each mutant is compared to the WT, I recommend using Dunnett's test for statistical analysis.

      We thank the reviewer for their feedback. We have carefully revised the text to better support our findings. As previously mentioned, variations among samples are evident and are well-reflected across all our datasets. We have presented all data and focused on identifying trends within our samples to guide interpretation.

      We observed that the SR3G mutation effectively compensated for the increased Na+ accumulation observed in the wrky75 mutant under salt stress. A closer examination of the shoot Na+/K+ ratio under 125 mM salt shows that the wrky75 single mutant has a higher Na+/K+ ratio (indicated by the letter "c") compared to Col-0 (indicated by "abc") and the two double mutants (also indicated by "abc"). Therefore, we have retained the statistical analysis as originally conducted, and maintain our conclusions as is.

      Figure 6: data in panel C present the Na/K ratio, not Na+ content. Based on the statistical analysis of root Na+ levels presented in Fig. S17C, there is no significant difference between sr3g-5 and WT. Please update the title of Fig. 6. In addition, in panel A, the title of the Y-axis and figure legend should be "Lateral root growth rate" without the word length, and in panel C, the statistical analysis is missing.

      We thank the reviewer. We updated Fig. 6 title and fixed the Y-axis in panel A, and added statistical letters to panel C. Legend was updated to reflect the changes.

      Figure 7: Please clearly label the time points where significant differences between genotypes are observed for both early and late salt treatments. Was there a significant difference recorded between WT and sr3g-5 on day 0 under early salt stress? Such differences may arise from initial variations in plant size within this experiment, as indicated by Fig. 7B, where significant differences in rosette area are evident starting from day 0. Additionally, please indicate the statistical analysis in panel E.

      We thank the reviewer for this suggestion. We updated the figure with a statistical test added to the panel E. Although the difference between sr3g mutant and Col-0 is indeed significant in its growth rate at day 0, we would like to draw the attention of the reviewer that this growth rate was calculated over the 24 hours after adding salt stress. Therefore, this difference in growth rate is related to exposure to salt stress. Moreover, the growth rate between Col-0 and sr3g mutant does not differ in two other treatments (Control and Late Salt Stress) further supporting the conclusion that sr3g is affecting rosette size and growth rate only under early salt stress conditions.

      We have also added the Salt Tolerance Index calculation to Figure S24 as additional evidence, controlling for potential differences in size between Col-0 and sr3g mutant.

      Figure S17: statistical analysis is not indicated in panels A, B, and D.

      We thank the reviewer for spotting that. We updated the figure with a statistical test.

      Figures S21-23: The quality of these figures is insufficient, hindering the ability to effectively interpret the authors' results and main message. Furthermore, a Dunnett's test, rather than a t-test, is the appropriate statistical method for this analysis.

      We thank the reviewer for this observation. We have now added a high resolution figures for all supplemental figures, which should increase the resolution of the figures. As we are comparing all of the genotypes to Col-0 one-by-one - the results of individual t-tests are sufficient for this analysis.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In their paper, Kang et al. investigate rigidity sensing in amoeboid cells, showing that, despite their lack of proper focal adhesions, amoeboid migration of single cells is impacted by substrate rigidity. In fact, many different amoeboid cell types can durotax, meaning that they preferentially move towards the stiffer side of a rigidity gradient. 

      The authors observed that NMIIA is required for durotaxis and, buiding on this observation, they generated a model to explain how durotaxis could be achieved in the absence of strong adhesions. According to the model, substrate stiffness alters the diffusion rate of NMAII, with softer substrates allowing for faster diffusion. This allows for NMAII accumulation at the back, which, in turn, results in durotaxis. 

      The authors responded to all my comments and I have nothing to add. The evidence provided for durotaxis of non adherent (or low-adhering) cells is strong. I am particularly impressed by the fact that amoeboid cells can durotax even when not confined. I wish to congratulate the authors for the excellent work, which will fuel discussion in the field of cell adhesion and migration.

      We thank the reviewer for critically evaluating our work and giving kind suggestions. We are glad that the reviewer found our work to be of potential interest to the broad scientific community.

      Reviewer #2 (Public Review):

      Summary:

      The authors developed an imaging-based device that provides both spatialconfinement and stiffness gradient to investigate if and how amoeboid cells, including T cells, neutrophils, and Dictyostelium, can durotax. Furthermore, the authors showed that the mechanism for the directional migration of T cells and neutrophils depends on non-muscle myosin IIA (NMIIA) polarized towards the soft-matrix-side. Finally, they developed a mathematical model of an active gel that captures the behavior of the cells described in vitro.

      Strengths:

      The topic is intriguing as durotaxis is essentially thought to be a direct consequence of mechanosensing at focal adhesions. To the best of my knowledge, this is the first report on amoeboid cells that do not depend on FAs to exert durotaxis. The authors developed an imaging-based durotaxis device that provides both spatial confinement and stiffness gradient and they also utilized several techniques such as quantitative fluorescent speckle microscopy and expansion microscopy. The results of this study have well-designed control experiments and are therefore convincing.

      Weaknesses:

      Overall this study is well performed but there are still some minor issues I recommend the authors address:

      (1) When using NMIIA/NMIIB knockdown cell lines to distinguish the role of NMIIA and NMIIB in amoeboid durotaxis, it would be better if the authors took compensatory effects into account.

      We thank the reviewer for this suggestion. We have investigated the compensation of myosin in NMIIA and NMIIB KD HL-60 cells using Western blot and added this result in our updated manuscript (Fig. S4B, C). The results showed that the level of NMIIB protein in NMIIA KD cells doubled while there was no compensatory upregulation of NMIIA in NMIIB KD cells. This is consistent with our conclusion that NMIIA rather than NMIIB is responsible for amoeboid durotaxis since in NMIIA KD cells, compensatory upregulation of NMIIB did not rescue the durotaxis-deficient phenotype. 

      (2) The expansion microscopy assay is not clearly described and some details are missed such as how the assay is performed on cells under confinement.

      We thank the reviewer for this comment. We have updated details of the expansion microscopy assay in our revised manuscript in line 481-485 including how the assay is performed on cells under confinement:

      Briefly, CD4+ Naïve T cells were seeded on a gradient PA gel with another upper gel providing confinement. 4% PFA was used to fix cells for 15 min at room temperature. After fixation, the upper gradient PA gel is carefully removed and the bottom gradient PA gel with seeded cells were immersed in an anchoring solution containing 1% acrylamide and 0.7% formaldehyde (Sigma, F8775) for 5 h at 37 °C.

      (3) In this study, an active gel model was employed to capture experimental observations. Previously, some active nematic models were also considered to describe cell migration, which is controlled by filament contraction. I suggest the authors provide a short discussion on the comparison between the present theory and those prior models.

      We thank the reviewer for this suggestion. Active nematic models have been employed to recapitulate many phenomena during cell migration (Nat Commun., 2018, doi: 10.1038/s41467-018-05666-8.). The active nematic model describes the motion of cells using the orientation field, Q, and the velocity field, u. The director field n with (n = −n) is employed to represent the nematic state, which has head-tail symmetry. However, in our experiments, actin filaments are obviously polarized, which polymerize and flow towards the direction of cell migration. Therefore, we choose active gel model which describes polarized actin field during cell migration. In the discussion part, we have provided the comparison between active gel model and motor-clutch model. We have also supplemented a short discussion between the present model and active nematic model in the main text of line 345-347:

      The active nematic model employs active extensile or contractile agents to push or pull the fluid along their elongation axis to simulate cells flowing (61). 

      (4) In the present model, actin flow contributes to cell migration while myosin distribution determines cell polarity. How does this model couple actin and myosin together?

      We thank the reviewer for this question. In our model, the polarization field is employed to couple actin and myosin together. It is obvious that actin accumulate at the front while myosin diffuses in the opposite direction. Therefore, we propose that actin and myosin flow towards the opposite direction, which is captured in the convection term of actin ) and myosin () density field.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Recommendations for the authors:

      (1) Substantial revision of the claims and interpretation of the results is needed, especially in the setting of additional data showing enhanced erythrophagocytosis with decreased RBC lifespan.

      Thank you for your valuable feedback and suggestion for a substantial revision of the claims and interpretation of our results. We acknowledge the importance of considering additional data that shows enhanced erythrophagocytosis with decreased RBC lifespan. In response, we have revised our manuscript and incorporated additional experimental data to support and clarify our findings.

      (1) In our original manuscript, we reported a decrease in the number of splenic red pulp macrophages (RPMs) and phagocytic erythrocytes after hypobaric hypoxia (HH) exposure. This conclusion was primarily based on our observations of reduced phagocytosis in the spleen.

      (2) Additional experimental data on RBC labeling and erythrophagocytosis:

      • Experiment 1 (RBC labeling and HH exposure)

      We conducted an experiment where RBCs from mice were labeled with PKH67 and injected back into the mice. These mice were then exposed to normal normoxia (NN) or HH for 7 or 14 days. The subsequent assessment of RPMs in the spleen using flow cytometry and immunofluorescence detection revealed a significant decrease in both the population of splenic RPMs (F4/80hiCD11blo, new Figure 5A and C) and PKH67-positive macrophages after HH exposure (as depicted in new Figure 5A and C-E). This finding supports our original claim of reduced phagocytosis under HH conditions.

      Author response image 1.

      -Experiment 2 (erythrophagocytosis enhancement)

      To examine the effects of enhanced erythrophagocytosis, we injected Tuftsin after administering PKH67-labelled RBCs. Our observations showed a significant decrease in PKH67 fluorescence in the spleen, particularly after Tuftsin injection compared to the NN group. This result suggests a reduction in RBC lifespan when erythrophagocytosis is enhanced (illustrated in new Figure 7, A-B).

      Author response image 2.

      (3) Revised conclusions:

      • The additional data from these experiments support our original findings by providing a more comprehensive view of the impact of HH exposure on splenic erythrophagocytosis.

      • The decrease in phagocytic RPMs and phagocytic erythrocytes after HH exposure, along with the observed decrease in RBC lifespan following enhanced erythrophagocytosis, collectively suggest a more complex interplay between hypoxia, erythrophagocytosis, and RBC lifespan than initially interpreted.

      We think that these revisions and additional experimental data provide a more robust and detailed understanding of the effects of HH on splenic erythrophagocytosis and RBCs lifespan. We hope that these changes adequately address the concerns raised and strengthen the conclusions drawn in our manuscript.

      (2) F4/80 high; CD11b low are true RPMs which the cells which the authors are presenting, i.e. splenic monocytes / pre-RPMs. To discuss RPM function requires the presentation of these cells specifically rather than general cells in the proper area of the spleen.

      Thank you for your feedback requesting a substantial revision of our claims and interpretation, particularly considering additional data showing enhanced erythrophagocytosis with decreased RBC lifespan. In response, we have thoroughly revised our manuscript and included new experimental data that further elucidate the effects of HH on RPMs and erythrophagocytosis.

      (1) Re-evaluation of RPMs population after HH exposure:

      • Flow cytometry analysis (new Figure 3G, Figure 5A and B): We revisited the analysis of RPMs (F4/80hiCD11blo) in the spleen after 7 and 14 days of HH exposure. Our revised flow cytometry data consistently showed a significant decrease in the RPMs population post-HH exposure, reinforcing our initial findings.

      Author response image 3.

      Author response image 4.

      • In situ expression of RPMs (Figure S1, A-D):

      We further confirmed the decreased population of RPMs through in situ co-staining with F4/80 and CD11b, and F4/80 and CD68, in spleen tissues. These results clearly demonstrated a significant reduction in F4/80hiCD11blo (Figure S1, A and B) and F4/80hiCD68hi (Figure S1, C and D) cells following HH exposure.

      Author response image 5.

      (2) Single-cell sequencing analysis of splenic RPMs:

      • We conducted a single-cell sequencing analysis of spleen samples post 7 days of HH exposure (Figure S2, A-C). This analysis revealed a notable shift in the distribution of RPMs, predominantly associated with Cluster 0 under NN conditions, to a reduced presence in this cluster after HH exposure.

      • Pseudo-time series analysis indicated a transition pattern change in spleen RPMs, with a shift from Cluster 2 and Cluster 1 towards Cluster 0 under NN conditions, and a reverse transition following HH exposure (Figure S2, B and D). This finding implies a decrease in resident RPMs in the spleen under HH conditions.

      (3) Consolidated findings and revised interpretation:

      • The comprehensive analysis of flow cytometry, in situ staining, and single-cell sequencing data consistently indicates a significant reduction in the number of RPMs following HH exposure.

      • These findings, taken together, strongly support the revised conclusion that HH exposure leads to a decrease in RPMs in the spleen, which in turn may affect erythrophagocytosis and RBC lifespan.

      Author response image 6.

      In conclusion, our revised manuscript now includes additional experimental data and analyses, strengthening our claims and providing a more nuanced interpretation of the impact of HH on spleen RPMs and related erythrophagocytosis processes. We believe these revisions and additional data address your concerns and enhance the scientific validity of our study.

      (3) RBC retention in the spleen should be measured anyway quantitatively, eg, with proper flow cytometry, to determine whether it is increased or decreased.

      Thank you for your query regarding the quantitative measurement of RBC retention in the spleen, particularly in relation to HH exposure. We have utilized a combination of techniques, including flow cytometry and histological staining, to investigate this aspect comprehensively. Below is a summary of our findings and methodology.

      (1) Flow cytometry analysis of labeled RBCs:

      • Our study employed both NHS-biotin (new Figure 4, A-D) and PKH67 labeling (new Figure 4, E-H) to track RBCs in mice exposed to HH. Flow cytometry results from these experiments (new Figure 4, A-H) showed a decrease in the proportion of labeled RBCs over time, both in the blood and spleen. Notably, there was a significantly greater reduction in the amplitude of fluorescently labeled RBCs after NN exposure compared to the reduced amplitude of fluorescently labeled RBCs observed in blood and spleen under HH exposure. The observed decrease in labeled RBCs was initially counterintuitive, as we expected an increase in RBC retention due to reduced erythrophagocytosis. However, this decrease can be attributed to the significantly increased production of RBCs following HH exposure, diluting the proportion of labeled cells.

      • Specifically, for blood, the biotin-labeled RBCs decreased by 12.06% under NN exposure and by 7.82% under HH exposure, while the PKH67-labeled RBCs decreased by 9.70% under NN exposure and by 4.09% under HH exposure. For spleen, the biotin-labeled RBCs decreased by 3.13% under NN exposure and by 0.46% under HH exposure, while the PKH67-labeled RBCs decreased by 1.16% under NN exposure and by 0.92% under HH exposure. These findings suggest that HH exposure leads to a decrease in the clearance rate of RBCs.

      Author response image 7.

      (2) Detection of erythrophagocytosis in spleen:

      To assess erythrophagocytosis directly, we labeled RBCs with PKH67 and analyzed their uptake by splenic macrophages (F4/80hi) after HH exposure. Our findings (new Figure 5, D-E) indicated a decrease in PKH67-positive macrophages in the spleen, suggesting reduced erythrophagocytosis.

      Author response image 8.

      (3) Flow cytometry analysis of RBC retention:

      Our flow cytometry analysis revealed a decrease in PKH67-positive RBCs in both blood and spleen (Figure S4). We postulated that this was due to increased RBC production after HH exposure. However, this method might not accurately reflect RBC retention, as it measures the proportion of PKH67-labeled RBCs relative to the total number of RBCs, which increased after HH exposure.

      Author response image 9.

      (4) Histological and immunostaining analysis:

      Histological examination using HE staining and band3 immunostaining in situ (new Figure 6, A-D, and G-H) revealed a significant increase in RBC numbers in the spleen after HH exposure. This was further confirmed by detecting retained RBCs in splenic single cells using Wright-Giemsa composite stain (new Figure 6, E and F) and retained PKH67-labelled RBCs in spleen (new Figure 6, I and J).

      Author response image 10.

      (5) Interpreting the data:

      The comprehensive analysis suggests a complex interplay between increased RBC production and decreased erythrophagocytosis in the spleen following HH exposure. While flow cytometry indicated a decrease in the proportion of labeled RBCs, histological and immunostaining analyses demonstrated an actual increase in RBCs retention in the spleen. These findings collectively suggest that while the overall RBCs production is upregulated following HH exposure, the spleen's capacity for erythrophagocytosis is concurrently diminished, leading to increased RBCs retention.

      (6) Conclusion:

      Taken together, our results indicate a significant increase in RBCs retention in the spleen post-HH exposure, likely due to reduced residual RPMs and erythrophagocytosis. This conclusion is supported by a combination of flow cytometry, histological staining, and immunostaining techniques, providing a comprehensive view of RBC dynamics under HH conditions. We think these findings offer a clear quantitative measure of RBC retention in the spleen, addressing the concerns raised in your question.

      (4) Numerous other methodological problems as listed below.

      We appreciate your question, which highlights the importance of using multiple analytical approaches to understand complex physiological processes. Please find below our point-by-point response to the methodological comments.

      Reviewer #1 (Recommendations For The Authors):

      (1) Decreased BM and spleen monocytes d/t increased liver monocyte migration is unclear. there is no evidence that this happens or why it would be a reasonable hypothesis, even in splenectomized mice.

      Thank you for highlighting the need for further clarification and justification of our hypothesized decrease in BM and spleen monocytes due to increased monocyte migration to the liver, particularly in the context of splenectomized mice. Indeed, our study has not explicitly verified an augmentation in mononuclear cell migration to the liver in splenectomized mice.

      Nonetheless, our investigations have revealed a notable increase in monocyte migration to the liver after HH exposure. Noteworthy is our discovery of a significant upregulation in colony stimulating factor-1 (CSF-1) expression in the liver, observed after both 7 and 14 days of HH exposure (data not included). This observation was substantiated through flow cytometry analysis (as depicted in Figure S4), which affirmed an enhanced migration of monocytes to the liver. Specifically, we noted a considerable increase in the population of transient macrophages, monocytes, and Kupffer cells in the liver following HH exposure.

      Author response image 11.

      Considering these findings, we hypothesize that hypoxic conditions may activate a compensatory mechanism that directs monocytes towards the liver, potentially linked to the liver’s integral role in the systemic immune response. In accordance with these insights, we intend to revise our manuscript to reflect the speculative nature of this hypothesis more accurately, and to delineate the strategies we propose for its further empirical investigation. This amendment ensures that our hypothesis is presented with full consideration of its speculative basis, supported by a coherent framework for future validation.

      (2) While F4/80+CD11b+ population is decreased, this is mainly driven by CD11b and F4/80+ alone population is significantly increased. This is counter to the hypothesis.

      Thank you for addressing the apparent discrepancy in our findings concerning the F4/80+CD11b+ population and the increase in the F4/80+ alone population, which seems to contradict our initial hypothesis. Your observation is indeed crucial for the integrity of our study, and we appreciate the opportunity to clarify this matter.

      (1) Clarification of flow cytometry results:

      • In response to the concerns raised, we revisited our flow cytometry experiments with a focus on more clearly distinguishing the cell populations. Our initial graph had some ambiguities in cell grouping, which might have led to misinterpretations.

      • The revised flow cytometry analysis, specifically aimed at identifying red pulp macrophages (RPMs) characterized as F4/80hiCD11blo in the spleen, demonstrated a significant decrease in the F4/80 population. This finding is now in alignment with our immunofluorescence results.

      Author response image 12.

      Author response image 13.

      (2) Revised data and interpretation:

      • The results presented in new Figure 3G and Figure 5 (A and B) consistently indicate a notable reduction in the RPMs population following HH exposure. This supports our revised understanding that HH exposure leads to a decrease in the specific macrophage subset (F4/80hiCD11blo) in the spleen.

      We’ve updated our manuscript to reflect these new findings and interpretations. The revised manuscript details the revised flow cytometry analysis and discusses the potential mechanisms behind the observed changes in macrophage populations.

      (3) HO-1 expression cannot be used as a surrogate to quantify number of macrophages as the expression per cell can decrease and give the same results. In addition, the localization of effect to the red pulp is not equivalent to an assertion that the conclusion applies to macrophages given the heterogeneity of this part of the organ and the spleen in general.

      Thank you for your insightful comments regarding the use of HO-1 expression as a surrogate marker for quantifying macrophage numbers, and for pointing out the complexity of attributing changes in HO-1 expression specifically to macrophages in the splenic red pulp. Your observations are indeed valid and warrant a detailed response.

      (1) Role of HO-1 in macrophage activity:

      • In our study, HO-1 expression was not utilized as a direct marker for quantifying macrophages. Instead, it was considered an indicator of macrophage activity, particularly in relation to erythrophagocytosis. HO-1, being upregulated in response to erythrophagocytosis, serves as an indirect marker of this process within splenic macrophages.

      • The rationale behind this approach was that increased HO-1 expression, induced by erythrophagocytosis in the spleen’s red pulp, could suggest an augmentation in the activity of splenic macrophages involved in this process.

      (2) Limitations of using HO-1 as an indicator:

      • We acknowledge your point that HO-1 expression per cell might decrease, potentially leading to misleading interpretations if used as a direct quantifier of macrophage numbers. The variability in HO-1 expression per cell indeed presents a limitation in using it as a sole indicator of macrophage quantity.

      • Furthermore, your observation about the heterogeneity of the spleen, particularly the red pulp, is crucial. The red pulp is a complex environment with various cell types, and asserting that changes in HO-1 expression are exclusive to macrophages could oversimplify this complexity.

      (3) Addressing the concerns:

      • To address these concerns, we propose to supplement our HO-1 expression data with additional specific markers for macrophages. This would help in correlating HO-1 expression more accurately with macrophage numbers and activity.

      • We also plan to conduct further studies to delineate the specific cell types in the red pulp contributing to HO-1 expression. This could involve techniques such as immunofluorescence or immunohistochemistry, which would allow us to localize HO-1 expression to specific cell populations within the splenic red pulp.

      We’ve revised our manuscript to clarify the role of HO-1 expression as an indirect marker of erythrophagocytosis and to acknowledge its limitations as a surrogate for quantifying macrophage numbers.

      (4) line 63-65 is inaccurate as red cell homeostasis reaches a new steady state in chronic hypoxia.

      Thank you for pointing out the inaccuracy in lines 63-65 of our manuscript regarding red cell homeostasis in chronic hypoxia. Your feedback is invaluable in ensuring the accuracy and scientific integrity of our work. We’ve revised lines 63-65 to accurately reflect the understanding.

      (5) Eryptosis is not defined in the manuscript.

      Thank you for highlighting the omission of a definition for eryptosis in our manuscript. We acknowledge the significance of precisely defining such key terminologies, particularly when they play a crucial role in the context of our research findings. Eryptosis, a term referenced in our study, is a specialized form of programmed cell death unique to erythrocytes. Similar with apoptosis in other cell types, eryptosis is characterized by distinct physiological changes including cell shrinkage, membrane blebbing, and the externalization of phosphatidylserine on the erythrocyte surface. These features are indicative of the RBCs lifecycle and its regulated destruction process.

      However, it is pertinent to note that our current study does not extensively delve into the mechanisms or implications of eryptosis. Our primary focus has been to elucidate the effects of HH exposure on the processes of splenic erythrophagocytosis and the resultant impact on the lifespan of RBCs. Given this focus, and to maintain the coherence and relevance of our manuscript, we have decided to exclude specific discussions of eryptosis from our revised manuscript. This decision aligns with our aim to provide a clear and concentrated exploration of the influence of HH exposure on RBCs dynamics and splenic function.

      We appreciate your input, which has significantly contributed to enhancing the clarity and accuracy of our manuscript. The revision ensures that our research is presented with a focused scope, aligning closely with our experimental investigations and findings.

      (6) Physiologically, there is no evidence that there is any "free iron" in cells, making line 89 point inaccurate.

      Thank you for highlighting the concern regarding the reference to "free iron" in cells in line 89 of our manuscript. The term "free iron" in our manuscript was intended to refer to divalent iron (Fe2+), rather than unbound iron ions freely circulating within cells. We acknowledge that the term "free iron" might lead to misconceptions, as it implies the presence of unchelated iron, which is not physiologically common due to the potential for oxidative damage. To rectify this and provide clarity, we’ve revised line 89 of our manuscript to reflect our meaning more accurately. Instead of "free iron," we use "divalent iron (Fe2+)" to avoid any misunderstanding regarding the state of iron in cells. We also ensure that any implications drawn from the presence of Fe2+ in cells are consistent with current scientific literature and understanding.

      (7) Fig 1f no stats

      We appreciate your critical review and suggestions, which help in improving the accuracy and clarity of our research. We’ve revised statistic diagram of new Figure 1F.

      (8) Splenectomy experiments demonstrate that erythrophagocytosis is almost completely replaced by functional macrophages in other tissues (likely Kupffer cells in the liver). there is only a minor defect and no data on whether it is in fact the liver or other organs that provide this replacement function and makes the assertions in lines 345-349 significantly overstated.

      Thank you for your critical assessment of our interpretation of the splenectomy experiments, especially concerning the role of erythrophagocytosis by macrophages in other tissues, such as Kupffer cells in the liver. We appreciate your observation that our assertions may be overstated and acknowledge the need for more specific data to identify which organs compensate for the loss of splenic erythrophagocytosis.

      (1) Splenectomy experiment findings:

      • Our findings in Figure 2D do indicate that in the splenectomized group under NN conditions, erythrophagocytosis is substantially compensated for by functional macrophages in other tissues. This is an important observation that highlights the body's ability to adapt to the loss of splenic function.

      • However, under HH conditions, our data suggest that the spleen plays an important role in managing erythrocyte turnover, as indicated by the significant impact of splenectomy on erythrophagocytosis and subsequent erythrocyte dynamics.

      (2) Addressing the lack of specific organ identification:

      • We acknowledge that our study does not definitively identify which organs, such as the liver or others, take over the erythrophagocytosis function post-splenectomy. This is an important aspect that needs further investigation.

      • To address this, we also plan to perform additional experiments that could more accurately point out the specific tissues compensating for the loss of splenic erythrophagocytosis. This could involve tracking labeled erythrocytes or using specific markers to identify macrophages actively engaged in erythrophagocytosis in various organs.

      (3) Revising manuscript statements:

      Considering your feedback, we’ve revised the statements in lines 345-349 (lines 378-383 in revised manuscript) to enhance the scientific rigor and clarity of our research presentation.

      (9) M1 vs M2 macrophage experiments are irrelevant to the main thrust of the manuscript, there are no references to support the use of only CD16 and CD86 for these purposes, and no stats are provided. It is also unclear why bone marrow monocyte data is presented and how it is relevant to the rest of the manuscript.

      Thank you for your critical evaluation of the relevance and presentation of the M1 vs. M2 macrophage experiments in our manuscript. We appreciate your insights, especially regarding the use of specific markers and the lack of statistical analysis, as well as the relevance of bone marrow monocyte data to our study's main focus.

      (1) Removal of M1 and M2 macrophage data:

      Based on your feedback and our reassessment, we agree that the results pertaining to M1 and M2 macrophages did not align well with the main objectives of our manuscript. Consequently, we have decided to remove the related content on M1 and M2 macrophages from the revised manuscript. This decision was made to ensure that our manuscript remains focused and coherent, highlighting our primary findings without the distraction of unrelated or insufficiently supported data.

      The use of only CD16 and CD86 markers for M1 and M2 macrophage characterization, without appropriate statistical analysis, was indeed a methodological limitation. We recognize that a more comprehensive set of markers and rigorous statistical analysis would be necessary for a meaningful interpretation of M1/M2 macrophage polarization. Furthermore, the relevance of these experiments to the central theme of our manuscript was not adequately established. Our study primarily focuses on erythrophagocytosis and red pulp macrophage dynamics under hypobaric hypoxia, and the M1/M2 polarization aspect did not contribute significantly to this narrative.

      (2) Clarification on bone marrow monocyte data:

      Regarding the inclusion of bone marrow monocyte data, we acknowledge that its relevance to the main thrust of the manuscript was not clearly articulated. In the revised manuscript, we provide a clearer rationale for its inclusion and how it relates to our primary objectives.

      (3) Commitment to clarity and relevance:

      We are committed to ensuring that every component of our manuscript contributes meaningfully to our overall objectives and research questions. Your feedback has been instrumental in guiding us to streamline our focus and present our findings more effectively.

      We appreciate your valuable feedback, which has led to a more focused and relevant presentation of our research. These changes enhance the clarity and impact of our manuscript, ensuring that it accurately reflects our key research findings.

      (10) Biotinolated RBC clearance is enhanced, demonstrating that RBC erythrophagocytosis is in fact ENHANCED, not diminished, calling into question the founding hypothesis that the manuscript proposes.

      Thank you for your critical evaluation of our data on biotinylated RBC clearance, which suggests enhanced erythrophagocytosis under HH conditions. This observation indeed challenges our founding hypothesis that erythrophagocytosis is diminished in this setting. Below is a summary of our findings and methodology.

      (1) Interpretation of RBC labeling results:

      Both the previous results of NHS-biotin labeled RBCs (new Figure 4, A-D) and the current results of PKH67-labeled RBCs (new Figure 4, E-H) demonstrated a decrease in the number of labeled RBCs with an increase in injection time. The production of RBCs, including bone marrow and spleen production, was significantly increased following HH exposure, resulting in a consistent decrease in the proportion of labeled RBCs via flow cytometry detection both in the blood and spleen of mice compared to the NN group. However, compared to the reduced amplitude of fluorescently labeled RBCs observed in blood and spleen under NN exposure, there was a significantly weaker reduction in the amplitude of fluorescently labeled RBCs after HH exposure. Specifically, for blood, the biotin-labeled RBCs decreased by 12.06% under NN exposure and by 7.82% under HH exposure, while the PKH67-labeled RBCs decreased by 9.70% under NN exposure and by 4.09% under HH exposure. For spleen, the biotin-labeled RBCs decreased by 3.13% under NN exposure and by 0.46% under HH exposure, while the PKH67-labeled RBCs decreased by 1.16% under NN exposure and by 0.92% under HH exposure.

      Author response image 14.

      (2) Increased RBCs production under HH conditions:

      It's important to note that RBCs production, including from bone marrow and spleen, was significantly increased following HH exposure. This increase in RBCs production could contribute to the decreased proportion of labeled RBCs observed in flow cytometry analyses, as there are more unlabeled RBCs diluting the proportion of labeled cells in the blood and spleen.

      (3) Analysis of erythrophagocytosis in RPMs:

      Our analysis of PKH67-labeled RBCs content within RPMs following HH exposure showed a significant reduction in the number of PKH67-positive RPMs in the spleen (new Figure 5). This finding suggests a decrease in erythrophagocytosis by RPMs under HH conditions.

      Author response image 15.

      (4) Reconciling the findings:

      The apparent contradiction between enhanced RBC clearance (suggested by the reduced proportion of labeled RBCs) and reduced erythrophagocytosis in RPMs (indicated by fewer PKH67-positive RPMs) may be explained by the increased overall production of RBCs under HH. This increased production could mask the actual erythrophagocytosis activity in terms of the proportion of labeled cells. Therefore, while the proportion of labeled RBCs decreases more significantly under HH conditions, this does not necessarily indicate an enhanced erythrophagocytosis rate, but rather an increased dilution effect due to higher RBCs turnover.

      (5) Revised interpretation and manuscript changes:

      Given these factors, we update our manuscript to reflect this detailed interpretation and clarify the implications of the increased RBCs production under HH conditions on our observations of labeled RBCs clearance and erythrophagocytosis. We appreciate your insightful feedback, which has prompted a careful re-examination of our data and interpretations. We hope that these revisions provide a more accurate and comprehensive understanding of the effects of HH on erythrophagocytosis and RBCs dynamics.

      (11) Legend in Fig 4c-4d looks incorrect and Fig 4e-4f is very non-specific since Wright stain does not provide evidence of what type of cells these are and making for a significant overstatement in the contribution of this data to "confirming" increased erythrophagocytosis in the spleen under HH exposure (line 395-396).

      Thank you for your insightful observations regarding the data presentation and figure legends in our manuscript, particularly in relation to Figure 4 (renamed as Figure 6 in the revised manuscript) and the use of Wright-Giemsa composite staining. We appreciate your constructive feedback and acknowledge the importance of presenting our data with utmost clarity and precision.

      (1) Amendments to Figure legends:

      We recognize the necessity of rectifying inaccuracies in the legends of the previously labeled Figure 4C and D. Corrections have been meticulously implemented to ensure the legends accurately contain the data presented. Additionally, we acknowledge the error concerning the description of Wright staining. The method employed in our study is Wright-Giemsa composite staining, which, unlike Wright staining that solely stains cytoplasm (RBC), is capable of staining both nuclei and cytoplasm.

      (2) Addressing the specificity of Wright-Giemsa Composite staining:

      Our approach involved quantifying RBC retention using Wright-Giemsa composite staining on single splenic cells post-perfusion at 7 and 14 days post HH exposure. We understand and appreciate your concerns regarding the nonspecific nature of Wright staining. Although Wright stain is a general hematologic stain and not explicitly specific for certain cell types, its application in our study aimed to provide preliminary insights. The spleen cells, devoid of nuclei and thus likely to be RBCs, were stained and observed post-perfusion, indicating RBC retention within the spleen.

      (3) Incorporating additional methods for RBC identification:

      To enhance the specificity of our findings, we integrated supplementary methods for RBC identification in the revised manuscript. We employed band3 immunostaining (in the new Figure 6, C-D and G-H) and PKH67 labeling (Figure 6, I-J) for a more targeted identification of RBCs. Band3, serving as a reliable marker for RBCs, augments the specificity of our immunostaining approach. Likewise, PKH67 labeling affords a direct and definitive means to assess RBC retention in the spleen following HH exposure.

      Author response image 16. same as 10

      (4) Revised interpretation and manuscript modifications:

      Based on these enhanced methodologies, we have refined our interpretation of the data and accordingly updated the manuscript. The revised narrative underscores that our conclusions regarding reduced erythrophagocytosis and RBC retention under HH conditions are corroborated by not only Wright-Giemsa composite staining but also by band3 immunostaining and PKH67 labeling, each contributing distinctively to our comprehensive understanding.

      We are committed to ensuring that our manuscript precisely reflects the contribution of each method to our findings and conclusions. Your thorough review has been invaluable in identifying and rectifying areas for improvement in our research report and interpretation.

      (12) Ferroptosis data in Fig 5 is not specific to macrophages and Fer-1 data confirms the expected effect of Fer-1 but there is no data that supports that Fer-1 reverses the destruction of these cells or restores their function in hypoxia. Finally, these experiments were performed in peritoneal macrophages which are functionally distinct from splenic RPM.

      Thank you for your critique of our presentation and interpretation of the ferroptosis data in Figure 5 (renamed as Figure 9 in the revised manuscript), as well as your observations regarding the specificity of the experiments to macrophages and the effects of Fer-1. We value your input and acknowledge the need to clarify these aspects in our manuscript.

      (1) Clarification on cell type used in experiments:

      • We appreciate your attention to the details of our experimental setup. The experiments presented in Figure 9 were indeed conducted on splenic macrophages, not peritoneal macrophages, as incorrectly mentioned in the original figure legend. This was an error in our manuscript, and we have revised the figure legend accordingly to accurately reflect the cell type used.

      (2) Specificity of ferroptosis data:

      • We recognize that the data presented in Figure 9 need to be more explicitly linked to the specific macrophage population being studied. In the revised manuscript, we ensure that the discussion around ferroptosis data is clearly situated within the framework of splenic macrophages.

      • We also provide additional methodological details in the 'Methods' section to reinforce the specificity of our experiments to splenic macrophages.

      (3) Effects of Fer-1 on macrophage function and survival:

      • Regarding the effect of Fer-1, we agree that while our data confirms the expected effect of Fer-1 in inhibiting ferroptosis, we have not provided direct evidence that Fer-1 reverses the destruction of macrophages or restores their function in hypoxia.

      • To address this, we propose additional experiments to specifically investigate the impact of Fer-1 on the survival and functional restoration of splenic macrophages under hypoxic conditions. This would involve assessing not only the inhibition of ferroptosis but also the recovery of macrophage functionality post-treatment.

      (4) Revised interpretation and manuscript changes:

      • We’ve revised the relevant sections of our manuscript to reflect these clarifications and proposed additional studies. This includes modifying the discussion of the ferroptosis data to more accurately represent the cell types involved and the limitations of our current findings regarding the effects of Fer-1.

      • The revised manuscript presents a more detailed interpretation of the ferroptosis data, clearly describing what our current experiments demonstrate and what remains to be investigated.

      We are grateful for your insightful feedback, which has highlighted important areas for improvement in our research presentation. We think that these revisions will enhance the clarity and scientific accuracy of our manuscript, ensuring that our findings and conclusions are well-supported and precisely communicated.

      Reviewer #2 (Recommendations For The Authors):

      The following questions and remarks should be considered by the authors:

      (1) The methods should clearly state whether the HH was discontinued during the 7 or 14 day exposure for cleaning, fresh water etc. Moreover, how was CO2 controlled? The procedure for splenectomy needs to be described in the methods.

      Thank you for your inquiry regarding the specifics of our experimental methods, particularly the management of HH exposure and the procedure for splenectomy. We appreciate your attention to detail and the importance of these aspects for the reproducibility and clarity of our research.

      (1) HH exposure conditions:

      In our experiments, mice were continuously exposed to HH for the entire duration of 7 or 14 days, without interruption for activities such as cleaning or providing fresh water. This uninterrupted exposure was crucial for maintaining consistent hypobaric conditions throughout the experiment. The hypobaric chamber was configured to ensure a ventilation rate of 25 air exchanges per minute. This high ventilation rate was effective in regulating the concentration of CO2 inside the chamber, thereby maintaining a stable environment for the mice.

      (2) The splenectomy was performed as follows:

      After anesthesia, the mice were placed in a supine position, and their limbs were fixed. The abdominal operation area was skinned, disinfected, and covered with a sterile towel. A median incision was made in the upper abdomen, followed by laparotomy to locate the spleen. The spleen was then carefully pulled out through the incision. The arterial and venous directions in the splenic pedicle were examined, and two vascular forceps were used to clamp all the tissue in the main cadre of blood vessels below the splenic portal. The splenic pedicle was cut between the forceps to remove the spleen. The end of the proximal hepatic artery was clamped with a vascular clamp, and double or through ligation was performed to secure the site. The abdominal cavity was then cleaned to ensure there was no bleeding at the ligation site, and the incision was closed. Post-operatively, the animals were housed individually. Generally, they were able to feed themselves after recovering from anesthesia and did not require special care.

      We hope this detailed description addresses your queries and provides a clear understanding of the experimental conditions and procedures used in our study. These methodological details are crucial for ensuring the accuracy and reproducibility of our research findings.

      (2) The lack of changes in MCH needs explanation? During stress erythropoiesis some limit in iron availability should cause MCH decrease particularly if the authors claim that macrophages for rapid iron recycling are decreased. Fig 1A is dispensable. Fig 1G NN control 14 days does not make sense since it is higher than 7 days of HH.

      Thank you for your inquiry regarding the lack of changes in Mean Corpuscular Hemoglobin (MCH) in our study, particularly in the context of stress erythropoiesis and decreased macrophage-mediated iron recycling. We appreciate the opportunity to provide further clarification on this aspect.

      (1) Explanation for stable MCH levels:

      • Our research identified a decrease in erythrophagocytosis and iron recycling in the spleen following HH exposure. Despite this, the MCH levels remained stable. This observation can be explained by considering the compensatory roles of other organs, particularly the liver and duodenum, in maintaining iron homeostasis.

      • Specifically, our investigations revealed an enhanced capacity of the liver to engulf RBCs and process iron under HH conditions. This increased hepatic erythrophagocytosis likely compensates for the reduced splenic activity, thereby stabilizing MCH levels.

      (2) Role of hepcidin and DMT1 expression:

      Additionally, hypoxia is known to influence iron metabolism through the downregulation of Hepcidin and upregulation of Divalent Metal Transporter 1 (DMT1) expression. These alterations lead to enhanced intestinal iron absorption and increased blood iron levels, further contributing to the maintenance of MCH levels despite reduced splenic iron recycling.

      (3) Revised Figure 1 and data presentation

      To address the confusion regarding the data presented in Figure 1G, we have made revisions in our manuscript. The original Figure 1G, which did not align with the expected trends, has been removed. In its place, we have included a statistical chart of Figure 1F in the new version of Figure 1G. This revision will provide a clearer and more accurate representation of our findings.

      (4) Manuscript updates and future research:

      • We update our manuscript to incorporate these explanations, ensuring that the rationale behind the stable MCH levels is clearly articulated. This includes a discussion on the role of the liver and duodenum in iron metabolism under hypoxic conditions.

      • Future research could explore in greater detail the mechanisms by which different organs contribute to iron homeostasis under stress conditions like HH, particularly focusing on the dynamic interplay between hepatic and splenic functions.

      We thank you for your insightful question, which has prompted a thorough re-examination of our findings and interpretations. We believe that these clarifications will enhance the overall understanding of our study and its implications in the context of iron metabolism and erythropoiesis under hypoxic conditions.

      (3) Fig 2 the difference between sham and splenectomy is really marginal and not convincing. Is there also a difference at 7 days? Why does the spleen size decrease between 7 and 14 days?

      Thank you for your observations regarding the marginal differences observed between sham and splenectomy groups in Figure 2, as well as your inquiries about spleen size dynamics over time. We appreciate this opportunity to clarify these aspects of our study.

      (1) Splenectomy vs. Sham group differences:

      • In our experiments, the difference between the sham and splenectomy groups under HH conditions, though subtle, was consistent with our hypothesis regarding the spleen's role in erythrophagocytosis and stress erythropoiesis. Under NN conditions, no significant difference was observed between these groups, which aligns with the expectation that the spleen's contribution is more pronounced under hypoxic stress.

      (2) Spleen size dynamics and peak stress erythropoiesis:

      • The observed splenic enlargement prior to 7 days can be attributed to a combination of factors, including the retention of RBCs and extramedullary hematopoiesis, which is known to be a response to hypoxic stress.

      • Prior research has elucidated that splenic stress-induced erythropoiesis, triggered by hypoxic conditions, typically attains its zenith within a timeframe of 3 to 7 days. This observation aligns with our Toluidine Blue (TO) staining results, which indicated that the apex of this response occurs at the 7-day mark (as depicted in Figure 1, F-G). Here, the culmination of this peak is characteristically succeeded by a diminution in extramedullary hematopoiesis, a phenomenon that could elucidate the observed contraction in spleen size, particularly in the interval between 7 and 14 days.

      • This pattern of splenic response under prolonged hypoxic stress is corroborated by studies such as those conducted by Wang et al. (2021), Harada et al. (2015), and Cenariu et al. (2021). These references collectively underscore that the spleen undergoes significant dynamism in reaction to sustained hypoxia. This dynamism is initially manifested as an enlargement of the spleen, attributable to escalated erythropoiesis and erythrophagocytosis. Subsequently, as these processes approach normalization, a regression in spleen size ensues.

      We’ve revised our manuscript to include a more detailed explanation of these splenic dynamics under HH conditions, referencing the relevant literature to provide a comprehensive context for our findings. We will also consider performing additional analysis or providing further data on spleen size changes at 7 days to support our observations and ensure a thorough understanding of the splenic response to hypoxic stress over time.

      (4) Fig 3 B the clusters should be explained in detail. If the decrease in macrophages in Fig 3K/L is responsible for the effect, why does splenectomy not have a much stronger effect? How do the authors know which cells died in the calcein stained population in Fig 3D?

      Thank you for your insightful questions regarding the details of our data presentation in Figure 3, particularly about the identification of cell clusters and the implications of macrophage reduction. We appreciate the opportunity to address these aspects and clarify our findings.

      (1) Explanation of cell clusters in Figure 3B:

      • In the revised manuscript, we have included detailed notes for each cell population represented in Figure 3B (Figure 3D in revised manuscript). These notes provide a clearer understanding of the cell types present in each cluster, enhancing the interpretability of our single-cell sequencing data.

      • This detailed annotation will help readers to better understand the composition of the splenic cell populations under study and how they are affected by hypoxic conditions.

      (2) Impact of splenectomy vs. macrophage reduction:

      • The interplay between the reduction in macrophage populations, as evidenced by our single-cell sequencing data, and the ramifications of splenectomy presents a multifaceted scenario. Notably, the observed decline in macrophage numbers following HH exposure does not straightforwardly equate to a comparable alteration in overall splenic function, as might be anticipated with splenectomy.

      • In the context of splenectomy under HH conditions, a significant escalation in the RBCs count was observed, surpassing that in non-splenectomized mice exposed to HH. This finding underscores the spleen's critical role in modulating RBCs dynamics under HH. It also indirectly suggests that the diminished phagocytic capacity of the spleen following HH exposure contributes to an augmented RBCs count, albeit to a lesser extent than in the splenectomy group. This difference is attributed to the fact that, while the number of RPMs in the spleen post-HH is reduced, they are still present, unlike in the case of splenectomy, where they are entirely absent.

      • Splenectomy entails the complete removal of the spleen, thus eliminating a broad spectrum of functions beyond erythrophagocytosis and iron recycling mediated by macrophages. The nuanced changes observed in our study may be reflective of the spleen's diverse functionalities and the organism's adaptive compensatory mechanisms in response to the loss of this organ.

      (3) Calcein stained population in Figure 3D:

      • Regarding the identification of cell death in the calcein-stained population in Figure 3D (Figure 3A in revised manuscript), we acknowledge that the specific cell types undergoing death could not be distinctly determined from this analysis alone.

      • The calcein staining method allows for the visualization of live (calcein-positive) and dead (calcein-negative) cells, but it does not provide specific information about the cell types. The decrease in macrophage population was inferred from the single-cell sequencing data, which offered a more precise identification of cell types.

      (4) Revised manuscript and data presentation:

      • Considering your feedback, we have revised our manuscript to provide a more comprehensive explanation of the data presented in Figure 3, including the nature of the cell clusters and the interpretation of the calcein staining results.

      • We have also updated the manuscript to reflect the removal of Figure 3K/L results and to provide a more focused discussion on the relevant findings.

      We are grateful for your detailed review, which has helped us to refine our data presentation and interpretation. These clarifications and revisions will enhance the clarity and scientific rigor of our manuscript, ensuring that our conclusions are well-supported and accurately conveyed.

      (5) Is the reduced phagocytic capacity in Fig 4B significant? Erythrophagocytosis is compromised due to the considerable spontaneous loss of labelled erythrocytes; could other assays help? (potentially by a modified Chromium release assay?). Is it necessary to stimulated phagocytosis to see a significant effect?

      Thank you for your inquiry regarding the significance of the reduced phagocytic capacity observed in Figure 4B, and the potential for employing alternative assays to elucidate erythrophagocytosis dynamics under HH conditions.

      (1) Significance of reduced phagocytic capacity:

      The observed reduction in the amplitude of fluorescently labeled RBCs in both the blood and spleen under HH conditions suggests a decrease in erythrophagocytosis. This is indicative of a diminished phagocytic capacity, particularly when contrasted with NN conditions.

      (2) Investigation of erythrophagocytosis dynamics:

      To delve deeper into erythrophagocytosis under HH, we employed Tuftsin to enhance this process. Following the injection of PKH67-labeled RBCs and subsequent HH exposure, we noted a significant decrease in PKH67 fluorescence in the spleen, particularly marked after the administration of Tuftsin. This finding implies that stimulated erythrophagocytosis can influence RBCs lifespan.

      (3) Erythrophagocytosis under normal and hypoxic conditions:

      Under normal conditions, the reduction in phagocytic activity is less apparent without stimulation. However, under HH conditions, our findings demonstrate a clear weakening of the phagocytic effect. While we established that promoting phagocytosis under NN conditions affects RBC lifespan, the impact of enhanced phagocytosis under HH on RBCs numbers was not explicitly investigated.

      (4) Potential for alternative assays:

      Considering the considerable spontaneous loss of labeled erythrocytes, alternative assays such as a modified Chromium release assay could provide further insights. Such assays might offer a more nuanced understanding of erythrophagocytosis efficiency and the stability of labeled RBCs under different conditions.

      (5) Future research directions:

      The implications of these results suggest that future studies should focus on comparing the effects of stimulated phagocytosis under both NN and HH conditions. This would offer a clearer picture of the impact of hypoxia on the phagocytic capacity of macrophages and the subsequent effects on RBC turnover.

      In summary, our findings indicate a diminished erythrophagocytic capacity, with enhanced phagocytosis affecting RBCs lifespan. Further investigation, potentially using alternative assays, would be beneficial to comprehensively understand the dynamics of erythrophagocytosis in different physiological states.

      (6) Can the observed ferroptosis be influenced by bi- and not trivalent iron chelators?

      Thank you for your question regarding the potential influence of bi- and trivalent iron chelators on ferroptosis under hypoxic conditions. We appreciate the opportunity to discuss the implications of our findings in this context.

      (1) Analysis of iron chelators on ferroptosis:

      In our study, we did not specifically analyze the effects of bi- and trivalent iron chelators on ferroptosis under hypoxia. However, our observations with Deferoxamine (DFO), a well-known iron chelator, provide some insights into how iron chelation may influence ferroptosis in splenic macrophages under hypoxic conditions.

      (2) Effect of DFO on oxidative stress markers:

      Our findings showed that under 1% O2, there was an increase in Malondialdehyde (MDA) content, a marker of lipid peroxidation, and a decrease in Glutathione (GSH) content, indicative of oxidative stress. These changes are consistent with the induction of ferroptosis, which is characterized by increased lipid peroxidation and depletion of antioxidants. Treatment with Ferrostatin-1 (Fer-1) and DFO effectively reversed these alterations. This suggests that DFO, like Fer-1, can mitigate ferroptosis in splenic macrophages under hypoxia, primarily by impacting MDA and GSH levels.

      Author response image 17.

      (3) Potential role of iron chelators in ferroptosis:

      The effectiveness of DFO in reducing markers of ferroptosis indicates that iron availability plays a crucial role in the ferroptotic process under hypoxic conditions. It is plausible that both bi- and trivalent iron chelators could influence ferroptosis, given their ability to modulate iron availability within cells. Since ferroptosis is an iron-dependent form of cell death, chelating iron, irrespective of its valence state, could potentially disrupt the process by limiting the iron necessary for the generation of reactive oxygen species and lipid peroxidation.

      (4) Additional research and manuscript updates:

      Our study highlights the need for further research to explore the differential effects of various iron chelators on ferroptosis, particularly under hypoxic conditions. Such studies could provide a more comprehensive understanding of the role of iron in ferroptosis and the potential therapeutic applications of iron chelators. We update our manuscript to include these findings and discuss the potential implications of iron chelation in the context of ferroptosis under hypoxic conditions. This will provide a broader perspective on our research and its significance in understanding the mechanisms of ferroptosis.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #2 (Public Review):

      Weaknesses:

      There are however, substantial concerns about the interpretation of the findings and limitations to the current analysis. In particular, Analysis of single unit activity is absent, making interpretation of population clusters and decoding less interpretable. These concerns should be addressed to make sure that the results can be interpreted clearly in an active field that already contains a number of confusing and possibly contradictory findings.

      We addressed this important point (which was also made by reviewer #1) in our previous revision. Specifically, we included additional analyses that operate at the level of single units rather than the population level, as requested by the reviewer. For example, we assessed, separately for each recorded neuron, whether there was a statistically significant difference in the magnitude of neural activity between hit and miss trials. This approach allowed us to fully balance the numbers of hit and miss trials at each sound level that were entered into the analysis. The results revealed that a large proportion (close to 50%) of units were task modulated, i.e. had significantly different response magnitudes between hit and miss trials, and that this proportion was not significantly different between lesioned and non-lesioned mice. It is therefore no longer correct to say that “analysis of single unit activity is absent”, and we would be grateful if this statement could be changed.  

      Reviewer #2 (Recommendations For The Authors): 

      The authors have done a good job addressing the main concerns from the previous review. There are a few additional points that hopefully do not require substantial additional edits. 

      Figure 5/supplements. While the authors provide compelling evidence that clusters and overall activity patterns are similar for lesioned and control animals, there do appear to be some differences. For instance, the hit/miss difference for cluster 3 (the "auditory" cluster) appears to be absent for lesioned mice (Fig 5S3 D). Can the hit-miss difference be quantified? 

      We agree that there are some differences between the activity profiles of lesioned and non-lesioned mice: Inspection of panels A and C of Figure 5 – figure supplement 3, for instance, indicates that there is a relatively high proportion of neurons in cluster 3 of the non-lesioned mice that exhibit prolonged elevated activity in hit trials and a relatively lower proportion of those neurons in cluster 3 of lesioned mice. This likely explains the difference in the average response profiles of cluster 3 between the two groups pointed out by the reviewer. Furthermore, there is a slightly larger pre-stimulus dip in hit trial activity for lesioned than non-lesioned mice in cluster 1, a more pronounced short latency peak in hit trial activity for lesioned mice in cluster 2 as well as differences in other clusters. However, these differences are not inconsistent with our interpretation of these data in that we describe the activity profiles as being “similar” and exhibiting a “close correspondence” (rather than as being identical). Having considered this carefully, we do not believe that attempting to quantify these small differences would add much value here or help the reader with the interpretation of these data, especially given that the activity profiles of all neurons that make up each cluster are plotted in panels A and C.  

      Could the mice have been using somatosensory information to perform the task? A wideband click presented from a free-field speaker could have energy in a low frequency range that triggers a whisker response. Given the moderate but not insignificant somatosensory input into the IC shell, this doesn't seem like a trivial concern, and it could substantially impact interpretation of the results. Without wanting to complicate things too much, the authors might consider one or more of these questions: What's the frequency content of the click? Can a deaf mouse perform the task? Can an AC-lesioned mouse learn/perform the task with close-field acoustic stimulation? Or for a highfrequency tone target rather than a click?

      This is an interesting suggestion. We have, in the context of another study, trained mice in our lab to detect somatosensory stimulation (a brush stroke to their whiskers) and consistently found that it takes them much longer (often two weeks or more) to learn to respond to a stimulation of their whiskers than to the presentation of a sound. The brush strokes applied to the whiskers in those experiments were 50-150 ms in duration and were thus orders of magnitude greater in both their duration and amplitude and considerably more salient than any somatosensory stimulus that could potentially arise from the clicks presented here. Therefore, we consider it highly unlikely that mice learned to use somatosensory information potentially picked up by their whiskers to perform the click detection task.  

      L. 63. The authors might want to cite some recent work from the Apostilides lab on the properties of AC-IC projections as well as non-auditory signals in the IC. 

      There are two recent papers from the Apostolides lab that are relevant to our study. We already cite Quass et al., 2023. We have now added Ford et al., 2024 as well.

      Changes to manuscript:

      Line 81: “This raises the possibility that these context-dependent effects may be inherited from the auditory cortex (Ford et al., 2024)”.

      L. 220. "sound-responsive neurons" It is possible to report the representation of sound-responsive neurons in the different clusters? This might help tease apart what processes contribute to their respective activity. Not a big problem if the samples can't be registered easily.

      Sound-driven neurons were identified on the basis of a subset (those trials in which sounds were presented at levels from 53 dB SPL to 65 dB SPL) of the trials used for the clustering analysis so the analyses are not directly comparable.

      p. 603. "quieter stimuli" What sound level was actually used in the 2p experiments? Was it fixed at a single level per animal?

      Sound level was not fixed at a single level. A total of nine different sound levels were used per mouse. We apologize that this was not made clear previously.  

      Changes to manuscript:

      Line 603: “Once the mice had achieved a stable level of performance (typically two days with d’ > 1.5), quieter stimuli (41-71 dB SPL) were introduced. For each mouse a total of 9 different sound levels were used and the range of sound levels was adjusted to each animal’s behavioral performance to avoid floor and ceiling effects and could, therefore, differ from mouse to mouse.”

      L. 747. Something is not right with this formula. It appears that it will always reduce to a value of 1/2.

      Thanks for spotting this. There are two typos in this formula. This has been fixed and now reads (line 749):  

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Recommendations for The Authors):

      Q1: In response to reviewers you noted totally 292 sequenced LECs, however in reviewer figure 3 B the numbers seem to add up to 221. Please include mention of the total number of LEC sequences. Please mention line 119, page 4 the total number of explored LEC transcriptomes

      Thank you for your carefully review. We have updated Fig 2A, 2C and 2E. It was 242 (not 292) LECs included in our initial analysis, which contains the sample of d5 post MI in raw data (E-MTAB-7895). We dropped d5 in our subsequent analysis because the change in d5 did not significant differ from d3. Therefore, we included 221 LECs in our final analysis as we updated in Fig2A, 2C and 2E.

      Q2-1: Figure 3A supposedly shows % of LEC subpopulations relative to their numbers found in day 0 samples. However, there seem to be some errors, because for example the subpop LEC Cap I include 13 cells day 1 and 6 cells day 1, which corresponds to 46% of initial numbers. However, from your graph 3B the blue population seems to occupy 10%. Please revise or explain how these relative % were calculated.

      Thank you for your question. In the Figure 3A, each column was calculated by dn/d0*100%, that is d0=57/57*100%=100%, and d1= 21/57*100%=36.84%, d3=9/57*100%=15.79%, d7, d14, d28...Therefor, Cap I in d0 (13 cells) is 13/57*100%=22.81%, and Cap I in d1(6 cells) is 6/57*100%= 10.53%.

      Q2-2: Further, based on the relative % of LEC subpopulations, using the numbers mentioned in Fig 3B, it would appear that the relative frequency LEC cap II population is actually stable at around 20-30% of all LECs per time point throughout the study (except day 1 drop). This contrasts with line 136 p. 4 statement. I would also urge caution for interpreting too much into the variation of relative levels of LEC co, as these represent exceeding rare cells in your samples, and could reflect technical issues rather than true biological variation (total LEC co numbers analyzed ranging from 1-24 cells/ time point). The same could be said of LEC cap II and cap III.

      We strongly agree with your comment on the proportion of LEC cell subtypes post MI. As you pointed out, we have revised the result description on Page 4, line 137-143 as followed.

      “In the early stages of myocardial infarction (D1 and D3), the quantity of LECs decreased sharply. The number of LECs gradually increasing from day 7 and returning to normal levels by day 14 after MI. Moreover, from day 14 onwards, the number and proportion of Ca I type LECs significantly increased.”

      Q3: Please list in supplement the gene features used to identify in spatial transcriptomics the different LEC subpopulations, as their profiles (notably for capillary LECs) don't appear to be very different based on data in Fig 2F.

      We have supplied gene features in supplementary materials.

      Q4: In section 2.7 you refer to Gal9 secretion. Please replace with expression as no measure of protein levels from LECs has been described in your study.

      Thank you for your suggestion, we have replaced secretion with expression.

      Q5: The updated method to exclude non-lymphatic cells from lymphatic vessel analyses by incorporating pdpn as an additional marker ('present costained areas wherever possible' line 350 p 10)

      Thank you for your correction. We have updated the description as follows and lighted them in the manuscript: rabbit anti-Lyve1 (1:300, ab14917, Abcam, UK), [Syrian hamster anti-Podoplanin (1:100, 53-5381-82, Thermo, USA), rabbit anti-Prox1(1:300, ab199359, Abcam, UK), both anti-podoplain and anti-prox1 are additional markers co-stained with Lyve1 to exclude non-lymphatic cells from lymphatic vessel].

      Q6: Fig 1B, it is highly surprising to see the lymphatic density in the BZ go from 25 um² at day 3 to more than 1000 um² only four days later (day 7). Is it possible that your day 3 measurements were in the infarct area, and not BZ area? The H&E image shown in Fig1a for d3 sample would seem to indicate the analysis was done in a dead area, rather than BZ. Please revise (perhaps select similar zone as shown for d1 in fig 6D, adjusted for subepicardial region and not mid-myocardial as seems to be the case currently), and also provide lymphatic area measures in healthy myocardium for day 0 samples. The unit used (um²) also would depend on the size of the area examined. Is this unit per image? If so please report total imaged area as a reference.

      A6: Thank you for your reminding and advises. We have labeled each zone on H&E and IF images in Fig1-supplementary Fig2B, and updated a clearer histological photo taken at 3 days post MI in Fig1A. Furthermore, we recalculated the lymphatic vessel area ratio as you suggested by calculating the ratio of LEC co-stained area to total imaged area under 100-fold magnification.

      Q7: The mention that CD68 antibody isn't compatible with lyve1 antibody could easily have been bridged by using other macrophage markers, such as F4/80, which is readily available and often used marker for macs in mice and comes notably as a rat anti-mouse F4-80. It would have added much more relevant information to exclude Lyve1-/F4/80+ cells as compared to the current analysis, which may indeed include in area measures Lyve1+ /Pdpn- single cells erroneously spotted as 'lymphatic vessels'

      Thank you for your excellent suggestion. We co-stained the sample with F4/80 and LYVE1 and supplied in the Fig1-supplementary Figure 1E, as shown in Author response image 1.

      Author response image 1.

      Immunofluorescence (IF) co-staining of tissue section with F4/80 and LYVE1 in sham and MI mice model at d3, d7, d14, and d28 post-MI. LYVE1: lymphatic vessel endothelial hyaluronan receptor 1; DAPI: 4’6-diamidino-2-phenylindole; scale bar in 10×-100 μm, 40×-25μm.

      Reviewer 2 (Recommendations for The Authors):

      Q1: Language expression must be improved. Many incomplete sentences exist throughout the manuscript. A few examples: Line 70-71: In order to further elucidate the effects and regulatory mechanisms of the lymphatic vessels in the repair process of myocardial injury following MI. Line 71-73. This study, integrated single-cell sequencing and spatial transcriptome data from mouse heart tissue at different timepoints after MI from publicly available data (E-MTAB-7895, GSE214611) in the ArrayExpress and gene expression omnibus (GEO) databases. Line 88-89: Since the membrane protein LYVE1 can present lymphatic vessel morphology more clearly than PROX1.

      Thank you for your correction. We have carefully inspected and corrected the whole manuscript.

      Q2: The type of animal models (i.e., permanent MI or MI plus reperfusion) included in Array Express and gene expression omnibus (GEO) databases must be clearly defined as these two models may have completely different effects on lymphatic vessel development during post-MI remodeling.

      Thank you for your excellent suggestion. The animal models used in both E-MTAB-7895 and GSE214611 are permanent MI. We have modified the model information in the methodology section (page 12, line 400-401).

      Q3: Line 119-120: Caution must be taken regarding Cav1 as a lymphocyte marker because Cav1 is expressed in all endothelial cells, not limited to LEC.

      Thanks for your reminding. Cav 1 used in our clustering is one of the marker gene for its different expression in sub-types of LECs, referred in article PMID: 31402260

      Q4: Figure 1 legend needs to be improved. RZ, BZ, and IZ need to be labeled in all IF images. Day 0 images suggest that RZ is the tissue section from the right ventricle.

      Thank you for your suggestion. We have labeled and updated the regions of RZ, BZ, and IZ in H&E and IF image in Figure1-Figure supplement 2B.

      Q5: The discussion section needs to be improved and better focused on the findings from the current study.

      Thank you for your good comment. Based on your suggestion, we have revised the first paragraph of the discussion from lines 250-256 (Page 7) as followed:

      Cardiac lymphatics play an important role in myocardial edema and inflammation. This study, for the first time, integrated single-cell sequencing data and spatial transcriptome data from mouse heart tissue at different time points of post-MI, and identified four transcriptionally distinct subtypes of LECs and their dynamic transcriptional heterogeneity distribution in different regions of myocardial tissue post-MI. These subgroups of LECs were shown to form different function involved in the inflammation, apoptosis, ferroptosis, and water absorption related regulation of vasopressin during the process of myocardial repair after MI.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In their manuscript entitled 'The domesticated transposon protein L1TD1 associates with its ancestor L1 ORF1p to promote LINE-1 retrotransposition', Kavaklıoğlu and colleagues delve into the role of L1TD1, an RNA binding protein (RBP) derived from a LINE1 transposon. L1TD1 proves crucial for maintaining pluripotency in embryonic stem cells and is linked to cancer progression in germ cell tumors, yet its precise molecular function remains elusive. Here, the authors uncover an intriguing interaction between L1TD1 and its ancestral LINE-1 retrotransposon.

      The authors delete the DNA methyltransferase DNMT1 in a haploid human cell line (HAP1), inducing widespread DNA hypo-methylation. This hypomethylation prompts abnormal expression of L1TD1. To scrutinize L1TD1's function in a DNMT1 knock-out setting, the authors create DNMT1/L1TD1 double knock-out cell lines (DKO). Curiously, while the loss of global DNA methylation doesn't impede proliferation, additional depletion of L1TD1 leads to DNA damage and apoptosis.

      To unravel the molecular mechanism underpinning L1TD1's protective role in the absence of DNA methylation, the authors dissect L1TD1 complexes in terms of protein and RNA composition. They unveil an association with the LINE-1 transposon protein L1-ORF1 and LINE-1 transcripts, among others.

      Surprisingly, the authors note fewer LINE-1 retro-transposition events in DKO cells compared to DNMT1 KO alone.

      Strengths:

      The authors present compelling data suggesting the interplay of a transposon-derived human RNA binding protein with its ancestral transposable element. Their findings spur interesting questions for cancer types, where LINE1 and L1TD1 are aberrantly expressed.

      Weaknesses:

      Suggestions for refinement:

      The initial experiment, inducing global hypo-methylation by eliminating DNMT1 in HAP1 cells, is intriguing and warrants more detailed description. How many genes experience misregulation or aberrant expression? What phenotypic changes occur in these cells? Why did the authors focus on L1TD1? Providing some of this data would be helpful to understand the rationale behind the thorough analysis of L1TD1.

      The finding that L1TD1/DNMT1 DKO cells exhibit increased apoptosis and DNA damage but decreased L1 retro-transposition is unexpected. Considering the DNA damage associated with retro-transposition and the DNA damage and apoptosis observed in L1TD1/DNMT1 DKO cells, one would anticipate the opposite outcome. Could it be that the observation of fewer transposition-positive colonies stems from the demise of the most transpositionpositive colonies? Further exploration of this phenomenon would be intriguing.

      Reviewer #2 (Public review):

      In this study, Kavaklıoğlu et al. investigated and presented evidence for a role for domesticated transposon protein L1TD1 in enabling its ancestral relative, L1 ORF1p, to retrotranspose in HAP1 human tumor cells. The authors provided insight into the molecular function of L1TD1 and shed some clarifying light on previous studies that showed somewhat contradictory outcomes surrounding L1TD1 expression. Here, L1TD1 expression was correlated with L1 activation in a hypomethylation dependent manner, due to DNMT1 deletion in HAP1 cell line. The authors then identified L1TD1 associated RNAs using RIPSeq, which display a disconnect between transcript and protein abundance (via Tandem Mass Tag multiplex mass spectrometry analysis). The one exception was for L1TD1 itself, is consistent with a model in which the RNA transcripts associated with L1TD1 are not directly regulated at the translation level. Instead, the authors found L1TD1 protein associated with L1-RNPs and this interaction is associated with increased L1 retrotransposition, at least in the contexts of HAP1 cells. Overall, these results support a model in which L1TD1 is restrained by DNA methylation, but in the absence of this repressive mark, L1TD1 is expression, and collaborates with L1 ORF1p (either directly or through interaction with L1 RNA, which remains unclear based on current results), leads to enhances L1 retrotransposition. These results establish feasibility of this relationship existing in vivo in either development or disease, or both.

      Comments on revised version:

      In general, the authors did an acceptable job addressing the major concerns throughout the manuscript. This revision is much clearer and has improved in terms of logical progression.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors have addressed all my questions in the revised version of the manuscript.

      Reviewer #2 (Recommendations for the authors):

      Revised comments:

      A few points we'd like to see addressed are our comments about the model (Figure S7C), as this is important for the readership to understand this complex finding. Please try to apply some quantification, if possible (question 8). Please do your best to tone down the direct relationship of these findings to embryology (question 11). Based on both reviewer comments, we believe addressing reviewer #1s "Suggestions for refinement" (2 points), would help us change our view of solid to convincing.

      Responses to changes:

      Major

      (1) The study only used one knockout (KO) cell line generated by CRISPR/Cas9.

      Considering the possibility of an off-target effect, I suggest the authors attempt one or both of these suggestions.

      A)  Generate or acquire a similar DMNT1 deletion that uses distinct sgRNAs, so that the likelihood of off-targets is negligible. A few simple experiments such as qRT-PCR would be sufficient to suggest the same phenotype.

      B)  Confirm the DNMT1 depletion also by siRNA/ASO KD to phenocopy the KO effect.

      (2) In addition to the strategies to demonstrate reproducibility, a rescue experiment restoring DNMT1 to the KO or KD cells would be more convincing. (Partial rescue would suffice in this case, as exact endogenous expression levels may be hard to replicate).

      We have undertook several approaches to study the effect of DNMT1 loss or inactivation: As described above, we have generated a conditional KO mouse with ablation of DNMT1 in the epidermis. DNMT1-deficient keratinocytes isolated from these mice show a significant increase in L1TD1 expression. In addition, treatment of primary human keratinocytes and two squamous cell carcinoma cell lines with the DNMT inhibitor aza-deoxycytidine led to upregulation of L1TD1 expression. Thus, the derepression of L1TD1 upon loss of DNMT1 expression or activity is not a clonal effect.

      Also, the spectrum of RNAs identified in RIP experiments as L1TD1-associated transcripts in HAP1 DNMT1 KO cells showed a strong overlap with the RNAs isolated by a related yet different method in human embryonic stem cells. When it comes to the effect of L1TD1 on L1-1 retrotranspostion, a recent study has reported a similar effect of L1TD1 upon overexpression in HeLa cells [4].

      All of these points together help to convince us that our findings with HAP1 DNMT KO are in agreement with results obtained in various other cell systems and are therefore not due to off-target effects. With that in mind, we would pursue the suggestion of Reviewer 1 to analyze the effects of DNA hypomethylation upon DNMT1 ablation.

      Thank you for addressing this concern. The reference to Beck 2021 and the additional cells lines (R2: keratinocytes and R3: squamous cell carcinoma) provides sufficient evidence that this result is unlikely to be a result of clonal expansion or off targets.

      Question: Was the human ES Cell RIP Experiment shown here? What is the overlap?

      We refer to the recently published study by Jin et al. (PMID: 38165001). As stated in the Discussion, the majority of L1TD1-associated transcripts in HAP1 cells (69%) identified in our study were also reported as L1TD1 targets in hESCs suggesting a conserved binding affinity of this domesticated transposon protein across different cell types.  

      (3) As stated in the introduction, L1TD1 and ORF1p share "sequence resemblance" (Martin 2006). Is the L1TD1 antibody specific or do we see L1 ORF1p if Fig 1C were uncropped?

      (6) Is it possible the L1TD1 antibody binds L1 ORF1p? This could make Figure 2D somewhat difficult to interpret. Some validation of the specificity of the L1TD1 antibody would remove this concern (see minor concern below).

      This is a relevant question. We are convinced that the L1TD1 antibody does not crossreact with L1 ORF1p for the following reasons: Firstly, the antibody does not recognize L1 ORF1p (40 kDa) in the uncropped Western blot for Figure 1C (Figure R4A). Secondly, the L1TD1 antibody gives only background signals in DKO cells in the indirect immunofluorescence experiment shown in Figure 1E of the manuscript.

      Thirdly, the immunogene sequence of L1TD1 that determines the specificity of the antibody was checked in the antibody data sheet from Sigma Aldrich. The corresponding epitope is not present in the L1 ORF1p sequence.

      Finally, we have shown that the ORF1p antibody does not cross-react with L1TD1 (Figure R4B).

      Response: Thank you for sharing these images. These full images relieve concerns about specificity. The increase of ORF1P in R4B and Main figure 3C is interesting and pointed out in the manuscript. Not for the purposes of this review, but the observation of reduced transposition despite increased ORF1P could be an interesting follow up to this study (combined with the similar UPF1 result could indicate a complex of some kind).

      (4) In abstract (P2), the authors mentioned that L1TD1 works as an RNA chaperone, but in the result section (P13), they showed that L1TD1 associates with L1 ORF1p in an RNA independent manner. Those conclusions appear contradictory. Clarification or revision is required.

      Our findings that both proteins bind L1 RNA, and that L1TD1 interacts with ORF1p are compatible with a scenario where L1TD1/ORF1p heteromultimers bind to L1 RNA. The additional presence of L1TD1 might thereby enhance the RNA chaperone function of ORF1p. This model is visualized now in Suppl. Figure S7C.

      Response: Thank you for the model. To further clarify, do you mean that L1TD1 can bind L1 RNA, but this is not needed for the effect, however this "bonus" binding (that is enabled by heteromultimerization) appears to enhance the retrotransposition frequency? Do you think L1TD1 is binding L1 RNA in this context or simply "stabilizing" ORF1P (Trimer) RNP?

      Based on our data, L1TD1 associates with L1 RNA and interacts with L1 ORF1p. Both features might contribute to the enhanced retrotransposition frequency. Interestingly, the L1TD1 protein shares with its ancestor L1 ORF1p the non-canonical RNA recognition motif and the coiled-coil motif required for the trimerization but has two copies instead of one of the C-terminal domain (CTD), a structure with RNA binding and chaperone function. We speculate that the presence of an additional CTD within the L1TD1 protein might thereby enhance the RNA binding and chaperone function of L1TD1/ORF1p heteromultimers.

      (5) Figure 2C fold enrichment for L1TD1 and ARMC1 is a bit difficult to fully appreciate. A 100 to 200-fold enrichment does not seem physiological. This appears to be a "divide by zero" type of result, as the CT for these genes was likely near 40 or undetectable. Another qRT-PCR based approach (absolute quantification) would be a more revealing experiment. This is the validation of the RIP experiments and the presentation mode is specifically developed for quantification of RIP assays (Sigma Aldrich RIP-qRT-PCR: Data Analysis Calculation Shell). The unspecific binding of the transcript in the absence of L1TD1 in DNMT1/L1TD1 DKO cells is set to 1 and the value in KO cells represents the specific binding relative the unspecific binding. The calculation also corrects for potential differences in the abundance of the respective transcript in the two cell lines. This is not a physiological value but the quantification of specific binding of transcripts to L1TD1. GAPDH as negative control shows no enrichment, whereas specifically associated transcripts show strong enrichement. We have explained the details of RIPqRT-PCR evaluation in Materials and Methods (page 14) and the legend of Figure 2C in the revised manuscript.

      Response: Thank you for the clarification and additional information in the manuscript.

      (6) Is it possible the L1TD1 antibody binds L1 ORF1p? This could make Figure 2D somewhat difficult to interpret. Some validation of the specificity of the L1TD1 antibody would remove this concern (see minor concern below).

      See response to (3).

      Response: Thanks.

      (7) Figure S4A and S4B: There appear to be a few unusual aspects of these figures that should be pointed out and addressed. First, there doesn't seem to be any ORF1p in the Input (if there is, the exposure is too low). Second, there might be some L1TD1 in the DKO (lane 2) and lane 3. This could be non-specific, but the size is concerning. Overexposure would help see this.

      The ORF1p IP gives rise to strong ORF1p signals in the immunoprecipitated complexes even after short exposure. Under these conditions ORF1p is hardly detectable in the input. Regarding the faint band in DKO HAP1 cells, this might be due to a technical problem during Western blot loading. Therefore, the input samples were loaded again on a Western blot and analyzed for the presence of ORF1p, L1TD1 and beta-actin (as loading control) and shown as separate panel in Suppl. Figure S4A.

      The enhanced image is clearer. Thanks.

      S4A and S4B now appear to the S6A and S6B, is that correct? (This is due to the addition of new S1 and S2, but please verify image orders were not disturbed).

      Yes, the input is shown now as a separate panel in Suppl. Figure S6A.

      (8) Figure S4C: This is related to our previous concerns involving antibody cross-reactivity. Figure 3E partially addresses this, where it looks like the L1TD1 "speckles" outnumber the ORF1p puncta, but overlap with all of them. This might be consistent with the antibody crossreacting. The western blot (Figure 3C) suggests an upregulation of ORF1p by at least 23x in the DKO, but the IF image in 3E is hard to tell if this is the case (slightly more signal, but fewer foci). Can you return to the images and confirm the contrast are comparable? Can you massively overexpose the red channel in 3E to see if there is residual overlap? In Figure 3E the L1TD1 antibody gives no signal in DNMT1/L1TD1 DKO cells confirming that it does not recognize ORF1p. In agreement with the Western blot in Figure 3C the L1 ORF1p signal in Figure 3E is stronger in DKO cells. In DNMT1 KO cells the L1 ORF1p antibody does not recognize all L1TD1 speckles. This result is in agreement with the Western blot shown above in Figure R4B and indicates that the L1 ORF1p antibody does not recognize the L1TD1 protein. The contrast is comparable and after overexposure there are still L1TD1 specific speckles. This might be due to differences in abundance of the two proteins.

      Response: Suggestion: Would it be possible to use a program like ImageJ to supplement the western blot observation? Qualitatively, In figure 3E, it appears that there is more signal in the DKO, but this could also be due to there being multiple cells clustered together or a particularly nicely stained region. Could you randomly sample 20-30 cells across a few experiments to see if this holds up. I am interested in whether the puncta in the KO image(s) is a very highly concentrated region and in the DKO this is more disperse. Also, the representative DKO seems to be cropped slightly wrong. (Please use puncta as a guide to make the cropping more precise)

      As suggested by the reviewer we have quantified the signals of 60 KO cells and 56 DKO cells in three different IF experiments by ImageJ. We measured a 1.4-fold higher expression level of L1 ORF1p in DKO cells. However, the difference is not statistically significant. This is most probably due to the change in cell size and protein content during the cell cycle with increasing protein contents from G1 to G2. Western blot analysis provides signals of comparable protein amounts representing an average expression levels over ten thousands of cells. Nevertheless, the quantification results reflect in principle the IF pictures shown in Figure 3E but IF is probably not the best method to quantify protein amounts. We have also corrected Figure 3E.

      Author response image 1.

      (9) The choice of ARMC1 and YY2 is unclear. What are the criteria for the selection?

      ARMC1 was one of the top hits in a pilot RIP-seq experiment (IP versus input and IP versus IgG IP). In the actual RIP-seq experiment with DKO HAP1 cells instead of IgG IP as a negative control, we found ARMC1 as an enriched hit, although it was not among the top 5 hits. The results from the 2nd RIP-seq further confirmed the validity of ARMC1 as an L1TD1interacting transcript. YY2 was of potential biological relevance as an L1TD1 target due to the fact that it is a processed pseudogene originating from YY1 mRNA as a result of retrotransposition. This is mentioned on page 6 of the revised manuscript.

      Response: Appreciated!

      (10) (P16) L1 is the only protein-coding transposon that is active in humans. This is perhaps too generalized of a statement as written. Other examples are readily found in the literature.

      Please clarify.

      We will tone down this statement in the revised manuscript.

      Response: Appreciated! To further clarify, the term "active" when it comes to transposable elements, has not been solidified. It can span "retrotransposition competent" to "transcripts can be recovered". There are quite a few reports of GAG transcripts and protein from various ERV/LTR subfamilies in various cells and tissues (in mouse and human at least), however whether they contribute to new insertions is actively researched.

      (11) In both the abstract and last sentence in the discussion section (P17), embryogenesis is mentioned, but this is not addressed at all in the manuscript. Please refrain from implying normal biological functions based on the results of this study unless appropriate samples are used to support them.

      Much of the published data on L1TD1 function are related to embryonic stem cells [3- 7].

      Therefore, it is important to discuss our findings in the context of previous reports.

      Response: It is well established that embryonic stem cells are not a perfect or direct proxies for the inner cell mass of embryos, as multiple reports have demonstrated transcriptomic, epigenetic, chromatin accessibility differences. The exact origin of ES cells is also considered controversial. We maintain that the distinction between embryos/embryogenesis and the results presented in the manuscript are not yet interchangeable. An important exception would be complex models of embryogenesis such as embryoids, (or synthetic/artificial embryo models that have been carefully been termed as such so as to not suggest direct implications to embryos). https://www.nature.com/articles/ncb2965  

      https://link.springer.com/article/10.1007/s00018-018-2965-y  

      https://www.cell.com/developmental-cell/abstract/S1534-5807(24)00363-0?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS1534580724003630%3Fshowall%3Dtrue

      We have deleted the corresponding paragraph in the Discussion.

      (12) Figure 3E: The format of Figures 1A and 3E are internally inconsistent. Please present similar data/images in a cohesive way throughout the manuscript. We show now consistent IF Figures in the revised manuscript.

      Response: Thanks

      Minor:

      In general:

      Still need checking for typos, mostly in Materials and Methods section; Please keep a consistent writing style throughout the whole manuscript. If you use L1 ORF1p, then please use L1 instead of LINE-1, or if you keep LINE-1 in your manuscript, then you should use LINE-1 ORF1p.

      A lab member from the US checked again the Materials and Methods section for typos. We keep the short version L1 ORF1p.

      (1) Intro:

      - Is L1Td1 in mice and Humans? How "conserved" is it and does this suggest function? Murine and human L1TD1 proteins share 44% identity on the amino acid level and it was suggested that the corresponding genes were under positive selection during evolution with functions in transposon control and maintenance of pluripotency [8].

      - Why HAP1? (Haploid?) The importance of this cell line is not clear.

      HAP1 is a nearly haploid human cancer cell line derived from the KBM-7 chronic myelogenous leukemia (CML) cell line [9, 10]. Due to its haploidy is perfectly suited and widely used for loss-of-function screens and gene editing. After gene editing cells can be used in the nearly haploid or in the diploid state. We usually perform all experiments with diploid HAP1 cell lines. Importantly, in contrast to other human tumor cell lines, this cell line tolerates ablation of DNMT1. We have included a corresponding explanation in the revised manuscript on page 5, first paragraph.

      - Global methylation status in DNMT1 KO? (Methylations near L1 insertions, for example?)

      The HAP1 DNMT1 KO cell line with a 20 bp deletion in exon 4 used in our study was validated in the study by Smits et al. [11]. The authors report a significant reduction in overall DNA methylation. However, we are not aware of a DNA methylome study on this cell line. We show now data on the methylation of L1 elements in HAP1 cells and upon DNMT1 deletion in the revised manuscript in Suppl. Figure S1B.

      Response: Looks great!

      (2) Figure 1:

      - Figure 1C. Why is LMNB used instead of Actin (Fig1D)?

      We show now beta-actin as loading control in the revised manuscript.

      - Figure 1G shows increased Caspase 3 in KO, while the matching sentence in the result section skips over this. It might be more accurate to mention this and suggest that the single KO has perhaps an intermediate phenotype (Figure 1F shows a slight but not significant trend).

      We fully agree with the reviewer and have changed the sentence on page 6, 2nd paragraph accordingly.

      - Would 96 hrs trend closer to significance? An interpretation is that L1TD1 loss could speed up this negative consequence.

      We thank the reviewer for the suggestion. We have performed a time course experiment with 6 biological replicas for each time point up to 96 hours and found significant changes in the viability upon loss of DNMT1 and again significant reduction in viability upon additional loss of L1TD1 (shown in Figure 1F). These data suggest that as expected loss of DNMT1 leads to significant reduction viability and that additional ablation of L1TD1 further enhances this effect.

      Response: Looks good!

      - What are the "stringent conditions" used to remove non-specific binders and artifacts (negative control subtraction?)

      Yes, we considered only hits from both analyses, L1TD1 IP in KO versus input and L1TD1 IP in KO versus L1TD1 IP in DKO. This is now explained in more detail in the revised manuscript on page 6, 3rd paragraph.

      (3) Figure 2:

      - Figure 2A is a bit too small to read when printed.

      We have changed this in the revised manuscript.

      - Since WT and DKO lack detectable L1TD1, would you expect any difference in RIP-Seq results between these two?

      Due to the lack of DNMT1 and the resulting DNA hypomethylation, DKO cells are more similar to KO cells than WT cells with respect to the expressed transcripts.

      - Legend says selected dots are in green (it appears blue to me). We have changed this in the revised manuscript.

      - Would you recover L1 ORF1p and its binding partners in the KO? (Is the antibody specific in the absence of L1TD1 or can it recognize L1?) I noticed an increase in ORF1p in the KO in Figure 3C.

      Thank you for the suggestion. Yes, L1 ORF1p shows slightly increased expression in the proteome analysis and we have marked the corresponding dot in the Volcano plot (Figure 3A).

      - Should the figure panel reference near the (Rosspopoff & Trono) reference instead be Sup S1C as well? Otherwise, I don't think S1C is mentioned at all.

      - What are the red vs. green dots in 2D? Can you highlight ERV and ALU with different colors?

      We added the reference to Suppl. Figure S1C (now S3C) in the revised manuscript. In Figure 2D L1 elements are highlighted in green, ERV elements in yellow, and other associated transposon transcripts in red.

      Response: Much better, thanks!

      - Which L1 subfamily from Figure 2D is represented in the qRT-PCR in 2E "LINE-1"? Do the primers match a specific L1 subfamily? If so, which? We used primers specific for the human L1.2 subfamily.

      - Pulling down SINE element transcripts makes some sense, as many insertions "borrow" L1 sequences for non-autonomous retro transposition, but can you speculate as to why ERVs are recovered? There should be essentially no overlap in sequence.

      In the L1TD1 evolution paper [8], a potential link between L1TD1 and ERV elements was discussed:

      "Alternatively, L1TD1 in sigmodonts could play a role in genome defense against another element active in these genomes. Indeed, the sigmodontine rodents have a highly active family of ERVs, the mysTR elements [46]. Expansion of this family preceded the death of L1s, but these elements are very active, with 3500 to 7000 speciesspecific insertions in the L1-extinct species examined [47]. This recent ERV amplification in Sigmodontinae contrasts with the megabats (where L1TD1 has been lost in many species); there are apparently no highly active DNA or RNA elements in megabats [48]. If L1TD1 can suppress retroelements other than L1s, this could explain why the gene is retained in sigmodontine rodents but not in megabats."

      Furthermore, Jin et al. report the binding of L1TD1 to repetitive sequences in transcripts [12]. It is possible that some of these sequences are also present in ERV RNAs.

      Response: Interesting, thanks for sharing

      - Is S2B a screenshot? (the red underline).

      No, it is a Powerpoint figure, and we have removed the red underline.

      (4) Figure 3:

      - Text refers to Figure 3B as a western blot. Figure 3B shows a volcano plot. This is likely 3C but would still be out of order (3A>3C>3B referencing). I think this error is repeated in the last result section.

      - Figure and legends fail to mention what gene was used for ddCT method (actin, gapdh, etc.).

      - In general, the supplemental legends feel underwritten and could benefit from additional explanations. (Main figures are appropriate but please double-check that all statistical tests have been mentioned correctly).

      Thank you for pointing this out. We have corrected these errors in the revised manuscript.

      (5) Discussion:

      - Aluy connection is interesting. Is there an "Alu retrotransposition reporter assay" to test whether L1TD1 enhances this as well?

      Thank you for the suggestion. There is indeed an Alu retrotransposition reporter assay reported be Dewannieux et al. [13]. The assay is based on a Neo selection marker. We have previously tested a Neo selection-based L1 retrotransposition reporter assay, but this system failed to properly work in HAP1 cells, therefore we switched to a blasticidin based L1 retrotransposition reporter assay. A corresponding blasticidin-based Alu retrotransposition reporter assay might be interesting for future studies (mentioned in the Discussion, page 11 paragraph 4 of the revised manuscript.

      (6) Material and Methods :

      - The number of typos in the materials and methods is too numerous to list. Instead, please refer to the next section that broadly describes the issues seen throughout the manuscript.

      Writing style

      (1) Keep a consistent style throughout the manuscript: for example, L1 or LINE-1 (also L1 ORF1p or LINE-1 ORF1p); per or "/"; knockout or knock-out; min or minute; 3 times or three times; media or medium. Additionally, as TE naming conventions are not uniform, it is important to maintain internal consistency so as to not accidentally establish an imprecise version.

      (2) There's a period between "et al" and the comma, and "et al." should be italic.

      (3) The authors should explain what the key jargon is when it is first used in the manuscript, such as "retrotransposon" and "retrotransposition".

      (4) The authors should show the full spelling of some acronyms when they use it for the first time, such as RNA Immunoprecipitation (RIP).

      (5) Use a space between numbers and alphabets, such as 5 μg. (6) 2.0 × 105 cells, that's not an "x".

      (7) Numbers in the reference section are lacking (hard to parse).

      (8) In general, there are a significant number of typos in this draft which at times becomes distracting. For example, (P3) Introduction: Yet, co-option of TEs thorough (not thorough, it should be through) evolution has created so-called domesticated genes beneficial to the gene network in a wide range of organisms. Please carefully revise the entire manuscript for these minor issues that collectively erode the quality of this submission. Thank you for pointing out these mistakes. We have corrected them in the revised manuscript. A native speaker from our research group has carefully checked the paper. In summary, we have added Supplementary Figure S7C and have changed Figures 1C, 1E, 1F, 2A, 2D, 3A, 4B, S3A-D, S4B and S6A based on these comments.

      Response: Thank you for taking these comments on board!

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review): 

      Summary: 

      The manuscript focuses on the role of the deubiquitinating enzyme UPS-50/USP8 in endosome maturation. The authors aimed to clarify how this enzyme drives the conversion of early endosomes into late endosomes. Overall, they did achieve their aims in shedding light on the precise mechanisms by which UPS-50/USP8 regulates endosome maturation. The results support their conclusions that UPS-50 acts by disassociating RABX-5 from early endosomes to deactivate RAB-5 and by recruiting SAND-1/Mon1 to activate RAB-7. This work is commendable and will have a significant impact on the field. The methods and data presented here will be useful to the community in advancing our understanding of endosome maturation and identifying potential therapeutic targets for diseases related to endosomal dysfunction. It is worth noting that further investigation is required to fully understand the complexities of endosome maturation. However, the findings presented in this manuscript provide a solid foundation for future studies. 

      We thank this reviewer for the instructive suggestions and encouragement.

      Strengths: 

      The major strengths of this work lie in the well-designed experiments used to examine the effects of UPS-50 loss. The authors employed confocal imaging to obtain a picture of the aftermath of the USP-50 loss. Their findings indicated enlarged early endosomes and MVB-like structures in cells deficient in USP-50/USP8. 

      We thank this reviewer for the instructive suggestions and encouragement.

      Weaknesses: 

      Specifically, there is a need for further investigation to accurately characterize the anomalous structures detected in the usp-50 mutant. Also, the correlation between the presence of these abnormal structures and ESCRT-0 is yet to be addressed, and the current working model needs to be revised to prevent any confusion between enlarged early endosomes and MVBs. 

      Excellent suggestions. USP8 has been identified as a protein associated with ESCRT components, which are crucial for endosomal membrane deformation and scission, leading to the formation of intraluminal vesicles (ILVs) within multivesicular bodies (MVBs). In usp-50 mutants, we observed a significant reduction in the punctate signals of HGRS-1::GFP and STAM-1 (Figure 1G and H; and Figure1-figure supplement 1B), indicating a disruption in ESCRT-0 complex localization (Author response image 1). Additionally, lysosomal structures are markedly reduced in these mutants. In contrast, we found that early endosomes, as marked by FYVE, RAB-5, RABEX5, and EEA1, are significantly enlarged in usp-50 mutants. Electron microscopy (EM) imaging further revealed an increase in large cellular vesicles containing various intraluminal structures. Given the reduction in lysosomal structures and the enlargement of early endosomes in usp-50 mutants, these enlarged vesicles are likely aberrant early endosomes rather than late endosomal or lysosomal structures. To address potential confusion, we have revised the manuscript according to the reviewer's comments and updated the model to accurately reflect these observations.

      Reviewer #2 (Public Review): 

      Summary: 

      In this study, the authors study how the deubiquitinase USP8 regulates endosome maturation in C. elegans and mammalian cells. The authors have isolated USP8 mutant alleles in C. elegans and used multiple in vivo reporter lines to demonstrate the impact of USP8 loss-of-function on endosome morphology and maturation. They show that in USP8 mutant cells, the early endosomes and MVB-like structures are enlarged while the late endosomes and lysosomal compartments are reduced. They elucidate that USP8 interacts with Rabx5, a guanine nucleotide exchange factor (GEF) for Rab5, and show that USP8 likely targets specific lysine residue of Rabx5 to dissociate it from early endosomes. They also find that the localization of USP8 to early endosomes is disrupted in Rabx5 mutant cells. They observe that in both Rabx5 and USP8 mutant cells, the Rab7 GEF SAND-1 puncta which likely represents late endosomes are diminished, although Rabex5 is accumulated in USP8 mutant cells. The authors provide evidence that USP8 regulates endosomal maturation in a similar fashion in mammalian cells. Based on their observations they propose that USP8 dissociates Rabex5 from early endosomes and enhances the recruitment of SAND-1 to promote endosome maturation. 

      We thank this reviewer for the instructive suggestions and encouragement.

      Strengths: 

      The major highlights of this study include the direct visualization of endosome dynamics in a living multi-cellular organism, C. elegans. The high-quality images provide clear in vivo evidence to support the main conclusions. The authors have generated valuable resources to study mechanisms involved in endosome dynamics regulation in both the worm and mammalian cells, which would benefit many members of the cell biology community. The work identifies a fascinating link between USP8 and the Rab5 guanine nucleotide exchange factor Rabx5, which expands the targets and modes of action of USP8. The findings make a solid contribution toward the understanding of how endosomal trafficking is controlled. 

      We thank this reviewer for the instructive suggestions and encouragement.

      Weaknesses: 

      - The authors utilized multiple fluorescent protein reporters, including those generated by themselves, to label endosomal vesicles. Although these are routine and powerful tools for studying endosomal trafficking, these results cannot tell whether the endogenous proteins (Rab5, Rabex5, Rab7, etc.) are affected in the same fashion. 

      Good suggestion. Indeed, to test whether the endogenous proteins (Rab5, Rabex5, Rab7, etc.) are affected in the same fashion as fluorescent protein reporters, we supplemented our approach with the utilization of endogenous markers. These markers, including Rab5, RAB-5, Rabex5, RABX-5, and EEA1 for early endosomes, as well as RAB-7, Mon1a, and Mon1b for late endosomes, were instrumental in our investigations (refer to Figure 3, Figure 6, Figure 5-figure supplement 1, Figure 5-figure supplement 2, and Figure 6-figure supplement 1). Our comprehensive analysis, employing various methodologies such as tissue-specific fused proteins, CRISPR/Cas9 knock-in, and antibody staining, consistently highlights the critical role of USP8 in early-to-late endosome conversion.

      - The authors clearly demonstrated a link between USP8 and Rabx5, and they showed that cells deficient in both factors displayed similar defects in late endosomes/lysosomes. However, the authors didn't confirm whether and/or to which extent USP8 regulates endosome maturation through Rabx5. Additional genetic and molecular evidence might be required to better support their working model. 

      Excellent point. To test whether USP-50 regulates endosome maturation through RABX-5, we performed additional genetic analyses. In rabx-5(null) mutant animals, the morphology of 2xFYVE-labeled early endosomes is comparable to that of wild-type controls (Figure 4H and I). Introducing the rabx-5(null) mutation into usp-50(xd413) backgrounds resulted in a significant suppression of the enlarged early endosome phenotype characteristic of usp-50(xd413) mutants (Figure 4H and I). These findings suggest that USP-50 may modulate the size of early endosomes through its interaction with RABX-5.

      Reviewer #3 (Public Review): 

      Summary: 

      The authors were trying to elucidate the role of USP8 in the endocytic pathway. Using C. elegans epithelial cells as a model, they observed that when USP8 function is lost, the cells have a decreased number and size in lysosomes. Since USP8 was already known to be a protein linked to ESCRT components, they looked into what role USP8 might play in connecting lysosomes and multivesicular bodies (MVB). They observed fewer ESCRT-associated vesicles but an increased number of "abnormal" enlarged vesicles when USP8 function was lost. At this specific point, it's not clear what the objective of the authors was. What would have been their hypothesis addressing whether the reduced lysosomal structures in USP8 (-) animals were linked to MVB formation? Then they observed that the abnormally enlarged vesicles, marked by the PI3P biosensor YFP-2xFYVE, are bigger but in the same number in USP8 (-) compared to wild-type animals, suggesting homotypic fusion. They confirmed this result by knocking down USP8 in a human cell line, and they observed enlarged vesicles marked by YFP-2xFYVE as well. At this point, there is quite an important issue. The use of YFP-2xFYVE to detect early endosomes requires the transfection of the cells, which has already been demonstrated to produce differences in the distribution, number, and size of PI3P-positive vesicles (doi.org/10.1080/15548627.2017.1341465). The enlarged vesicles marked by YFP-2xFYVE would not necessarily be due to the loss of UPS8. In any case, it appears relatively clear that USP8 localizes to early endosomes, and the authors claim that this localization is mediated by Rabex-5 (or Rabx-5). They finally propose that USP8 dissociates Rabx-5 from early endosomes facilitating endosome maturation. 

      Weaknesses: 

      The weaknesses of this study are, on one side, that the results are almost exclusively dependent on the overexpression of fusion proteins. While useful in the field, this strategy does not represent the optimal way to dissect a cell biology issue. On the other side, the way the authors construct the rationale for each approximation is somehow difficult to follow. Finally, the use of two models, C. elegans and a mammalian cell line, which would strengthen the observations, contributes to the difficulty in reading the manuscript. 

      The findings are useful but do not clearly support the idea that USP8 mediates Rab5-Rab7 exchange and endosome maturation, In contrast, they appear to be incomplete and open new questions regarding the complexity of this process and the precise role of USP8 within it. 

      We thank this reviewer for the insightful comments. Fluorescence-fused proteins serve as potent tools for visualizing subcellular organelles both in vivo and in live settings. Specifically, in epidermal cells of worms, the tissue-specific expression of these fused proteins is indispensable for studying organelle dynamics within living organisms. This approach is necessitated by the inherent limitations of endogenously tagged proteins, whose fluorescence signals are often weak and unsuitable for live imaging or genetic screening purposes. Acknowledging concerns raised by the reviewer regarding potential alterations in organelle morphology due to overexpression of certain fused proteins, we supplemented our approach with the utilization of endogenous markers. These markers, including Rab5, RAB-5, Rabex5, RABX-5, and EEA1 for early endosomes, as well as RAB-7, Mon1a, and Mon1b for late endosomes, were instrumental in our investigations (refer to Figure 3, Figure 6, Figure 5-figure supplement 1, Figure 5-figure supplement 2, and Figure 6-figure supplement 1). Our comprehensive analysis, employing various methodologies such as tissue-specific fused proteins, CRISPR/Cas9 knock-in, and antibody staining, consistently highlights the critical role of USP8 in early-to-late endosome conversion. Specifically, we discovered that the recruitment of USP-50/USP8 to early endosomes is depending on Rabex5. However, instead of stabilizing Rabex5, the recruitment of USP-50/USP8 leads to its dissociation from endosomes, concomitantly facilitating the recruitment of the Rab7 GEF SAND-1/Mon1. In cells with loss-of-function mutations in usp-50/usp8, we observed enhanced RABX-5/Rabex5 signaling and mis-localization of SAND-1/Mon1 proteins from endosomes. Consequently, this disruption impairs endolysosomal trafficking, resulting in the accumulation of enlarged vesicles containing various intraluminal contents and rudimentary lysosomal structures.

      Through an unbiased genetic screen, verified by cultured mammalian cell studies, we observed that loss-of-function mutations in usp-50/usp8 result in diminished lysosome/late endosomes. Electron microscopy (EM) analysis indicated that usp-50 mutation leads to abnormally enlarged vesicles containing various intraluminal structures in worm epidermal cells. USP8 is known to regulate the endocytic trafficking and stability of numerous transmembrane proteins. Given that lysosomes receive and degrade materials generated by endocytic pathways, we hypothesized that the abnormally enlarged vesicular structures observed in usp-50 or usp8 mutant cells correspond to the enlarged vesicles coated by early endosome markers. Indeed, in the absence of usp8/usp-50, the endosomal Rab5 signal is enhanced, while early endosomes are significantly enlarged. Given that Rab5 guanine nucleotide exchange factor (GEF), Rabex5, is essential for Rab5 activation, we further investigated its dynamics. Additional analyses conducted in both worm hypodermal cells and cultured mammalian cells revealed an increase of endosomal Rabex5 in response to usp8/usp-50 loss-of-function. Live imaging studies further demonstrated active recruitment of USP8 to newly formed Rab5-positive vesicles, aligning spatiotemporally with Rabex5 regulation. Through systematic exploration of putative USP-50 binding partners on early endosomes, we identified its interaction with Rabex5. Comprehensive genetics and biochemistry experiments demonstrated that USP8 acts through K323 site de-ubiquitination to dissociate Rabex5 from early endosomes and promotes the recruitment of the Rab7 GEF SAND-1/Mon1. In summary, our study began with an unbiased genetic screen and subsequent examination of established theories, leading to the formulation of our own hypothesis. Through multifaceted approaches, we unveiled a novel function of USP8 in early-to-late endosome conversion.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Within Figures 1K-N, diverse anomalous structures were detected in the usp-50 mutant. Further scrutiny is needed to definitively characterize these structures, particularly as the images in Figures 1M and 1L exhibit notable similarities to lamellar bodies.

      We thank the reviewer for the insightful question regarding the resemblance between the vesicles observed in our study and lamellar bodies (LBs). Lamellar bodies are specialized organelles involved in lipid storage and secretion1, prominently studied in keratinocytes of the skin and alveolar type II (ATII) epithelial cells in the lung2. These organelles contain not only lipids but also cell-type specific proteins and lytic enzymes. Due to their acidic pH and functional similarities, LBs are classified as lysosome-related organelles (LROs) or secretory lysosomes3,4. In usp-50 mutants, we observed a considerable number of abnormal vesicles, some of which contain threadlike membrane structures and exhibit morphological similarities to LBs (Figure 2O). However, further analysis with a comprehensive panel of lysosome-related markers demonstrated a significant reduction in lysosomal structures within these mutants. In contrast, vesicles marked by early endosome markers, such as FYVE, RAB-5, RABX-5, and EEA1, were notably enlarged. These results suggest that the enlarged vesicles observed in usp-50 mutants are more likely aberrant early endosomes rather than true lamellar bodies. We have revised the manuscript to reflect these findings and to clearly differentiate between these structures and lysosome-related organelles.

      (2) The correlation between the presence of these abnormal structures and ESCRT-0 remains unaddressed, thus the assertion that UPS-50 regulates endolysosome trafficking in conjunction with ESCRT-0 lacks empirical support.

      We thank the reviewer for the valuable suggestions. We apologize for any confusion and appreciate the opportunity to clarify our findings. The ESCRT machinery is essential for driving endosomal membrane deformation and scission, which leads to the formation of intraluminal vesicles (ILVs) within multivesicular bodies (MVBs). Recent research has shown that the absence of ESCRT components results in a reduction of ILVs in worm gut cells5. In wild type animals, the ESCRT-0 components HGRS-1 and STAM-1 display a distinct punctate distribution (Figure 1G and H). However, in usp-50 mutants, the punctate signals of HGRS-1::GFP and STAM-1::GFP are significantly reduced (Figure 1G and H; and Figure 1-figure supplement 1B), indicating a role for USP-50 in stabilizing the ESCRT-0 complex. Our TEM analysis revealed an accumulation of abnormally enlarged vesicles containing intraluminal structures in usp-50 mutants. When we examined a panel of early endosome and late endosome/lysosome markers, we found that early endosomes are significantly enlarged, while late endosomal/lysosomal structures are markedly reduced in these mutants. This suggests that the abnormal structures observed in usp-50 mutants are likely enlarged early endosomes rather than classical MVBs. To further investigate whether the reduction in ESCRT components contributes to the late endosome/lysosome defects, we analyzed stam-1 mutants. In these mutants, the size of RAB-7-coated vesicles was reduced (Author response image 1C), and the lysosomal marker LAAT-1 indicated a reduction in lysosomal structures (Author response image 1B). These results highlight the importance of the ESCRT complex in late endosome/lysosome formation. However, the morphology of early endosomes, as marked by 2xFYVE, remained similar to that of wild type in stam-1 mutants (Author response image 1A). Therefore, while reduced ESCRT-0 components may contribute to the late endosome/lysosome defects observed in usp-50 mutants, the enlargement of early endosomes in these mutants may involve additional mechanisms. We have revised the manuscript to incorporate these insights and to address the reviewer's comments more comprehensively.

      Author response image 1.

      (A) Confocal fluorescence images of hypodermis expressing YFP::2xFYVE to detect EEs in L4 stage animals in wild type and stam-1(ok406) mutants. Scale bar: 5 μm. (B) Confocal fluorescence images of hypodermal cell 7 (hyp7) expressing the LAAT-1::GFP marker to highlight lysosome structures in 3-day-old adult animals. Compared to wild type, LAAT-1::GFP signal is reduced in stam-1(ok406) animals. Scale bar, 5 μm. (C) The reduction of punctate endogenous GFP::RAB-7 signals in stam-1(ok406) animals. Scale bar: 10 μm.

      (3) Endosomal dysfunction typically leads to significant alterations in the spatial arrangement of marker proteins across distinct endosomes. In the manuscript, the authors examined the distribution and morphology of early endosomes, multivesicular bodies (MVBs), late endosomes, and lysosomes in a usp-50 deficient background primarily through single-channel confocal imaging. By employing two color images showing RAB-5 and RAB-7, in conjunction with HGRS-1, a more comprehensive picture of the aftermath of USP-50 loss can be obtained.

      Good suggestions. We have conducted a double-labeling analysis to examine the distribution of RAB-5 and RAB-7 in conjunction with HGRS-1. In wild type animals, HGRS-1 exhibits a punctate distribution that is partially co-localized with both RAB-5 and RAB-7. In contrast, in usp-50 mutants, the punctate signal of HGRS-1 is significantly reduced, along with its co-localization with RAB-5 and RAB-7 (Author response image 2). These results suggest that, in the absence of USP-50, the stabilization of ESCRT-0 components on endosomes is compromised.

      Author response image 2.

      ESCRT-0 is adjacent to both early endosomes and late endosomes. (A) Confocal fluorescence images of wild-type and usp-50(xd413) hypodermis at L4 stage co-expressing HGRS-1::GFP (hgrs-1 promoter) and endogenous wrmScarlet::RAB-5. (B) HGRS-1 and RAB-5 puncta were analyzed to produce Manders overlap coefficient M1 (HGRS-1/RAB-5) and M2 (RAB-5/HGRS-1) (N=10). (C) Confocal fluorescence images of wild-type and usp-50(xd413) hypodermis at L4 stage co-expressing HGRS-1::GFP (hgrs-1 promoter) and endogenous wrmScarlet::RAB-7. (D) HGRS-1 and RAB-7 puncta were analyzed to produce Manders overlap coefficient M1 (HGRS-1/RAB-7) and M2 (RAB-7/HGRS-1) (N=10). Scale bar: 10 μm for (A) and (C).

      (4) The authors observed enlarged early endosomes in cells depleted of usp-50/usp8, along with enlarged MVB-like structures identified through TEM. The potential identity of these structures as the same organelle could be determined using CLEM.

      We thank the reviewer for the valuable suggestion. Our TEM analysis identified a large number of abnormally enlarged vesicles with various intraluminal structures accumulated in usp-50 mutants. As the reviewer correctly noted, CLEM (correlative light and electron microscopy) would be an ideal approach to further characterize these structures. We have been attempting to implement CLEM in C. elegans for a few years. Given that CLEM relies on fluorescence markers, in this study we focused on two tagged proteins, RAB-5 and RABX-5, which show enlargement in their vesicles in usp-50 mutants. Unfortunately, we encountered significant challenges with this approach, as the GFP-tagged RAB-5 and RABX-5 signals did not survive the electron microscopy procedure. Attempts to align EM sections with residual GFP signaling yielded results that were not convincing. Consequently, we concentrated our analysis on a panel of molecular markers, including 2xFYVE, RAB-5, RABX-5, RAB-7, and LAAT-1. These markers consistently indicated that early endosomes are specifically enlarged in usp-50 mutants, while late endosomal/lysosomal structures are notably reduced. Thus, the abnormal structures identified in usp-50 mutants via TEM are likely to be enlarged early endosomes rather than the classical view of MVBs. We have revised the manuscript to reflect these findings and to clarify this point.

      (5) The working model depicted in Figure 6 Y (right) requires revision, as it has the potential to mislead authors into mistaking enlarged early endosomes for multivesicular bodies (MVBs).

      We thank the reviewer for the excellent suggestion. We have revised the model to clarify that it is the enlarged early endosomes, rather than MVBs, that are observed in usp-50 mutants.

      Reviewer #2 (Recommendations For The Authors):

      (1) Is there any change of Rabx5 protein level in USP8/USP50 mutant cells?

      Good question. In the absence of usp-50/usp8, we indeed observed a noticeable increase in the signal of Rabex5 on endosomes. To determine whether usp-50/usp8 affects the protein level of Rabex5, we investigated the endogenous levels of RABX-5 using the RABX-5::GFP knock-in line. Compared to wild-type controls, we found an elevated protein level of RABX-5::GFP in the knock-in line (Author response image 3). This suggests that USP-50 may play a role in the destabilization of RABX-5/Rabex5 in vivo.

      Author response image 3.

      The endogenous RABX-5 protein level is increased in usp-50 mutants. (A) The RABX-5::GFP KI protein level is increased in usp-50(xd413). (B) Quantification of endogenous RABX-5::GFP protein level in wild type and usp-50(xd413) mutant animals.

      (2) It is interesting that "The rabx-5(null) animals are healthy and fertile and do not display obvious morphological or behavioral defects.", which seems contrary to its role in regulating USP8 localization and endosome maturation.

      It has been previously documented that rabx-5 functions redundantly with rme-6, another RAB-5 GEF in C. elegans, to regulate RAB-5 localization in oocytes6. RNA interference (RNAi) targeting rabx-5 in a rme-6 mutant background results in synthetic lethality, whereas neither rabx-5 nor rme-6 single mutants are essential for worm viability. RME-6 co-localizes with clathrin-coated pits, while Rabex-5 is localized to early endosomes. Rabex-5 forms a stable complex with Rabaptin-5 and is part of a large EEA1-positive complex on early endosomes, whereas RME-6 does not interact with Rabaptin-5 (RABN-5) or EEA-1. These findings suggest that while RME-6 and RABX-5 may function redundantly, they likely play distinct roles in regulating intracellular trafficking processes. In the absence of RABX-5, USP-50 appears to lose its endosomal localization, although the size of the early endosome remains comparable to that of wild type. This observation contrasts with the phenotype associated with USP-50 loss-of-function, in which the early endosome is notably enlarged. These results suggest that residual USP-50 present in the endosomes is sufficient to maintain its role in the endocytic pathway. Conversely, the complete absence of USP-50 likely disrupts the transition of early endosomes to late endosomes, indicating a crucial role of USP-50 in this conversion process. It is also noteworthy that, although loss-of-function of rabx-5 does not result in obvious changes to early endosomes, increasing the gene expression level of rabx-5/Rabex-5 alone is sufficient to cause enlargement of early endosomes (Author response image 4) . Indeed, we observed that loss-of-function mutations in u_sp-50/usp_8 lead to abnormally enlarged early endosomes, accompanied by an enhanced signal of endosomal RABX-5. When the rabx-5(null) mutation was introduced into usp-50 mutant animals, the enlarged early endosome phenotype seen in usp-50 mutants was significantly suppressed (Figure 4H and I). This implies that maintaining a lower level of Rab5 GEF may be crucial for endolysosomal trafficking.

      (3) Does Rabx5 mutation has any impact on early endosomes?

      To address the question, we utilized the CRISPR/Cas9 technique to create a molecular null for rabx-5 (Figure 4E). In the rabx-5(null) mutant animals, we found that the 2xFYVE-labeled early endosomes are indistinguishable from wild type (Figure 4H and 4I). Given that r_abx-5_ functions redundantly with rme-6, another RAB-5 GEF in C. elegans, it is likely that the regulation of early endosome size involves a cooperative interaction between RABX-5 and RME-6.

      (4) The authors observed a reduction of ESCRT-0 components in USP8 mutant cells, could this contribute to the late endosome/lysosome defects?

      Good suggestion. In wild-type animals, the two ESCRT-0 components, HGRS-1 and STAM-1, exhibit a distinct punctate distribution (Figure 1G and H). However, in usp-50 mutants, the punctate signals of HGRS-1::GFP and STAM-1::GFP are significantly diminished (Figure 1G and H; and Figure 1-figure supplement 1B), which aligns with the role of USP-50 in stabilizing the ESCRT-0 complex. To investigate whether the reduction in ESCRT components might contribute to defects in late endosome/lysosome formation, we examined stam-1 mutants. In stam-1 mutants, we observed a reduction in the size of RAB-7-coated vesicles (Author response image 1). Further, when we introduced the lysosomal marker LAAT-1::GFP into stam-1 mutants, we found a substantial decrease in lysosomal structures compared to wild-type animals (Author response image 1). This suggests that the ESCRT complex is essential for proper late endosome/lysosome formation. In contrast, the morphology of early endosomes, as indicated by the 2xFYVE marker, appeared normal in stam-1 mutants, similar to wild-type animals (Author response image 1). This implies that while a reduction in ESCRT-0 components may contribute to the late endosome/lysosome defects observed in usp-50 mutants, the early endosome enlargement phenotype in _usp-5_0 mutants may involve additional mechanisms.

      (5) Rabx5 is accumulated in USP8 mutant cells, I am very curious about the phenotype of USP8-Rabx5 double mutants. Could over-expression of Rabx5 (wild type or mutant forms) cause any defects?

      Excellent suggestions. To address the question, we employed the CRISPR/Cas9 technique to create a molecular null for rabx-5 (Figure 4E). In the rabx-5(null) mutant animals, we observed that the punctate USP-50::GFP signal became diffusely distributed (Figure 4F and G). This suggests that rabx-5 is necessary for the endosomal localization of USP-50. Interestingly, in rabx-5(null) mutant animals, the 2xFYVE-labeled early endosomes appeared similar to those in wild-type animals (Figure 4H and I). When rabx-5(null) was introduced into usp-50 mutant animals, the enlarged early endosome phenotype observed in usp-50 was significantly suppressed (Figure 4H and I). This finding indicates that usp-50 indeed functions through rabx-5 to regulate early endosome size. Additionally, we constructed strains overexpressing either wild-type or K323R mutant RABX-5. Our results showed that overexpression of wild-type RABX-5 led to early endosome enlargement (as indicated by YFP::2xFYVE labeling) (Author response image 4A, B and D). In contrast, overexpression of the K323R mutant RABX-5 did not result in noticeable early endosome enlargement (Author response image 4A, C and D). Together, these data are in consistent with our model that USP-50 may regulate RABX-5 by deubiquitinating the K323 site.

      Author response image 4.

      (A-C) Over-expression wild type RABX-5 causes enlarged EEs (labeled by YFP::2xFYVE) while RABX-5(K323R) mutant form does not. (D) Quantification of the volume of individual YFP::2xFYVE vesicles. Data are presented as mean ± SEM. ****P<0.0001. ns, not significant. One-way ANOVA with Tukey’s test.

      (6) Rabx5 could be ubiquitinated at K88 and K323, and Rabx5-K323R showed different activity when compared with the wild-type protein in USP8 mutant cells. Could the authors provide evidence that USP8 could remove the ubiquitin modification from K323 in Rabx5 protein?

      We appreciate the reviewer's insightful suggestions. To explore the potential of USP-50 in removing ubiquitin modifications from lysine 323 on the RABX-5 protein, we undertook a series of experiments. Initially, we sought to determine whether USP-50 influences the ubiquitination level of RABX-5 in vivo. However, due to the low expression levels of USP-50, we encountered challenges in obtaining adequate amounts of USP-50 protein from worm lysates. To overcome this, we expressed USP-50::4xFLAG in HEK293 cells for subsequent affinity purification. Concurrently, we utilized anti-GFP agarose beads to purify RABX-5::GFP from worms expressing the rabx-5::gfp construct. We then incubated RABX-5::GFP with USP-50::4xFLAG for varying durations and performed immunoblotting with an anti-ubiquitin antibody. As shown in Author response image 5A, our results revealed a decrease in the ubiquitination level of RABX-5 in the presence of USP-50, suggesting that USP-50 directly deubiquitinates RABX-5. Previous studies have indicated that only a minor fraction of recombinant RABX-5 undergoes ubiquitination in HeLa cells, which is believed to have functional significance7. Our findings are consistent with this observation, as only a small fraction of RABX-5 in worms is ubiquitinated. Rabex-5 is known to interact with both K63- and K48-linked poly-ubiquitin chains. To further elucidate whether USP-50 specifically targets K48 or K63-linked ubiquitination at the K323 site of RABX-5, we incubated various HA-tagged ubiquitin mutants with either wild-type or K323R mutant RABX-5 protein. Our results indicated that the K323R mutation reduces K63-linked ubiquitination of RABX-5 (Author response image 5). This experiment was repeated multiple times with consistent results. Additionally, while overexpression of wild-type RABX-5 led to an enlargement of early endosomes, as evidenced by YFP::2xFYVE labeling, overexpression of the K323R mutant did not produce a noticeable effect on endosome size (Author response image 4). Collectively, this finding indicates that RABX-5 is subject to ubiquitin modification in vivo and that USP-50 plays a significant role in regulating this modification at the K323 site.

      Author response image 5.

      (A) RABX-5::GFP protein was purified from worm lysates using anti-GFP antibody. FLAG-tagged USP-50 was purified from HEK293T cells using anti-FLAG antibody. Purified RABX-5::GFP was incubated with USP-50::4FLAG for indicated times (0, 15, 30, 60 mins), followed by immunoblotting using antibody against ubiquitin, FLAG or GFP. In the presence of USP-50::4xFLAG, the ubiquitination level of RABX-5::GFP is decreased. (B) Quantification of RABX-5::GFP ubiquitination level from three independent experiments. (C) HEK293T cells were transfected with HA-Ub or indicated mutants and 4xFLAG tagged RABX-5 or RABX-5 K323R mutant for 48h. The cells were subjected to pull down using the FLAG beads, followed by immunoblotting using antibody against HA or Flag.

      (7) The authors described "the almost identical phenotype of usp-50/usp8 and sand-1/Mon1 mutants", found protein-protein interaction between USP8 and sand-1, and showed that sand1-GFP signal is diminished in USP8 mutant cells. These observations fit with the possibility that USP8 regulates the stability of sand-1 to promote endosomal maturation. Could this be tested and integrated into the current model?

      are grateful for the insightful comments provided by the reviewer. Rab5, known to be activated by Rabex-5, plays a crucial role in the homotypic fusion of early endosomes. Rab5 effectors also include the Rab7 GEF SAND-1/Mon1–Ccz1 complex. Rab7 activation by SAND-1/Mon1-Ccz1 complex is essential for the biogenesis and positioning of late endosomes (LEs) and lysosomes, and for the fusion of endosomes and autophagosomes with lysosomes. The Mon1-Ccz1 complex is able to interact with Rabex5, causing dissociation of Rabex5 from the membrane, which probably terminates the positive feedback loop of Rab5 activation and then promotes the recruitment and activation of Rab7 on endosomes. In our study, we identified an interaction between USP-50 and the Rab5 GEF, RABX-5. In the absence of USP-50, we observed an increased endosomal localization of RABX-5 and the formation of abnormally enlarged early endosomes. This phenotype is reminiscent of that seen in sand-1 loss-of-function mutants, which also exhibit enlarged early endosomes and a concomitant reduction in late endosomes/lysosomes. Notably, USP-50 also interacts with SAND-1, suggesting a potential role in regulating its localization. We could propose several models to elucidate how USP-50 might influence SAND-1 localization, including:

      (1) USP-50 may stabilize SAND-1 through direct de-ubiquitination.

      (2) In the absence of USP-50, the sustained presence of RABX-5 could lead to continuous Rab5 activation, which might hinder or delay the recruitment of SAND-1.

      (3) USP-50 could facilitate SAND-1 recruitment by promoting the dissociation of RABX-5.

      We are actively investigating these models in our laboratory. Due to space constraints, a more detailed exploration of how USP-50 regulates SAND-1 stability will be presented in a separate publication.

      References:

      (1) Schmitz, G., and Müller, G. (1991). Structure and function of lamellar bodies, lipid-protein complexes involved in storage and secretion of cellular lipids. J Lipid Res 32, 1539-1570.

      (2) Dietl, P., and Frick, M. (2021). Channels and Transporters of the Pulmonary Lamellar Body in Health and Disease. Cells-Basel 11. https://doi.org/10.3390/cells11010045.

      (3) Raposo, G., Marks, M.S., and Cutler, D.F. (2007). Lysosome-related organelles: driving post-Golgi compartments into specialisation. Current opinion in cell biology 19, 394-401. https://doi.org/10.1016/j.ceb.2007.05.001.

      (4) Weaver, T.E., Na, C.L., and Stahlman, M. (2002). Biogenesis of lamellar bodies, lysosome-related organelles involved in storage and secretion of pulmonary surfactant. Semin Cell Dev Biol 13, 263-270. https://doi.org/10.1016/s1084952102000551.

      (5) Ott, D.P., Desai, S., Solinger, J.A., Kaech, A., and Spang, A. (2024). Coordination between ESCRT function and Rab conversion during endosome maturation. bioRxiv, 2024.2005.2014.594104. https://doi.org/10.1101/2024.05.14.594104.

      (6) Sato, M., Sato, K., Fonarev, P., Huang, C.J., Liou, W., and Grant, B.D. (2005). Caenorhabditis elegans RME-6 is a novel regulator of RAB-5 at the clathrin-coated pit. Nature cell biology 7, 559-569. https://doi.org/10.1038/ncb1261.

      (7) Mattera, R., Tsai, Y.C., Weissman, A.M., and Bonifacino, J.S. (2006). The Rab5 guanine nucleotide exchange factor Rabex-5 binds ubiquitin (Ub) and functions as a Ub ligase through an atypical Ub-interacting motif and a zinc finger domain. The Journal of biological chemistry 281, 6874-6883. https://doi.org/10.1074/jbc.M509939200.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      Comments on revisions)

      The authors have done a good job at revising the manuscript to put this work into the context of earlier work on brainstem central pattern generators.

      Thank you.

      I still believe the case for the method is not as convincing as it would have been if the method had been validated first on oscillations produced by a known CPG model. Why would the inference of synaptic types from the model CPG voltage oscillations be predetermined? Such inverse problems are quite complicated and their solution is often not unique or sufficiently constrained. Recovering synaptic weights (or CPG parameters) from limited observations of a highly nonlinear system is not warranted (Gutenkunst et al., Universally sloppy parameter sensitivities in systems biology models, PLoS Comp. Biol. 2007; www.doi.org/10.1371/journal.pcbi.0030189) especially when using surrogate biological models like Hodgkin-Huxley models.

      The model of the CPG is irrelevant for such a test of validity because what we reconstruct are postsynaptic conductances of an individual neuron. The network creates a periodic input to this neuron and thus forms a periodic pattern of excitatory and inhibitory conductances. The nature of this input, whether autonomously generated or created artificially (say by periodic optogenetic stimulation), is generally not important. To illustrate this, we used a one-compartment conductance-based (Hodgkin-Huxley style) model neuron incorporating a certain common set of channels (fast sodium (I<sub>NaF</sub>), potassium delayed rectifier (I<sub>Kdr</sub>), persistent sodium (I<sub>NaP</sub>), calcium-dependent potassium (I<sub>KCa</sub>), and cationic non-specific current (I<sub>CAN</sub>)), as well as excitatory and inhibitory synaptic channels whose conductances were implemented as predefined periodic functions. The test suggested by the reviewer would be to implement a current-step protocol similar to the experiments and apply our technique to see if the reconstructed conductance profiles match those predefined functions. Below we show the reconstruction steps for the following arbitrarily chosen pattern:

      𝑔<sub>𝐸𝑋𝐶</sub>(𝑡) /𝑔<sub>𝐿𝐸𝐴𝐾</sub> = 0.1(1 + sin(π𝑡)) and 𝑔<sub>𝐼𝑁𝐻</sub>(𝑡)/𝑔<sub>𝐿𝐸𝐴𝐾</sub> = 0.1 (1 + cos(π𝑡)). Author response image 1 below shows the baseline activity of this model neuron in the absence of the injected current.

      Author response image 1.

      Then we applied a current-step protocol with four steps producing different levels of hyperpolarization and applied our method by calculating the total conductance using linear regression (see the current-voltage plots below) and then decomposing it into the excitatory and inhibitory components.

      Author response image 2.

      As one can see, the reconstructed conductances in Author response image 3 below are nearly identical to their theoretical profiles. This is not surprising because all voltage-dependent currents in the model neuron were inactive in the range of voltages matching our experimental conditions. Therefore, the model could be reduced to just the leak current, synaptic currents and the injected current, which matches precisely the model we used in our manuscript.

      Author response image 3.

      In p.2, the edited section refers to the interspike interval being much smaller than the period of the network. More important is to mention the relationship between the decay time of inhibitory synapses and the period of the network.

      This interpretation misunderstands the focus of our method. The edited sections (including in the theory section of Results) highlight the conditions under which the capacitive current becomes negligible, emphasizing that the membrane time constant must be much smaller than the network oscillation period. This separation of time scales ensures that the membrane potential adjusts quickly to changes in postsynaptic conductance, rendering the capacitive current insignificant over the network’s rhythm. In contrast, the synaptic decay time governs how presynaptic inputs are transduced into postsynaptic conductances—a process relevant to understanding synaptic dynamics but not directly tied to our method’s core objective. Our approach reconstructs postsynaptic conductances from intracellular recordings, not presynaptic spike trains. While interpreting these conductance profiles in terms of specific synaptic connections would indeed involve synaptic decay dynamics, such an analysis exceeds the scope of our paper. Thus, the condition emphasized in the edited sections—concerning the membrane time constant and network period—is the critical one for our method’s applicability, and the synaptic decay time, while relevant to broader synaptic modeling, does not undermine our conclusions.

      We have added the requirement for a much smaller membrane time constant in the Introduction on page 2. The Results theory section already incorporates an extensive discussion of this requirement.

      Comments from the editors:

      We apologize for the delay in coming to this decision, but there was quite a bit of post-review discussion that needed to be resolved. There are two issues that the reviewers agree should be addressed. They remain unconvinced that the simplifying assumptions of the approach are valid. 1) The main issue with the phase argument is that the biological synaptic conductance depends on time and not on the phase of the respiratory cycle as mentioned in the first round of reviews. The approximation g(t)=g(phase) seems to be far too simple to be biologically realistic.

      As we elaborate below, time and phase are fundamentally and mathematically equivalent representations of the same underlying dynamics in a periodic system, and thus, a phase-based representation—where conductances are expressed as functions of the cycle’s phase—is a justified and effective approach for capturing their behavior. We have added this explanation to the theory section of Results. Below are the bases for our assertion.

      In a periodic system, such as the respiratory CPG, the system’s behavior repeats at regular intervals, defined by a period T. For the respiratory cycle in our experimental preparation, this period is approximately 3–4 seconds, encompassing phases like inspiration, post-inspiration, and expiration. In such systems:

      Time (t) is a continuous variable that progresses linearly.

      Phase (φ) represents the position within one cycle, typically normalized between 0 and 1 (or 0 to 2π in some contexts). It can be mathematically related to time via: φ(t) = (t mod T)/T, where (t mod T) is the time elapsed within the current cycle.

      Because the system is periodic, any variable that repeats with period T—such as synaptic conductance in a rhythmically active network—can be expressed as a function of either time or phase. Specifically, if g(t) is periodic with period T, then g(t) = g(t+T). This periodicity allows us to redefine g(t) in terms of phase: g(t) = g(φ(t)), where φ(t) maps time onto a repeating cycle. Thus, in a periodic system, time and phase are fundamentally equivalent representations of the same underlying dynamics. Saying that synaptic conductance depends on phase is mathematically equivalent to saying it depends on time in a periodic manner.

      In a rhythmically active network like the respiratory central pattern generator (CPG), the synaptic conductances, regardless of the specific mechanisms by which they are formed, exhibit periodicity that matches the network’s oscillatory cycle. This occurs because the conductances are driven by the repetitive activity of presynaptic neurons, which are synchronized to the network’s overall rhythm. As a result, the synaptic conductances vary with the same period as the network, making a phase-based representation—where conductances are expressed as functions of the cycle’s phase—a justified and effective approach for capturing their behavior. In our study, we utilized the in situ arterially perfused brainstem-spinal cord preparation from mature rats, which is known to produce a highly periodic respiratory rhythm. To ensure the consistency of this periodicity, we carefully selected recordings where the coefficient of variation of the respiratory cycle period was less than 10%, as outlined in our methods. This strict selection criterion confirms the stability and regularity of the rhythm, supporting the validity of using a phase representation to analyze the synaptic conductances.

      (2) Figure S1 is problematic. First, the currents injected appear to be infinitesimally small.

      There was a typo in the current units, which should be nA and not pA, as evident from the injected current–membrane potential plots in Figure 1B. Figure S1 has been corrected.

      Second, the input resistance is completely independent of voltage, as though there was little or no contribution from hyperpolarization activated currents, which would be surprising.

      While hyperpolarization-activated currents are indeed present in many neuronal types and could theoretically affect input resistance, our data consistently show linear I-V relationships across the voltage range tested (-60 to -100 mV) for the neurons analyzed (see Figure S1 and Author response image 4-9 below). This linearity suggests that, under our experimental conditions, the contribution of voltage-dependent currents, such as h-currents, is negligible within this range.

      Additionally, we now indicate in the manuscript in the theory section of Results how the presence of significant hyperpolarization-activated h-currents would impact our synaptic conductance reconstruction method. In current-clamp recordings, non-linearity from h-currents could introduce voltage-dependent changes in total conductance unrelated to synaptic inputs, potentially skewing the reconstruction. However, this concern does not apply to voltage-clamp recordings, where the membrane potential is held constant, eliminating contributions from voltage-dependent intrinsic currents. As strong evidence of the minimal influence of h-currents, we directly compared synaptic conductance reconstructions using both current-clamp and voltage-clamp protocols in a subset of neurons. The results from these two approaches were highly consistent, indicating that h-currents do not significantly affect our findings. This robustness across experimental methods reinforces the reliability of our conclusions.

      Together, the linear I-V relationships and the agreement between current- and voltage-clamp reconstructions provide compelling evidence that our method accurately captures synaptic conductances without interference from h-currents.

      Typical examples of I-V relationships for each respiratory neuron firing phenotype:

      Author response image 4.

      ramp-I

      Author response image 5.

      pre-I/I

      Author response image 6.

      post-I

      Author response image 7.

      aug-E

      Author response image 8.

      early-I

      Author response image 9.

      late-I

    1. Author Response

      The following is the authors’ response to the previous reviews.

      We thank the reviewers for truly valuable advice and comments. We have made multiple corrections and revisions to the original pre-print accordingly per the following comments:

      1. Pro1153Leu is extremely common in the general population (allele frequency in gnomAD is 0.5). Further discussion is warranted to justify the possibility that this variant contributes to a phenotype documented in 1.5-3% of the population. Is it possible that this variant is tagging other rare SNPs in the COL11A1 locus, and could any of the existing exome sequencing data be mined for rare nonsynonymous variants?

      One possible avenue for future work is to return to any existing exome sequencing data to query for rare variants at the COL11A1 locus. This should be possible for the USA MO case-control cohort. Any rare nonsynonymous variants identified should then be subjected to mutational burden testing, ideally after functional testing to diminish any noise introduced by rare benign variants in both cases and controls. If there is a significant association of rare variation in AIS cases, then they should consider returning to the other cohorts for targeted COL11A1 gene sequencing or whole exome sequencing (whichever approach is easier/less expensive) to demonstrate replication of the association.

      Response: Regarding the genetic association of the common COL11A1 variant rs3753841 (p.(Pro1335Leu)), we do not propose that it is the sole risk variant contributing to the association signal we detected and have clarified this in the manuscript. We concluded that it was worthy of functional testing for reasons described here. Although there were several common variants in the discovery GWAS within and around COL11A1, none were significantly associated with AIS and none were in linkage disequilibrium (R2>0.6) with the top SNP rs3753841. We next reviewed rare (MAF<=0.01) coding variants within the COL11A1 LD region of the associated SNP (rs3753841) in 625 available exomes representing 46% of the 1,358 cases from the discovery cohort. The LD block was defined using Haploview based on the 1KG_CEU population. Within the ~41 KB LD region (chr1:103365089- 103406616, GRCh37) we found three rare missense mutations in 6 unrelated individuals, Table below. Two of them (NM_080629.2: c.G4093A:p.A1365T; NM_080629.2:c.G3394A:p.G1132S), from two individuals, are predicted to be deleterious based on CADD and GERP scores and are plausible AIS risk candidates. At this rate we could expect to find only 4-5 individuals with linked rare coding variants in the total cohort of 1,358 which collectively are unlikely to explain the overall association signal we detected. Of course, there also could be deep intronic variants contributing to the association that we would not detect by our methods. However, given this scenario, the relatively high predicted deleteriousness of rs3753841 (CADD= 25.7; GERP=5.75), and its occurrence in a GlyX-Y triplet repeat, we hypothesized that this variant itself could be a risk allele worthy of further investigation.

      Author response table 1.

      We also appreciate the reviewer’s suggestion to perform a rare variant burden analysis of COL11A1. We did conduct pilot gene-based analysis in 4534 European ancestry exomes including 797 of our own AIS cases and 3737 controls and tested the burden of rare variants in COL11A1. SKATO P value was not significant (COL11A1_P=0.18), but this could due to lack of power and/or background from rare benign variants that could be screened out using the functional testing we have developed.

      1. COL11A1 p.Pro1335Leu is pursued as a direct candidate susceptibility locus, but the functional validation involves both: (a) a complementation assay in mouse GPCs, Figure 5; and (b) cultured rib cartilage cells from Col11a1-Ad5 Cre mice (Figure 4). Please address the following:

      2A. Is Pro1335Leu a loss of function, gain of function, or dominant negative variant? Further rationale for modeling this change in a Col11a1 loss of function cell line would be helpful.

      Response: Regarding functional testing, by knockdown/knockout cell culture experiments, we showed for the first time that Col11a1 negatively regulates Mmp3 expression in cartilage chondrocytes, an AIS-relevant tissue. We then tested the effect of overexpressing the human wt or variant COL11A1 by lentiviral transduction in SV40-transformed chondrocyte cultures. We deleted endogenous mouse Col11a1 by Cre recombination to remove the background of its strong suppressive effects on Mmp3 expression. We acknowledge that Col11a1 missense mutations could confer gain of function or dominant negative effects that would not be revealed in this assay. However as indicated in our original manuscript we have noted that spinal deformity is described in the cho/cho mouse, a Col11a1 loss of function mutant. We also note the recent publication by Rebello et al. showing that missense mutations in Col11a2 associated with congenital scoliosis fail to rescue a vertebral malformation phenotype in a zebrafish col11a2 KO line. Although the connection between AIS and vertebral malformations is not altogether clear, we surmise that loss of the components of collagen type XI disrupt spinal development. in vivo experiments in vertebrate model systems are needed to fully establish the consequences and genetic mechanisms by which COL11A1 variants contribute to an AIS phenotype.

      2B. Expression appears to be augmented compared WT in Fig 5B, but there is no direct comparison of WT with variant.

      Response: Expression of the mutant (from the lentiviral expression vector) is increased compared to mutant. We observed this effect in repeated experiments. Sequencing confirmed that the mutant and wildtype constructs differed only at the position of the rs3753841 SNP. At this time, we cannot explain the difference in expression levels. Nonetheless, even when the variant COL11A1 is relatively overexpressed it fails to suppress MMP3 expression as observed for the wildtype form.

      2C. How do the authors know that their complementation data in Figure 5 are specific? Repetition of this experiment with an alternative common nonsynonymous variant in COL11A1 (such as rs1676486) would be helpful as a comparison with the expectation that it would be similar to WT.

      Response: We agree that testing an allelic series throughout COL11A1 could be informative, but we have shifted our resources toward in vivo experiments that we believe will ultimately be more informative for deciphering the mechanistic role of COL11A1 in MMP3 regulation and spine deformity.

      2D. The y-axes of histograms in panel A need attention and clarification. What is meant by power? Do you mean fold change?

      Response: Power is directly comparable to fold change but allows comparison of absolute expression levels between different genes.

      2E. Figure 5: how many technical and biological replicates? Confirm that these are stated throughout the figures.

      Response: Thank you for pointing out this oversight. This information has been added throughout.

      1. Figure 2: What does the gross anatomy of the IVD look like? Could the authors address this by showing an H&E of an adjacent section of the Fig. 2 A panels?

      Response: Panel 2 shows H&E staining. Perhaps the reviewer is referring to the WT and Pax1 KO images in Figure 3? We have now added H&E staining of WT and Pax1 KO IVD as supplemental Figure 3E to clarify the IVD anatomy.

      1. Page 9: "Cells within the IVD were negative for Pax1 staining ..." There seems to be specific PAX1 expression in many cells within the IVD, which is concerning if this is indeed a supposed null allele of Pax1. This data seems to support that the allele is not null.

      Response: We have now added updated images for the COL11A1 and PAX1 staining to include negative controls in which we omitted primary antibodies. As can be seen, there is faint autofluorescence in the PAX1 negative control that appears to explain the “specific staining” referred to by the reviewer. These images confirm that the allele is truly a null.

      1. There is currently a lack of evidence supporting the claim that "Col11a1 is positively regulated by Pax1 in mouse spine and tail". Therefore, it is necessary to conduct further research to determine the direct regulatory role of Pax1 on Col11a1.

      Response: We agree with the reviewer and have clarified that Pax1 may have either a direct or indirect role in Col11a1 regulation.

      1. There is no data linking loss of COL11A1 function and spine defects in the mouse model. Furthermore, due to the absence of P1335L point mutant mice, it cannot be confirmed whether P1335L can actually cause AIS, and the pathogenicity of this mutation cannot be directly verified. These limitations need to be clearly stated and discussed. A Col11a1 mouse mutant called chondroysplasia (cho), was shown to be perinatal lethal with severe endochondral defects (https://pubmed.ncbi.nlm.nih.gov/4100752/). This information may help contextualize this study.

      Response: We partially agree with the reviewer. Spine defects are reported in the cho mouse (for example, please see reference 36 Hafez et al). We appreciate the suggestion to cite the original Seegmiller et al 1971 reference and have added it to the manuscript.

      1. A recent article (PMID37462524) reported mutations in COL11A2 associated with AIS and functionally tested in zebrafish. That study should be cited and discussed as it is directly relevant for this manuscript.

      Response: We agree with the reviewer that this study provides important information supporting loss of function I type XI collagen in spinal deformity. Language to this effect has been added to the manuscript and this study is now cited in the paper.

      1. Please reconcile the following result on page 10 of the results: "Interestingly, the AISassociated gene Adgrg6 was amongst the most significantly dysregulated genes in the RNA-seq analysis (Figure 3c). By qRT-PCR analysis, expression of Col11a1, Adgrg6, and Sox6 were significantly reduced in female and male Pax1-/- mice compared to wild-type mice (Figure 3d-g)." In Figure 3f, the downregulation of Adgrg6 appears to be modest so how can it possibly be highlighted as one of the most significantly downregulated transcripts in the RNAseq data?

      Response: By “significant” we were referring to the P-value significance in RNAseq analysis, not in absolute change in expression. This language was clearly confusing, and we have removed it from the manuscript.

      1. It is incorrect to refer to the primary cell culture work as growth plate chondrocytes (GPCs), instead, these are primary costal chondrocyte cultures. These primary cultures have a mixture of chondrocytes at differing levels of differentiation, which may change differentiation status during the culturing on plastic. In sum, these cells are at best chondrocytes, and not specifically growth plate chondrocytes. This needs to be corrected in the abstract and throughout the manuscript. Moreover, on page 11 these cells are referred to as costal cartilage, which is confusing to the reader.

      Response: Thank you for pointing out these inconsistencies. We have changed the manuscript to say “costal chondrocytes” throughout.

      Minor points

      • On 10 of the Results: "These data support a mechanistic link between Pax1 and Col11a1, and the AIS-associated genes Gpr126 and Sox6, in affected tissue of the developing tail." qRT-PCR validation of Sox6, although significant, appears to be very modestly downregulated in KO. Please soften this statement in the text.

      Response: We have softened this statement.

      • Have you got any information about how the immortalized (SV40) costal cartilage affected chondrogenic differentiation? The expression of SV40 seemed to stimulate Mmp13 expression. Do these cells still make cartilage nodules? Some feedback on this process and how it affects the nature of the culture what be appreciated.

      Response: The “+ or –“ in Figure 5 refers to Ad5-cre. Each experiment was performed in SV40-immortalized costal chondrocytes. We have removed SV40 from the figure and have clarified the legend to say “qRT-PCR of human COL11A1 and endogenous mouse Mmp3 in SV40 immortalized mouse costal chondrocytes transduced with the lentiviral vector only (lanes 1,2), human WT COL11A1 (lane 3), or COL11A1P1335L. Otherwise we absolutely agree that understanding Mmp13 regulation during chondrocyte differentiation is important. We plan to study this using in vivo systems.

      • Figure 1: is the average Odds ratio, can this be stated in the figure legend?

      Response: We are not sure what is being asked here. The “combined odds ratio” is calculated as a weighted average of the log of the odds.

      • A more consistent use of established nomenclature for mouse versus human genes and proteins is needed.

      Human:GENE/PROTEIN Mouse: Gene/PROTEIN

      Response: Thank you for pointing this out. The nomenclature has been corrected throughtout the manuscript.

      • There is no Figure 5c, but a reference to results in the main text. Please reconcile. -There is no Figure 5-figure supplement 5a, but there is a reference to it in the main text. Please reconcile.

      Response: Figure references have been corrected.

      • Please indicate dilutions of all antibodies used when listed in the methods.

      Response: Antibody dilutions have been added where missing.

      • On page 25, there is a partial sentence missing information in the Histologic methods; "#S36964 Invitrogen, CA, USA)). All images were taken..."

      Response: We apologize for the error. It has been removed.

      • Table 1: please define all acronyms, including cohort names.

      Response: We apologize for the oversight. The legend to the Table has been updated with definitions of all acronyms.

      • Figure 2: Indicate that blue staining is DAPI in panel B. Clarify that "-ab" as an abbreviation is primary antibody negative.

      Response: A color code for DAPI and COL11A! staining has been added and “-ab” is now defined.

      • Page 4: ADGRG6 (also known as GPR126)...the authors set this up for ADGRG6 but then use GPR126 in the manuscript, which is confusing. For clarity, please use the gene name Adgrg6 consistently, rather than alternating with Gpr126.

      Response: Thank you for pointing this out. GPR126 has now been changed to ADGRG6 thoughout the manuscript.

      • REF 4: Richards, B.S., Sucato, D.J., Johnston C.E. Scoliosis, (Elsevier, 2020). Is this a book, can you provide more clarity in the Reference listing?

      Response: Thank you for pointing this out. This reference has been corrected.

      • While isolation was addressed, the methods for culturing Rat cartilage endplate and costal chondrocytes are poorly described and should be given more text.

      Response: Details about the cartilage endplate and costal chondrocyte isolation and culture have been added to the Methods.

      • Page 11: 1st paragraph, last sentence "These results suggest that Mmp3 expression"... this sentence needs attention. As written, I am not clear what the authors are trying to say.

      Response: This sentence has been clarified and now reads “These results suggest that Mmp3 expression is negatively regulated by Col11a1 in mouse costal chondrocytes.”

      • Page 13: line 4 from the bottom, "ECM-clearing"? This is confusing do you mean ECM degrading?

      Response: Yes and thank you. We have changed to “ECM-degrading”.

      • Please use version numbers for RefSeq IDs: e.g. NM_080629.3 instead of NM_080629

      Response: This change has been made in the revised manuscript.

      • It would be helpful for readers if the ethnicity of the discovery case cohort was clearly stated as European ancestry in the Results main text.

      Response: “European ancestry” has been added at first description of the discovery cohort in the manuscript.

      • Avoid using the term "mutation" and use "variant" instead.

      Response: Thank you for pointing this out. “Variant” is now used throughout the manuscript.

      • Define error bars for all bar charts throughout and include individual data points overlaid onto bars.

      Response: Thank you. Error bars are now clarified in the Figure legends.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #2 (Public review):

      Summary:

      The authors reported that mutations were identified in the ZC3H11A gene in four adolescents from 1015 high myopia subjects in their myopia cohort. They further generated Zc3h11a knockout mice utilizing the CRISPR/Cas9 technology.

      Comments on revisions:

      Chong Chen and colleagues revised the manuscript; however, none of my suggestions from the initial review have been sufficiently addressed.

      (1) I indicated that the pathogenicity and novelty of the mutation need to be determined according to established guidelines and databases. However, the conclusion was still drawn without sufficient justification.

      Thank you for your valuable feedback on the assessment of mutation pathogenicity and novelty. We regret to inform you that complete familial genetic information required for segregation analysis is currently unavailable in this study. Despite our exhaustive efforts to contact the four mutation carriers and their relatives, we encountered the following uncontrollable limitations: Two patients could not be further traced due to invalid contact information, one patient had relocated to another region, making sample collection logistically unfeasible, the remaining patient explicitly declined family participation in genetic testing due to privacy concerns.

      We fully acknowledge that the lack of pedigree data may affect the certainty of pathogenicity evaluation. To address this limitation, we systematically analyzed the four ZC3H11A missense mutations (c.412G>A p.V138I, c.128G>A p.G43E, c.461C>T p.P154L, and c.2239T>A p.S747T) based on ACMG guidelines and database evidence. The key findings are summarized below: All of the identified mutations exhibited very low frequencies or does not exist in the Genome Aggregation Database (gnomAD) and Clinvar, and using pathogenicity prediction software SIFT, PolyPhen2, and CADD, most of them display high pathogenicity levels. Among them, c.412G>A, c.128G>A and c.461C>T were located in or around a domain named zf-CCCH_3 (Figure 1A and B). Furthermore, all of the mutation sites were located in highly conserved amino acids across different species (Figure 1C). The four mutations induced higher structural flexibility and altered the negative charge at corresponding sites, potentially disrupting protein-RNA interactions (Figure 1D and E). Concurrently, overexpression of mutant constructs (ZC3H11A-V138I, ZC3H11A-G43E, ZC3H11A-P154L, and ZC3H11A-S747T) revealed significantly reduced nuclear IκBα mRNA levels compared to the wild-type, suggesting impaired NF-κB pathway regulation (Supplementary Figure 4). Zc3h11a knockout mice also exhibited a myopic phenotype, with alterations in the PI3K-AKT and NF-κB signaling pathways. Integrating this evidence, the mutations meet the following ACMG criteria: PM1 (domain-located mutations), PM2 (extremely low population frequency), PP3 (computational predictions supporting pathogenicity), PS3 (functional validation via experimental assays). Under the ACMG framework, these mutations are classified as "Likely Pathogenic".

      Regarding the novelty of this mutation, comprehensive searches in ClinVar, dbSNP, and HGMD databases revealed no prior reports associating this variant with myopia. Similarly, a PubMed literature search identified no direct evidence linking this mutation to myopia. Based on this evidence, we classify this variant as a likely pathogenic and novel mutation.

      On the other hand, we acknowledge that the absence of family segregation data may reduce the confidence in pathogenicity assessment. Nevertheless, functional experiments and converging multi-level evidence strongly support the reliability of our conclusion. Future studies will prioritize family-based validation to strengthen the evidence chain. We sincerely appreciate your attention to this matter and kindly request your understanding of the practical limitations inherent to this research.

      (2) The phenotype of heterozygous mutant mice is too weak to support the gene's contribution to high myopia. The revised manuscript does not adequately address these discrepancies. Furthermore, no explanation was provided for why conditional gene deletion was not used to avoid embryonic lethality, nor was there any discussion on tissue- or cell-specific mechanistic investigations.

      We sincerely appreciate your insightful comments regarding the relationship between murine phenotypes and human disease. We fully acknowledge your concerns about the phenotypic strength of Zc3h11a heterozygous mutant mice and their association with high myopia (HM) pathogenesis. Here we provide point-by-point responses to your valuable comments: Our study demonstrates that Zc3h11a heterozygous mutant mice exhibit myopic refractive phenotypes with upregulated myopia-associated factors (TGF-β1, MMP2, and IL6), although axial elongation did not reach statistical significance. Notably, at 4 and 6 weeks of age, Het mice did display longer axial lengths and vitreous chamber depths compared to WT mice. While these differences did not reach statistical significance at other time points, an increasing trend was still observed. Several technical considerations may explain these findings: The small murine eye size (where 1D refractive change corresponds to only 5-6μm axial length change). The theoretical resolution limit of 6μm for the SD-OCT device used in this study. These factors likely contributed to the marginal statistical significance observed in the subtle changes of vitreous chamber depth and axial length measurements. Additionally, existing research indicates that axial length measurements from frozen sections in age-matched mice tend to be longer than those obtained through in vivo measurements. This phenomenon may reflect species differences between humans and mice - while both show significant refractive power changes, the axial length differences are less pronounced in mice. These results align with previous reports of phenotypic differences between mouse models and human myopia.

      To address these issues comprehensively, we have added a dedicated discussion section in the revised manuscript specifically examining these axial length measurement considerations, following your valuable suggestion.

      Additionally, we regret to inform you that the currently available floxed ZC3H11A mouse strain requires a minimum of 12-18 months for custom construction, which exceeds our research timeline due to current resource limitations in our team. To address this gap, we have supplemented the discussion section with additional content regarding tissue- and cell-specific mechanisms. Based on your constructive suggestions, we will prioritize the following in our subsequent work: Collaborate with transgenic animal centers to generate Zc3h11a conditional knockout mice. Evaluate the impact of specific knockouts on myopia progression using form-deprivation (FDM) models. While we recognize the limitations of our current study, we believe that by integrating clinical cohort data, phenotypic evidence, and functional experiments, this research provides valuable directional evidence for ZC3H11A's potential role in myopia pathogenesis. Your comments will significantly contribute to improving our future research design, and we sincerely hope you can recognize the exploratory significance of our current findings.

      (3) The title, abstract, and main text continue to misrepresent the role of the inflammatory intracellular PI3K-AKT and NF-κB signaling cascade in inducing high myopia. No specific cell types have been identified as contributors to the phenotype. The mice did not develop high myopia, and no relationship between intracellular signaling and myopia progression has been demonstrated in this study.

      Thank you for your valuable comments regarding the interpretation of signaling pathways in our study. We fully acknowledge your rigorous concerns about the role of PI3K-AKT and NF-κB signaling cascades in high myopia and recognize that we did not identify specific cell types contributing to the observed phenotype. In response to your feedback, we have removed the hypothetical statement linking genetic changes within inflammatory cells to the development of myopia. The current interpretation is strictly based on experimental evidence of pathway relevance and is supported by the theoretical basis presented in the reference, specifically that loss of Zc3h11a leads to activation of the PI3K-AKT and NF-κB pathways in retinal cells, contributing to the myopic phenotype.

      Author response image 1.

      Model of the association between inflammation and myopia progression. Activated mAChR3 (M3R) activates phosphoinositide 3-kinase (PI3K)–AKT and mitogen-associated protein kinase (MAPK) signaling pathways, in turn activating NF-κB and AP1 (i.e., the Jun.-Fos heterodimer) and stimulating the expression of the target genes NF-κB, MMP2, TGFβ, IL- 1β and -6, and TNF-α. MMP2 and TGF-β promote tissue remodeling and TNF-α may act in a paracrine feedback loop in the retina or sclera to activate NF-κB during myopia progression.

      To address the limitations raised, we will prioritize the following in future studies: Cell-type-specific knockout models to identify key cellular contributors. Mechanistic investigations to establish causal relationships between signaling pathways and myopia progression. We sincerely appreciate your rigorous review, which has significantly improved the scientific accuracy and clarity of our manuscript. We believe the revised version better reflects both the novelty and limitations of our findings. We kindly request your recognition of the study’s contributions while acknowledging its current constraints.

      Reviewer #3 (Public review):

      Chen et al have identified a new candidate gene for high myopia, ZC3H11A, and using a knock-out mouse model, have attempted to validate it as a myopia gene and explain a potential mechanism. They identified 4 heterozygous missense variants in highly myopic teenagers. These variants are in conserved regions of the protein, and predicted to be damaging, but the only evidence the authors provide that these specific variants affect protein function is a supplement figure showing decreased levels of IκBα after transfection with overexpression plasmids (not specified what type of cells were transfected). This does not prove that these mutations cause loss of function, in fact it implies they have a gain-of-function mechanism. They then created a knock-out mouse. Heterozygotes show myopia at all ages examined but increased axial length only at very early ages. Unfortunately, the authors do not address this point or examine corneal structure in these animals. They show that the mice have decreased B-wave amplitude on electroretinogram (a sign of retinal dysfunction associated with bipolar cells), and decreased expression of a bipolar cell marker, PKCα. On electron microscopy, there are morphologic differences in the outer nuclear layer (where bipolar, amacrine, and horizontal cell bodies reside). Transcriptome analysis identified over 700 differentially expressed genes. The authors chose to focus on the PI3K-AKT and NF-κB signaling pathways and show changes in expression of genes and proteins in those pathways, including PI3K, AKT, IκBα, NF-κB, TGF-β1, MMP-2 and IL-6, although there is very high variability between animals. They propose that myopia may develop in these animals either as a result of visual abnormality (decreased bipolar cell function in the retina) or by alteration of NF-κB signaling. These data provide an interesting new candidate variant for development of high myopia, and provide additional data that MMP2 and IL6 have a role in myopia development. For this revision, none of my previous suggestions have been addressed.

      Reviewer #3 (Recommendations for the authors):

      None of these suggestions were addressed in the revision:

      Major issues:

      (1) Figure 2: refraction is more myopic but axial length is not longer - why is this not discussed and explored? The text claims the axial length is longer, but that is not supported by the figure. If this is a measurement issue, that needs to be discussed in the text.

      We sincerely appreciate your valuable comments regarding the relationship between refractive status and axial length in our study. In response to your concerns, we have conducted an in-depth analysis and would like to address the issues as follows:

      Our data demonstrate significant differences in refractive error between heterozygous (Het) and wild-type (WT) mice during the 4-10 weeks. Notably, at 4 and 6 weeks of age, Het mice did exhibit longer axial lengths and greater vitreous chamber depth compared to WT mice, although these differences did not reach statistical significance at other time points while still showing an increasing trend. Additional measurements of corneal curvature revealed no significant differences between groups. Considering the small size of mouse eyes (where a 1D refractive change corresponds to only 5-6μm axial length change) and the theoretical resolution limit of 6μm for the SD-OCT device used in this study, these technical factors may account for the marginal statistical significance of the observed small changes in vitreous chamber depth and axial length measurements. Furthermore, existing studies have shown that axial length measurements from frozen sections tend to be longer than those obtained from in vivo measurements in age-matched mice. These considerations provide plausible explanations for the apparent discrepancy between refractive changes and axial length parameters. Following your suggestion, we have added a dedicated discussion section addressing these axial length measurement issues in the revised manuscript. We fully understand your concerns regarding data consistency, and your comments have prompted us to conduct more comprehensive and thorough analysis of our results. We believe the revised manuscript now more accurately reflects our findings while providing important technical references for future studies.

      (2)  Slipped into the methods is a statement that mice with small eyes or ocular lesions were excluded. How many mice were excluded? Are the authors ignoring another phenotype of these mice?

      We appreciate your attention to the exclusion criteria and their implications. Below we provide a detailed clarification: A total of 7 mice (4 Het-KO and 3 WT) with small eyes or ocular lesions were excluded from the observation cohort. These anomalies were consistent with the baseline incidence of spontaneous malformations observed in historical colony data of wild-type C57BL/6J mice (approximately 11%), and were not attributed to the Zc3h11a heterozygous knockout. We have added the above content in the methods section. Your insightful comment has significantly strengthened our reporting rigor. We hope this clarification alleviates your concerns regarding potential selection bias or overlooked phenotypes.

      Minor/Word choice issues:

      All the figure legends need to be improved so that each figure can be interpreted without having to refer to the text.

      Thank you for your valuable comments. We have made modifications to the legend of each graphic, as detailed in the main text.

      Abstract: line 24: use refraction, not "vision"

      Thank you for your valuable comments. The “Vision” has been changed to “refraction”.

      Line 28: re-word "density of bipolar cell-labeled proteins" Do the authors mean density of bipolar cells? Or certain proteins were less abundant in bipolar cells?

      Thank you for your rigorous review of this terminology. We acknowledge the need to clarify the precise meaning of the phrase "density of bipolar cell-labeled proteins." In the original text, this term specifically refers to the expression abundance of the bipolar cell-specific marker protein PKCα, which was identified using immunofluorescence labeling techniques. Specifically: We utilized PKCα (a bipolar cell marker) to label bipolar cell populations. The "density" was quantified by measuring the fluorescence signal intensity per unit area in confocal microscopy images, rather than direct cell counting. This metric reflects changes in the expression of the specific marker protein (PKCα) within bipolar cells, which indirectly correlates with alterations in bipolar cell populations. To address ambiguity, we have revised the terminology throughout the manuscript to "bipolar cell-labelled protein PKCα immunofluorescence abundance".

      Additionally, since fluorescence intensity quantification is inherently semi-quantitative, we have included Western blot results for PKCα in the revised manuscript (Figure 3I, J) to validate the expression changes observed via immunofluorescence. We sincerely appreciate your feedback, which has significantly improved the precision of our manuscript.

      Line 45: axial length, not ocular axis

      Thank you for your valuable comments. The “ocular axis” has been changed to “axial length”.

      Lines73-75: confusing

      Thank you for your valuable comments. The relevant content has been modified to “Multiple zinc finger protein genes (e.g., ZNF644, ZC3H11B, ZFP161, ZENK) are associated with myopia or HM. Of these, ZC3H11B (a human homolog of ZC3H11A) and five GWAS loci (Schippert et al., 2007; Shi et al., 2011; Szczerkowska et al., 2019; Tang et al., 2020; Wang et al., 2004) correlate with AL elongation or HM severity. Proteomic studies further suggest ZC3H11A involvement in the TREX complex, implicating RNA export mechanisms in myopia pathogenesis”

      Line 138: what is dark 3.0 and dark 10.0

      Thank you for your valuable comments. The relevant content has been modified to “Upon dark adaptation, b-wave amplitudes in seven-week-old Het-KO mice were significantly lower at dark 3.0 (0.48 log cd·s/m²) and dark 10.0 (0.98 log cd·s/m²) compared to WT mice.” A detailed description has been added to the main text methods.

      Line 171-175: the GO terms of "biological processes" and "molecular functions" are so broad as to be meaningless.

      Thank you for your valuable comments. The relevant content has been modified to “GO enrichment analysis revealed significant enrichment of differentially expressed genes in the following functions: Zinc ion transmembrane transport (GO:0071577) within metal ion homeostasis, associated with retinal photoreceptor maintenance (Ugarte and Osborne, 2001), RNA biosynthesis and metabolism (GO:0006366) in transcriptional regulation, potentially influencing ocular development, negative regulation of NF-κB signaling (GO:0043124) in inflammatory modulation, a pathway involved in scleral remodelling (Xiao et al., 2025), calcium ion binding (GO:0005509), critical for phototransduction (Krizaj and Copenhagen, 2002), zinc ion transmembrane transporter activity (GO:0005385), participating in retinal zinc homeostasis (Figure 5C and D).”

      Line 257-259: which results indicated loss of Zc3h11a inhibited translocation of IκBα from nucleus to cytoplasm? Results of this study, or the previously referenced study?

      We sincerely appreciate your critical inquiry regarding the mechanistic relationship between Zc3h11a deficiency and IκBα translocation. We are grateful for this opportunity to clarify this important point. The findings regarding Zc3h11a-mediated regulation of IκBα mRNA nuclear export and its impact on NF-κB signaling originate from the study by Darweesh et al. The key experimental evidence demonstrates that: The depletion of Zc3h11a leads to nuclear retention of IκBα mRNA, resulting in failure to maintain normal levels of cytoplasmic IκBα mRNA and protein. This defect in IκBα mRNA export disrupts the essential inhibitory feedback loop on NF-κB activity, causing hyperactivation of this pathway. This manifests as upregulation of numerous innate immune-related mRNAs, including IL-6 and a large group of interferon-stimulated genes.While our study references this mechanism to explain the observed NF-κB dysregulation in Zc3h11a Het-KO mice, the specific nuclear export mechanism was indeed elucidated by Darweesh et al. The reference has been inserted into the corresponding position in the main text. Importantly, our research extends these previous molecular insights into the phenotypic context of myopia.

      We sincerely regret any ambiguity in the original text and deeply appreciate your rigorous approach in ensuring proper attribution of these fundamental findings. Your comment has significantly improved the clarity and accuracy of our manuscript.

      Figure 6 shows decrease of both mRNA and protein expression, but nothing about translocation.

      Thank you for your valuable comments. The research results of Darweesh et al. showed that Zc3h11a protein plays a role in regulation of NF-κB signal transduction. Depletion of Zc3h11a resulted in enhanced NF-κB mediated signaling, with upregulation of numerous innate immune related mRNAs, including IL-6 and a large group of interferon-stimulated genes. IL-6 upregulation in the absence of the Zc3h11a protein correlated with an increased NF-κB transcription factor binding to the IL-6 promoter and decreased IL-6 mRNA decay. The enhanced NF-κB signaling pathway in Zc3h11a deficient cells correlated with a defect in IκBα inhibitory mRNA and protein accumulation. Upon Zc3h11a depletion The IκBα mRNA was retained in the cell nucleus resulting in failure to maintain normal levels of the cytoplasmic IκBα mRNA and protein that is essential for its inhibitory feedback loop on NF-κB activity. These findings demonstrate that ZC3H11A can regulate the NF-κB pathway by controlling the translocation of IκBα mRNA, a mechanism that was indeed elucidated by Darweesh et al. We sincerely apologize for any lack of clarity in our original description and have now inserted the appropriate reference in the relevant section of the main text.

      We deeply appreciate your valuable comments in identifying this ambiguity in our manuscript, which have significantly improved the accuracy and clarity of our work.

      Line 283: what do you mean "may confer embryonic lethality"? Were they embryonic lethal or not?

      We sincerely appreciate your critical request for clarification. Our experimental data from 15 pregnancies of Zc3h11a Het-KO mice intercrosses (n = 15 litters) conclusively confirmed the absence of homozygous knockout (Homo-KO) pups at birth. These findings align with the embryonic lethality of Zc3h11a homozygous deletion as reported by Younis et al. We fully acknowledge the ambiguity in our original phrasing and have revised the text to:“Second, Zc3h11a homozygous KO (Homo-KO) mice were not obtained in our study because homozygous deletion of exons confer embryonic lethality.”Your vigilance in ensuring terminological precision has greatly strengthened the rigor of our manuscript. We hope this clarification fully resolves your concerns.

      Line 338: What is meant that Het-KO mice were constructed at 4 weeks of age? Do these mice not have a germline mutation?

      Thank you for your valuable comments. We have revised the following content: “The germline heterozygous Zc3h11a knockout (Het-KO) mice were generated by CRISPR/Cas9-mediated gene editing at the embryonic stage on a C57BL/6J background, provided by GemPharmatech Co., Ltd (Nanjing, China). Phenotypic analyses were initiated when the mice reached four weeks of age.”

      Line 346-347: how many mice were excluded due to having small eyes or ocular lesions? The methods section should state how refraction and ocular biometrics were measured.

      Thank you for your valuable comments. We have added or revised the following content: “To exclude potential confounding effects of spontaneous ocular developmental abnormalities, a total of 7 mice (4 Het-KO and 3 WT) with small eyes or ocular lesions were excluded from the observation cohort. These anomalies were consistent with the baseline incidence of spontaneous malformations observed in historical colony data of wild-type C57BL/6J mice (approximately 11%), and were not attributed to the Zc3h11a heterozygous knockout.

      The methods for measuring refraction and ocular biometrics are as follows and have been added to the original method. Refractive measurements were performed by a researcher blinded to the genotypes. Briefly, in a darkroom, mice were gently restrained by tail-holding on a platform facing an eccentric infrared retinoscope (EIR) (Schaeffel et al., 2004; Zhou et al., 2008a). The operator swiftly aligned the mouse position to obtain crisp Purkinje images centered on the pupil using detection software (Schaeffel et al., 2004), enabling axial measurements of refractive state and pupil size. Three repeated measurements per eye were averaged for analysis. The anterior chamber (AC) depth, lens thickness, vitreous chamber (VC) depth, and axial length (AL) of the eye were measured by real-time optical coherence tomography (a custom built OCT) (Zhou et al., 2008b). In simple terms, after anesthesia, each mouse was placed in a cylindrical holder on a positioning stage in front of the optical scanning probe. A video monitoring system was used to observe the eyes during the process. Additionally, by detecting the specular reflection on the corneal apex and the posterior lens apex in the two dimensional OCT image, the optical axis of the mouse eye was aligned with the axis of the probe. Eye dimensions were determined by moving the focal plane with a stepper motor and recording the distance between the interfaces of the eyes. Then, using the designed MATLAB software and appropriate refractive indices, the recorded optical path length was converted into geometric path length. Each eye was scanned three times, and the average value was taken.”

      Line 428: what age retinas

      Thank you for your meticulous attention to the experimental design details. Regarding the age of retinal samples, we have clarified the following in the revised manuscript:" Retinas were harvested from four-week-old mice for RNA sequencing." This revision enhances the transparency and reproducibility of our methodology. We deeply appreciate your rigorous review.

      Figure 3 D-F: these images are too small to adequately assess, please show at higher magnification. Are there fewer bipolar cells, or just decreased expression of PKC? From these images, expression of ZC3H11A does not appear decreased, but the retina appears thinner. Is that true, or are these poorly matched sections?

      Thank you for your professional insights regarding image quality and data interpretation. Your rigorous review has significantly enhanced the scientific rigor of our study. We hereby address your concerns point by point: The images in Figures 3D-F were acquired using a Zeiss LSM880 confocal microscope with a 10x eyepiece and 20x objective lens, a standard magnification for retinal section imaging that balances cellular resolution with full-thickness structural preservation. We quantified PKCα immunofluorescence intensity (a bipolar cell-specific marker) to assess changes in bipolar cell populations, rather than direct cell counting. This metric reflects PKCα expression abundance as a proxy for bipolar cell alterations (Figure 3H). To clarify terminology, we have revised the text to "bipolar cell-labelled protein PKCα immunofluorescence abundance" and detailed the methodology in the revised Methods section. Recognizing the semi-quantitative nature of fluorescence intensity analysis, we supplemented these data with Western blot results confirming reduced PKCα protein levels (Figure 3I). Zc3h11a expression was validated both by immunofluorescence intensity (Figure 3G) and Western blot (Figures 6F, H) quantification, confirming reduced expression in Zc3h11a Het-KO retinas. The apparent "retinal thinning" observed in histology sections stems from technical artifacts during tissue processing (fixation, dehydration, sectioning), not biological differences. HE staining, which better preserves sample morphology, showed no structural or thickness differences between Zc3h11a Het-KO mice and wild-type mice (Supplementary Figure 2).

      Your expert feedback has driven us to establish a more robust validation framework. We believe the revised data now more accurately reflect the biological reality and sincerely hope these improvements meet your approval.

      Figure 3G-J: Relative fluorescence intensity of immunohistochemistry is not a valid measure of protein expression.

      We sincerely appreciate your thorough review and valuable comments regarding the immunofluorescence quantification method in Figures 3G-J. In response to your concern that "relative fluorescence intensity is not an effective quantitative measure of protein expression," we have implemented the following improvements to our analysis and validation: To ensure result reliability, all immunofluorescence experiments followed strict protocols: experimental and control samples were fixed, stained, and imaged in the same batch to eliminate inter-batch variability. Imaging was performed using a Zeiss LSM 880 confocal microscope with identical parameters, and the relative fluorescence intensity of specific signals per unit area was measured and statistically analyzed using ZEN software. We fully acknowledge the semi-quantitative nature of relative fluorescence intensity measurements. Therefore, we validated key differentially expressed proteins using Western blot analysis: The Western blot results for Zc3h11a (Figures 6F, H) were completely consistent with the relative fluorescence intensity trends (Figure 3G). Additionally, the newly included Western blot data for PKCα (Figure 3 I) further confirmed the reliability of our relative fluorescence intensity quantification. Your expert advice has significantly enhanced the rigor of our study. Should any additional data or clarification be required, we would be pleased to provide further support.

      Figure 4: what are the arrows pointing at? This should be in the Figure legend. What is MB? Why are there no scale bars? What is difference between E and F, not clear from legend.

      We sincerely appreciate your thorough review of Figure 4 and your valuable suggestions. In response to your concerns, we have carefully examined and improved the relevant content with the following modifications and clarifications: We sincerely apologize for not clearly indicating the arrow annotations in the original figure legend. In the revised version, we have provided detailed explanations for the arrow indicators: black arrows indicate perinuclear space dilation, blue arrows indicate cytoplasmic edema, and red arrows indicate disorganized and loosely arranged membrane discs. The updated legend has been clearly marked below Figure 4 in the main text. MB represents membrane discs, which are critical subcellular structures in the outer segments of retinal photoreceptor cells (rods and cones). They are responsible for light signal capture and transduction (containing visual pigments such as rhodopsin). The structural integrity of MB is essential for normal visual function. The scale bars in the original figures were located in the lower right corner of each subpanel, with specific parameters as follows: Figures 4A and B: magnification ×1000, scale bar 10 μm; Figures 4C and D: magnification ×700, scale bar 20 μm; Figures 4E and G: magnification ×2000, scale bar 5 μm; Figures 4F and H: magnification ×7000, scale bar 2 μm. Both Figures 4E and 4F show electron microscopy images of membrane discs (MB) in wild-type mouse photoreceptor cells. The only difference lies in the magnification: Figure 4E (×2000) demonstrates the overall arrangement pattern of membrane discs, while Figure 4F (×7000) focuses on ultrastructural details of the membrane discs (such as structural integrity). We have thoroughly checked the consistency between the figures and text, and have supplemented detailed legend descriptions in the main text. Once again, we sincerely appreciate your rigorous review, which has significantly enhanced the scientific rigor and readability of our study. Should you have any further suggestions, we would be happy to incorporate them.

      Figure 5A: Why such a large y-axis? Figure legend does not match figure

      We sincerely appreciate your careful review of Figure 5A and your valuable suggestions regarding the figure details. In response to your concerns, we have thoroughly examined and improved the relevant content as follows: The Y-axis of the volcano plot represents -log₁₀(p-value), where the magnitude of the values reflects statistical significance. Our RNA-seq data underwent rigorous multiple testing correction, and the adjusted p-values for some genes were extremely small, resulting in large values after -log₁₀ transformation. We have re-examined the data distribution and confirmed that the expanded Y-axis range is solely due to a small number of highly significant genes (as shown in the figure, the majority of genes remain clustered in the lower half of the Y-axis). This result accurately reflects the true data characteristics.

      We sincerely apologize for the inadvertent error in the original labeling of "Up/Down" in the figure legend. This has now been corrected, and we strictly adhere to the following threshold criteria: Significantly upregulated (Up): adjusted p-value < 0.05 and log₂(FC) ≥ 1. Significantly downregulated (Down): adjusted p-value < 0.05 and log₂(FC) ≤ -1. To ensure the reliability of our conclusions, we have rechecked the raw data, statistical analysis, and visualization process. We confirmed that all significant genes strictly meet the above threshold criteria and that the visualization accurately reflects the true results. The revised figure has been updated in the manuscript as Figure 5A. We deeply appreciate your valuable feedback, which has helped us correct the errors in the figure and improve its accuracy and readability.

      Figure 6F: Based on the western blot, only Zc3h11a appears different.

      Thank you for your careful evaluation of the Western blot data in Figure 6F. We fully understand your concerns regarding the visual differences in PI3K and p-AKT/AKT bands and appreciate the opportunity to clarify the quantitative methodology and biological significance of these findings. Below we provide a detailed explanation of the experimental design and data analysis.

      First, the data for each group were derived from retinal samples of three independent mice, with all experiments performed in parallel to control for technical variability. Image analysis was conducted using ImageJ software with standardized settings for grayscale quantification. Zc3h11a and PI3K levels were normalized to GAPDH as an internal reference, while p-AKT levels were calculated as a ratio to total AKT. The results showed that Zc3h11a protein levels were significantly reduced (p < 0.01, Figures 6F and H), consistent with the expected effects of heterozygous knockout, with good agreement between visual and statistical results. For PI3K and p-AKT/AKT, the bands appeared visually similar due to: The nonlinear nature of Western blot chemiluminescence signals in the saturation range, which compresses subtle quantitative differences in the images; the fact that p-AKT represents only 5-15% of the total AKT pool, making small proportional changes difficult to discern visually. However, it is important to note that both PI3K and p-AKT/AKT showed statistically significant differences between groups (p < 0.001 and p < 0.01, respectively; Figures 6G and I). Furthermore, signal transduction pathways exhibit cascade amplification effects - in the PI3K-AKT pathway, even small changes in upstream proteins can produce significant downstream effects (e.g., NF-κB activation) through kinase cascades (Figure 6J). Additionally, our RNA-Seq results revealed activation of the PI3K-AKT signaling pathway in Zc3h11a Het-KO mice (Figure 5D), and the qRT-PCR results were consistent with the western blot results (Figure 6A-C). Your expert comments have prompted us to present these data differences with greater biological rigor. Although the visual differences are subtle, based on statistical significance, pathway characteristics, and RNA sequencing, and qRT-PCR data, we believe these changes have biological relevance. We sincerely appreciate your commitment to data rigor and respectfully request your recognition of both the experimental results and the scientific logic of this study.

      Figure 8: What is the role of ZC3H11A in this figure? Are the authors proposing that ZC3H11A regulates the translation of IκBα? They have not shown any evidence of that.

      Thank you for your insightful exploration of the role of ZC3H11A in Figure 8. We appreciate your critical review and hope to elucidate the mechanistic framework behind our findings. In Figure 8, Zc3h11a is depicted as a regulator of IκBα mRNA nucleocytoplasmic transport, a mechanism originally elucidated by Darweesh et al. Their studies demonstrated that Zc3h11a binds to IκBα mRNA and promotes its nuclear export. Loss of Zc3h11a results in nuclear retention of IκBα mRNA, leading to reduced cytoplasmic IκBα protein levels and subsequent hyperactivation of the NF-κB pathway. While the specific nuclear export mechanism has been elucidated by Darweesh et al., our study demonstrates that Zc3h11a haploinsufficiency results in decreased IκBα mRNA and protein levels in the retina (Figure 7), linking Zc3h11a haploinsufficiency to NF-κB pathway dysregulation in myopia and highlighting that these molecular insights can be extended to a new pathological context (myopia). Your critical comments have enhanced the clarity of our mechanistic concepts and we hope that these descriptions will demonstrate the importance of ZC3H11A as a new candidate gene for myopia.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      eLife assessment

      This study presents valuable findings about synaptic connectivity among subsets of unipolar brush cells (UBCs), a specialized interneuron primarily located in the vestibular lobules of the cerebellar cortex. The evidence supporting the claims are interesting although incomplete in some areas. The work will be of interest to cerebellar neuroscientists as well as those focussed on synaptic properties and mechanisms. Although several compelling pieces of data were presented, substantial work remains to be conducted in order for the hypothesis and predictions of the manuscript to confirm how these factors play out in the actual brain circuit and how it would impact the processing of feedback or feedforward activity that would be required to promote behavior.

      Public Reviews:

      Reviewer #1 (Public Review):

      The manuscript by Hariani et al. presents experiments designed to improve our understanding of the connectivity and computational role of Unipolar Brush Cells (UBCs) within the cerebellar cortex, primarily lobes IX and X. The authors develop and cross several genetic lines of mice that express distinct fluorophores in subsets of UBCs, combined with immunocytochemistry that also distinguishes subtypes of UBCs, and they use confocal microscopy and electrophysiology to characterize the electrical and synaptic properties of subsets of so-labelled cells, and their synaptic connectivity within the cerebellar cortex. The authors then generate a computer model to test possible computational functions of such interconnected UBCs.

      Using these approaches, the authors report that:

      1. GRP-driven TDtomato is expressed exclusively in a subset (20%) of ON-UBCs, defined electrophysiologically (excited by mossy fiber afferent stimulation via activation of UBC AMPA and mGluR1 receptors) and immunocytochemically by their expression of mGluR1.

      2. UBCs ID'd/tagged by mCitrine expression in Brainbow mouse line P079 is expressed in a similar minority subset of OFF-UBCs defined electrophysiologically (inhibited by mossy fiber afferent stimulation via activation of UBC mGluR2 receptors) and immunocytochemically by their expression of Calretinin. However, such mCitrine expression was also detected in some mGluR1 positive UBCs, which may not have shown up electrophysiologically because of the weaker fluorophore expression without antibody amplification.

      3. Confocal analysis of crossed lines of mice (GRP X P079) stained with antibodies to mGluR1 and calretinin documented the existence of all possible permutations of interconnectivity between cells (ON-ON, ON-OFF, OFF-OFF, OFF-ON), but their overall abundance was low, and neither their absolute or relative abundance was quantified.

      4. A computational model (NEURON ) indicated that the presence of an intermediary UBC (in a polysynaptic circuit from MF to UBC to UBC) could prolong bursts (MF-ON-ON), prolong pauses (MF-ON-OFF), cause a delayed burst (MF-OFF-OFF), cause a delayed pause (MF-OFF-ON) relative to solely MF to UBC synapses which would simply exhibit long bursts (MF-ON) or long pauses (MF-OFF).

      The authors thus conclude that the pattern of interconnected UBCs provides an extended and more nuanced pattern of firing within the cerebellar cortex that could mediate longer lasting sensorimotor responses.

      The cerebellum's long known role in motor skills and reflexes, and associated disorders, combined with our nascent understanding of its role in cognitive, emotional, and appetitive processing, makes understanding its circuitry and processing functions of broad interest to the neuroscience and biomedical community. The focus on UBCs, which are largely restricted to vestibular lobes of the cerebellum reduces the breadth of likely interest somewhat. The overall design of specific experiments is rigorous and the use of fluorophore expressing mouse lines is creative. The data that is presented and the writing are clear. However, despite some additional analysis in response to the initial review, the overall experimental design still has issues that reduce overall interpretation (please see specific issues for details), which combined with a lack of thorough analysis of the experimental outcomes undermines the value of the NEURON model results and the advance in our understanding of cerebellar processing in situ (again, please see specific issues for details).

      Specific issues:

      1. All data gathered with inhibition blocked. All of the UBC response data (Fig. 1) was gathered in the presence of GABAAR and Glycine R blockers. While such an approach is appropriate generally for isolating glutamatergic synaptic currents, and specifically for examining and characterizing monosynaptic responses to single stimuli, it becomes problematic in the context of assaying synaptic and action potential response durations for long lasting responses, and in particular for trains of stimuli, when feed-forward and feed-back inhibition modulates responses to afferent stimulation. I.e. even for single MF stimuli, given the >500ms duration of UBC synaptic currents, there is plenty of time for feedback inhibition from Golgi cells (or feedforward, from MF to Golgi cell excitation) to interrupt AP firing driven by the direct glutamatergic synaptic excitation. This issue is compounded further for all of the experiments examining trains of MF stimuli. Beyond the impact of feedback inhibition on the AP firing of any given UBC, it would also obviously reduce/alter/interrupt that UBC's synaptic drive of downstream UBCs. This issue fundamentally undermines our ability to interpret the simulation data of Vm and AP firing of both the modeled intermediate and downstream UBC, in terms of applying it to possible cerebellar cortical processing in situ.

      The goal of Figure 1 was to determine the cell types of labeled UBCs in transgenic mouse lines, which is determined entirely by their synaptic responses to glutamate (Borges-Merjane and Trussell, 2015). Thus, blocking inhibition was essential to produce clear results in the characterization of GRP and P079 UBCs. While GABAergic/glycinergic feedforward and feedback inhibition is certainly important in the intact circuit, it was not our intention, nor was it possible, to study its contribution in the present study. Leaving inhibition unblocked does not lead to a physiologically realistic stimulation pattern in acute brain slices, because electrical stimulation produces synchronous excitation and inhibition by directly exciting Golgi cells, rather than their synaptic inputs. The main inhibition that UBCs receive that are crucial to determining burst or pause durations is not via GABA/glycine, but instead through mGluR2, which lasts for 100-1000s of milliseconds. The main excitation that drives UBC firing is mGluR1 and AMPA, which both last 100-1000s of milliseconds. Thus, these large conductances are unlikely to be significantly shaped by 1-10 ms IPSCs from feedforward and feedback GABA/glycine inhibition. Recent studies that examined the duration of bursting or pausing in UBCs had inhibition blocked in their experiments, presumably for the reasons outlined above (Guo et al., 2021; Huson et al., 2023).

      Below is an example showing the synaptic currents and firing patterns in an ON UBC before and after blocking inhibition. The GABA/glycinergic inhibition is fast, occurs soon after the stimuli and has little to no effect on the slow inward current that develops after the end of stimulation, which is what drives firing for 100s of milliseconds.

      Author response image 1.

      Example showing small effect of GABAergic and glycinergic inhibition on excitatory currents and burst duration. A) Excitatory postsynaptic currents in response to train of 10 presynaptic stimuli at 50 Hz before (black) and after (Grey) blocking GABA and glycine receptors. The slow inward current that occurs at the end of stimulation is little affected. B) Expanded view of the synaptic currents evoked during the train of stimuli. GABA/glycine receptors mediate the fast outward currents that occur immediately after the first couple stimuli. C) Three examples of the bursts caused by the 50 Hz stimulation in the same cell without blocking GABA and glycine receptors. D) Three examples in the same cell after blocking GABA and glycine receptors.

      The authors' response to the initial concern is (to paraphrase), "its not possible to do and its not important", neither of which are soundly justified.

      As stated in the original review, it is fully understandable and appropriate to use GABAAR/GlycineR antagonists to isolate glutamatergic currents, to characterize their conductance kinetics. That was not the issue raised. The issue raised was that then using only such information to generate a model of in situ behavior becomes problematic, given that feedback and lateral inhibition will sculpt action potential output, which of course will then fundamentally shape their synaptic drive of secondary UBCs, which will be further sculpted by their own inhibitory inputs. This issue undermines interpretation of the NEURON model.

      The argument that taking inhibition into account is not possible because of assumed or possible direct electrical excitation of Golgi cells is confusing for two interacting reasons. First, one can certainly stimulate the mossy fiber bundle to get afferent excitation of UBCs (and polysynaptic feedback/lateral inhibitory inputs) without directly stimulating the Golgi cells that innervate any recorded UBC. Yes, one might be stimulating some Golgi cells near the stimulating electrode, but one can position the stimulating electrode far enough down the white matter track (away from the recorded UBC), such that mossy fiber inputs to the recorded UBC can be stimulated without affecting Golgi cells near or synaptically connected to the recorded UBC. Moreover, if the argument were true, then presumably the stimulation protocol would be just as likely to directly stimulate neighboring UBCs, which then drove the recorded UBC's responses. Thus, it is both doable and should be ensured that stimulation of the white matter is distant enough to not be directly activating relevant, connected neurons within the granule cell layer.

      Finally, the authors present three examples of UBC recordings with and without inhibitory inputs blocked, and state "Thus, these large conductances are unlikely to be significantly shaped by 1-10 ms IPSCs from feedforward and feedback GABA/glycine inhibition" and "GABA/glycinergic inhibition...has little to no effect on the slow inward current that develops after the end of stimulation". This response reflects on original concerns about lack of quantification or consideration of important parameters. In particular, while the traces with and without inhibition are qualitatively similar, quantitative considerations indicate otherwise. First, unquantified examples are not adequate to drive conclusions. Regardless, the main issue (how inhibition affects actual responses in situ) is actually highlighted by the authors current clamp recordings of UBC responses, before and after blocking inhibition. The output response is dramatically different, both at early and late time points, when inhibition is blocked. Again, a lack of quantification (of adequate n's) makes it hard to know exactly how important, but quick "eye ball" estimates of impact include: 1) a switch from only low frequency APs initially (without inhibition blocked) to immediate burst of high frequency APs (high enough to not discern individual APs with given figure resolution) when inhibition is blocked, 2) Slow rising to a peak EPSP, followed by symmetrical return to baseline (without inhibition blocked) versus immediate rise to peak, followed by prolonged decay to baseline (with inhibition blocked), 3) substantially shorter duration (~34% shorter) secondary high frequency burst (individual APs not discernible) of APs (with inhibition blocked versus without inhibition blocked), and 4) substantial reduction in number of long delayed APs (with inhibition blocked versus without inhibition blocked). Thus, clearly, feedback/lateral inhibition is actually sculpting AP output at all phases of the UBC response to trains of afferent stimulations. Importantly, the single voltage clamp trace showing little impact of transient IPSCs on the slow EPSC do not take into account likely IPSC influences on voltage-activated conductances that would not occur in voltage-clamp recordings but would be free to manifest in current clamp, and thereby influence AP output, as observed.

      So again, our ability to understand how interconnected UBCs behave in the intact system is undermined by the lack of consideration and quantification of the impact of inhibition, and it not being incorporated into the model. At the very least a strong proviso about lack of inclusion of such information, given the authors' data showing its importance in the few examples shown, should be added to the discussion.

      Thank you for this substantive explanation. Your points are well described and we agree that the single experiment shown is not strong evidence for a lack of importance of Golgi cell inhibition, especially on the temporal dynamics of spiking. Previous work has clearly shown that Golgi cells have several important roles in shaping the activity of the granular layer, including affecting the temporal dynamics of granule cell spikes. However, the work presented here focuses on the feedforward circuitry of UBCs and the large inward and large outward glutamatergic currents that drive spiking or pausing for 100s of milliseconds. Our model does not focus on the aspects that are most sensitive to Golgi cell inhibition, including timing of the first spikes in the UBC’s response. Nor does our model focus on short term plasticity, which we thought was reasonable because the slow currents in UBCs are quite insensitive to the temporal characteristics of glutamate release (See the example in the previous rebuttal). Our model does not include long term plasticity, which is also affected by Golgi cells. For these reasons we agree that the model presented does not explain how feedforward UBC circuits might “play out in the actual brain circuit and how it would impact the processing of feedback or feedforward activity that would be required to promote behavior.” We have included a new paragraph in the discussion clarifying the limitations of this study and the model, reproduced below.

      "Limitations of the model

      Here we addressed how feedforward glutamatergic excitation and inhibition is transformed from one UBC to the next depending on their subtype. The model focuses on AMPA receptor mediated excitation and mGluR2 mediated inhibition. One limitation of the model is that it does not consider feedforward and lateral inhibition from Golgi cells, which shape the spiking of UBCs in response to afferent stimulation. Golgi cells receive mossy fiber input and inhibit UBCs through their corelease of GABA and glycine (Dugue et al., 2005; Rousseau et al., 2012). Golgi cells control the temporal dynamics of the firing of granule cells as well as their gain (Rossi et al., 2003; Kanichay and Silver, 2008) and are critical to larger scale dynamics of the cerebellar cortical network (D‘Angelo, 2008). Purkinje cells provide additional inhibition to ON UBCs that could influence how UBC circuits transform signals (Guo et al., 2016). A more complex model that implements Golgi cells and other critical circuit elements will be needed to investigate the role of feedforward UBC circuits in cerebellar network dynamics and motor behaviors in vivo."

      1. No consideration for involvement of polysynaptic UBCs driving UBC responses to MF stimulation in electrophysiology experiments. Given the established existence (in this manuscript and Dino et al. 2000 Neurosci, Dino et al. 2000 ProgBrainRes, Nunzi and Mugnaini 2000 JCompNeurol, Nunzi et al. 2001 JCompNeurol) of polysynaptic connections from MFs to UBCs to UBCs, the MF evoked UBC responses established in this manuscript, especially responses to trains of stimuli could be mediated by direct MF inputs, or to polysynaptic UBC inputs, or possibly both (to my awareness not established either way). Thus the response durations could already include extension of duration by polysynaptic inputs, and so would overestimate the duration of monosynaptic inputs, and thus polysynaptic amplification/modulation, observed in the NEURON model.

      We are confident that the synaptic responses shown are monosynaptic for several reasons. UBCs receive a single mossy fiber input on their dendritic brush, and thus if our stimulation produces a reliable, short-latency response consistent with a monosynaptic input, then there is not likely to be a disynaptic input, because the main input is accounted for by the monosynaptic response. In all cells included in our data set, the fast AMPA receptor-mediated currents always occurred with short latency (1.24 ± 0.29 ms; mean ± SD; n = 13), high reliability (no failures to produce an EPSC in any of the 13 GRP UBCs in this data set), and low jitter (SD of latency; 0.074 ± 0.046 ms; mean ± SD; n = 13). These measurements have been added to the results section.

      In some rare cases, we did observe disynaptic currents, which were easily distinguishable because a single electrical stimulation produced a burst of EPSCs at variable latencies. Please see example below. These cases of disynaptic input, which have been reported by others (Diño et al., 2000; Nunzi and Mugnaini, 2000; van Dorp and De Zeeuw, 2015) support the conclusion that UBCs receive input from other UBCs.

      Author response image 2.

      Example of GRP UBC with disynaptic input. Three examples of the effect of a single presynaptic stimulus (triangle) in a GRP UBC with presumed disynaptic input. Note the variable latency of the first evoked EPSC, bursts of EPSCs, and spontaneous EPSCs.

      Author response: "UBCs receive a single mossy fiber input on their dendritic brush, and thus if our stimulation produces a reliable, short-latency response consistent with a monosynaptic input, then there is not likely to be a disynaptic input."

      This statement is not congruent with the literature, with early work by Mugnaini and colleagues (Mugnaini et al. 1994 Synapse; Mugnaini and Flores 1994 J. Comp. Neurol.) indicating that UBCs are innervated by 1-2 mossy fibers, which are as likely other UBC terminals as MFs. This leaves open the possibility that so called monosynaptic responses do, as originally suggested, already include polysynaptic feedforward amplification of duration. While the authors also indicate that isolated disynaptic currents can be observed when they occur in isolation, a careful examination and objective documentation of "monosynaptic" responses would address this issue. Presumably, if potential disynaptic UBC inputs occur during a monosynaptic MF response, it would be detected as an abrupt biphasic inward/outward current, due to additional AMPA receptor activation but further desensitization of those already active (as observed by Kinney et al. 1997 J. Neurophysiol: "The delivery of a second MF stimulus at the peak of the slow EPSC evoked a fast EPSC of reduced amplitude followed by an undershoot of the subsequent slow current"). If such polysynaptic inputs are truly absent and are "rare" in isolation, some estimation of how common or not such synaptic amplification is, would improve our understanding of the overall significance of these inputs.

      We are confident that these currents are monosynaptic, because, as suggested, we carefully analyzed the latency, jitter and reliability, which was added to the previous revision. The latency and jitter are strong (quantitative) evidence that the first EPSC evoked was monosynaptic. While some UBCs have been reported to have multiple brushes, or brushes that branch and may contact multiple mossy fibers, or receive synaptic input onto their somas, these cases are rare in our experience in this age of mouse and there is no evidence for them in this dataset. For every trace we made a careful examination and documented that no delayed EPSCs were present. The presence of delayed EPSCs (or ‘abrupt biphasic inward/outward currents’ as described in Kinney et al 1997) would indeed suggest the presence of disynaptic activity or multiple inputs to the UBC, but these would be easily identified, even during a stimulation train. For these reasons we feel that we have established that polysynaptic feedforward amplification of duration is not present

      We agree that the monosynaptic responses could be due to the stimulation of UBC axons. However, the absence of delayed EPSCs again suggests that if stimulation of a presynaptic UBC axon was producing the currents in the recorded UBC, then the axon was severed from the soma and AIS, because this region is necessary for the cell to produce more than a single spike per stimulation. We added a sentence describing the potential for the monosynaptic EPSCs to be due to the stimulation of presynaptic UBC axons.

      Your point is well taken that a discussion of how common or rare these UBC to UBC connections is necessary to more clearly explain how we interpret their significance and we have expanded the paragraph in the discussion that does so. Thank you for this suggestion.

      1. Lack of quantification of subtypes of UBC interconnectivity. Given that it is already established that UBCs synapse onto other UBCs (see refs above), the main potential advance of this manuscript in terms of connectivity is the establishment and quantification of ON-ON, ON-OFF, OFF-ON, and OFF-OFF subtypes of UBC interconnections. But, the authors only establish that each type exists, showing specific examples, but no quantification of the absolute or relative density was provided, and the authors' unquantified wording explicitly or implicitly states that they are not common. This lack of quantification and likely small number makes it difficult to know how important or what impact such synapses have on cerebellar processing, in the model and in situ.

      As noted by the reviewer, the connections between UBCs were rare to observe. We decided against attempting to quantify the absolute or relative density of connections for several reasons. A major reason for rare observations of anatomical connections between UBCs is likely due to the sparse labeling. First, the GRP mouse line only labels 20% of ON UBCs and we are unable to test whether postsynaptic connectivity of GRP ON UBCs is the same as that of the rest of the population of ON UBCs that are not labeled in the GRP mouse line. Second, the Brainbow reporter mouse only labels a small population of Cre expressing cells for unknown reasons. Third, the Brainbow reporter expression was so low that antibody amplification was necessary, which then limited the labeled cells to those close to the surface of the brain slices, because of known antibody penetration difficulties. Therefore, we refrained from estimating the density of these connections, because each of these variables reduced the labeling to unknown degrees and we reasoned that extrapolating our rare observations to the total population would be inaccurate.

      A paper that investigated UBC connectivity using organotypic slice cultures from P8 mice suggests that 2/3 of the UBC population receives UBC input, based on the observation that 2/3 of the mossy fibers did not degenerate as would be expected after 2 days in vitro if they were severed from a distant cell body (Nunzi and Mugnaini, 2000). It remains to be seen if this high proportion is due to the young age of these mice or is also the case in adult mice. Even if these connections are indeed rare, they are expected to have profound effects on the circuit, as each UBC has multiple mossy fiber terminals (Berthie and Axelrad, 1994), and mossy fiber terminals are estimated to contact 40 granule cells each (Jakab and Hamori, 1988). We have added a comment regarding this point to the discussion.

      To address this issue, the authors added the following text to the discussion section: "We did not estimate the density of these UBC to UBC connections, because the sparseness of labeling using these approaches made an accurate calculation impossible. Previous work using organotypic slice cultures from P8 mice estimated that 2/3 of the UBC population receives input from other UBCs (Nunzi & Mugnaini, 2000), although it is unclear whether this is the case in older mice."

      While accurate, the addition doesn't really address the situation, which is that apparently the reported connections are rare. Adding the information about 2/3 of UBCs having UBC inputs in culture, implies the opposite might be true (i.e. that they might be quite common), which is in contrast to the authors' data, so should be reworded for clarity, which should also incorporate the considerations covered in point #2 above. I.e. if the authors do establish that none of their recordings have polysynaptic inputs, and if they determine that the number of cells that showed isolated di-synaptic inputs is indeed rare, then it suggests that these specific polysynaptic connections are in fact rare.

      Thank you for pointing this out. We agree that adding this information is somewhat contradictory to our results and we have added more to this section in the discussion, provided below.

      Anatomically identifiable connections between UBCs were not present in all brain slices and finding them required a careful search. UBC labeling was sparse due to the highly specific genetic labeling techniques and further sparsification by the Brainbow reporter, which made it impossible to estimate the density of these UBC to UBC connections. Electrophysiological evidences suggest that UBC to UBC connections are not common, because spontaneous EPSCs that would indicate a spontaneously firing presynaptic UBC are only rarely observed in UBCs recorded in acute brain slices. In an analysis of feedforward excitation of granule layer neurons, only 4 out of 140 UBCs had this indirect evidence of a firing presynaptic UBC (van Dorp and De Zeeuw, 2015), which suggests that UBC to UBC connections may be rare. On the other hand, previous work using organotypic slice cultures from P8 mice estimated that 2/3 of the UBC population receives input from other UBCs (Nunzi & Mugnaini, 2000). This suggests a much higher density of UBC to UBC connections, but could be due to the young age of the brains used, which is before UBCs have matured (Morin et al., 2001), and also due to increased collateral sprouting that can occur in culture (Jaeger et al., 1988). Another study imaged 2-4 week old rat cerebellar slices at an electron microscopic level and found that 4 out of 14 UBC axon terminals contacted UBC brushes (Diño et al., 2000). Future work is necessary to accurately estimate the density and impact of these feedforward UBC circuits.

      1. Lack of critical parameters in NEURON model.

      A) The model uses # of molecules of glutamate released as the presumed quantal content, and this factor is constant.

      However, no consideration of changes in # of vesicles released from single versus trains of APs from MFs or UBCs is included. At most simple synapses, two sequential APs alters release probability, either up or down, and release probability changes dynamically with trains of APs. It is therefore reasonable to imagine UBC axon release probability is at least as complicated, and given the large surface area of contact between two UBCs, the number of vesicles released for any given AP is also likely more complex.

      B) the model does not include desensitization of AMPA receptors, which in the case of UBCs can paradoxically reduce response magnitude as vesicle release and consequent glutamate concentration in the cleft increases (Linney et al. 1997 JNeurophysiol, Lu et al. 2017 Neuron, Balmer et al. 2021 eLIFE), as would occur with trains of stimuli at MF to ON-UBCs.

      A) The model produces synaptic AMPA and mGluR2 currents that reproduce those we recorded in vitro. We did not find it necessary to implement changes in glutamate release during a train as the model was fit to UBC data with the assumption that the glutamate transient did not change during the train. If there is a change in neurotransmitter release during a train, it is therefore built into the model, which has the advantage of reducing its complexity. UBCs are a special case where the postsynaptic currents are mediated mostly by the total amount of transmitter released. Most of the evoked current occurs tens to hundreds of milliseconds after neurotransmitter release and is therefore much more sensitive to total release and less sensitive to how it is released during the train. The figure below shows the effect of reducing the amount of glutamate released by 10% on each stimulus in the model. Despite a significant change in the pattern of neurotransmitter release, as well as a reduction in the total amount of glutamate, the slow EPSC still decays over the course of hundreds of milliseconds.

      B) The detailed kinetic AMPA receptor model used here accurately reproduces desensitization, which in fact mediates that the slow ON UBC current. This AMPA receptor is a 13-state model, including 4 open states with 1-4 glutamates bound, 4 closed states with 1-4 glutamates bound, 4 desensitized states with 1-4 glutamates bound, and 5 closed states with 0-4 glutamates bound. The forward and reverse rates between different states in the model were fit to AMPA receptor currents recorded from dissociated UBCs and they accurately reproduced the ON UBC currents evoked by synaptic stimulation in our previous work (Balmer et al., 2021).

      Author response image 3.

      Effect of short-term depression of neurotransmitter release. A) The top trace shows the glutamate transient that drives the AMPA receptor model used in our study. No change in release is implemented, although the slow tail of the transient summates during the train. The bottom trace shows the modeled AMPA receptor mediated current. B) In this model the amount of glutamate released on each stimulus is reduced by 10%. The duration of the slow AMPA current is similar, despite a profound change in the pattern of neurotransmitter exposure.

      While the authors have not added the suggested additional parameters, their clarifications regarding the implications of existing parameters, and demonstration of reasonable fits to experimental data, and lack of substantial effect of simulating reduced vesicle release probability,

      1. Lack of quantification of various electrophysiological responses. UBCs are defined (ON or OFF) based on inward or outward synaptic response, but no information is provided about the range of the key parameter of duration across cells, which seems most critical to the current considerations. There is a similar lack of quantification across cells of AP duration in response to stimulation or current injections, or during baseline. The latter lack is particularly problematic because in agreement with previous publications, the raw data in Fig. 1 shows ON UBCs as quiescent until MF stimulation and OFF UBCs firing spontaneously until MF stimulation, but, for example, at least one ON UBC in the NEURON model is firing spontaneously until synaptically activated by an OFF UBC (Fig. 11A), and an OFF UBC is silent until stimulated by a presynaptic OFF UBC (Fig. 11C). This may be expected/explainable theoretically, but then such cells should be observed in the raw data.

      To address this reasonable concern of a general lack of quantification of electrophysiological responses we have added data characterizing the slow inward and outward currents evoked by synaptic stimulation in GRP and P079 UBCs in the results section and in new panels in Figure 1. We report the action potential pause lengths in P079 UBCs and burst lengths in ON UBCs in the results section. However, we favor the duration of the currents to the length of burst and pause, because the currents do not depend on a stable resting membrane potential, which is itself difficult to determine in intracellular recordings of these small cells. In a series of recent publications that focused on UBC firing, the authors argue that cell-attached recordings are necessary to determine accurately the burst and pause lengths, as well as spontaneous firing rates (Guo et al., 2021; Huson et al., 2023). (The trade-off of these extracellular recordings is that the monosynaptic nature of the input is nearly impossible to confirm.) Spontaneous firing rates were variable within both GRP and P079 UBCs from silent to firing regularly or in bursts, as previously reported (Kim et al., 2012; van Dorp and De Zeeuw, 2015). For clarity, we chose to model the GRP UBCs as silent unless receiving synaptic input and P079 UBCs as active unless receiving synaptic input. As the reviewer suggests, we have observed UBCs firing in the patterns similar to those shown in the model UBCs having input from spontaneous presynaptic UBCs. Below are some examples of spontaneous EPSCs and IPSCs in UBCs that suggest the presence of a presynaptic UBC.

      Author response image 4.

      Examples of UBCs that receive spontaneous input. A) Three ON UBCs that had spontaneous EPSCs, suggesting the presence of an active presynaptic UBC. B) Two OFF UBCs that had spontaneous outward currents.

      The authors have added additional analysis and discussion, which adequately addresses this concern.

      Reviewer #2 (Public Review):

      In this paper, the authors presented a compelling rationale for investigating the role of UBCs in prolonging and diversifying signals. Based on the two types of UBCs known as ON and OFF UBC subtypes, they have highlighted the existing gaps in understanding UBCs connectivity and the need to investigate whether UBCs target UBCs of the same subtype, different subtypes, or both. The importance of this knowledge is for understanding how sensory signals are extended and diversified in the granule cell layer.

      The authors designed very interesting approaches to study UBCs connectivity by utilizing transgenic mice expressing GFP and RFP in UBCs, Brainbow approach, immunohistochemical and electrophysiological analysis, and computational models to understand how the feed-forward circuits of interconnected UBCs transform their inputs.

      This study provided evidence for the existence of distinct ON and OFF UBC subtypes based on their electrophysiological properties, anatomical characteristics, and expression patterns of mGluR1 and calretinin in the cerebellum. The findings support the classification of GRP UBCs as ON UBCs and P079 UBCs as OFF UBCs and suggest the presence of synaptic connections between the ON and OFF UBC subtypes. In addition, they found that GRP and P079 UBCs form parallel and convergent pathways and have different membrane capacitance and excitability. Furthermore, they showed that UBCs of the same subtype provide input to one another and modify the input to granule cells, which could provide a circuit mechanism to diversify and extend the pattern of spiking produced by mossy fiber input. Accordingly, they suggested that these transformations could provide a circuit mechanism for maintaining a sensory representation of movement for seconds.

      Overall, the article is well written in a sound detailed format, very interesting with excellent discovery and suggested model.

      I believe the authors have provided appropriate responses and have consequently revised the manuscript in a convincing manner. Although I am not an expert in physiology, I find the explanations and clarifications to be acceptable.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Recommendations for the authors):

      Major comments

      (1) Line 201: The threshold of 0.25 was maintained to select enriched genes, which minimize the value of the GO term enrichment analyses. It may notably explain why the term phagosome is enriched in cluster 7, while experimental data indicate that cluster 7 is not phagocytic. In addition, the authors mentioned in the 1st response to reviewer that they would include DotPlot to illustrate the specificity of the genes corresponding to the main GO terms. This should notably include the ribosomal genes found enriched in cluster 4, which constitute the basis used by the authors to call cluster 4 the progenitor cluster.

      We appreciate the reviewer’s concern regarding our chosen log2FC threshold (0.25) for GO term enrichment. To assess the robustness of our approach, we tested more stringent thresholds (e.g., 0.5) and verified that our overall interpretations remain consistent. However, we acknowledge that certain GO terms, such as phagosome, may appear in clusters that are not primarily phagocytic. This is likely due to the fact that genes involved in vesicle trafficking, endo-lysosomal compartments and intracellular degradation processes overlap with those classically associated with phagocytosis.

      Therefore, the KEGG-based enrichment of phagosome in cluster 7 does not necessarily imply active phagocytosis but could instead reflect these alternative vesicular processes. As we show, cluster 7 correspond to vesicular cells, and as seen in cytology we named these cells after their very high content of vesicular structures. As functional annotation based solely on transcriptomic data can sometimes lead to overinterpretations, we emphasize the importance of biological validation, which we have partially addressed through functional assays in this study.

      Regarding the specificity of ribosomal gene expression in cluster 4, we analyzed the distribution of ribosomal genes expressed across all clusters, as shown in Supplementary Figure S1-J. This analysis demonstrates that cluster 4 is specifically enriched in ribosome-related genes, reinforcing its characterization as a transcriptionally active population. Given that ribosomal gene expression is a key feature often associated with proliferative or metabolically active cells, these findings support our initial interpretation that cluster 4 may represent an undifferentiated or progenitor-like population.

      We acknowledge the reviewer’s suggestion to include a DotPlot to further illustrate the specificity of these genes in cluster 4. However, we believe that Supplementary Figure S1-J already effectively demonstrates this enrichment by presenting the percentage of ribosomal genes per cluster. A DotPlot representation would primarily convey the same information in a different format, but without providing additional insight into the specificity of ribosomal gene expression within cluster 4.

      (2) The lineage analysis is highly speculative and based on weak evidences. Initiating the hemocyte lineage to C4 is based on rRNA expression levels. C6 would constitute a better candidate, notably with the expression of PU-1, ELF2 and GATA3 that regulate progenitors differentiation in mammals (doi: 10.3389/fimmu.2019.00228, doi:10.1128/microbiolspec.mchd-0024-2, doi: 10.1098/rsob.180152) while C4 do not display any specific transcription factors (Figure 7I). In addition, the representation and interpretation of the transcriptome dynamics in the different lineages are erroneous. There are major inconsistencies between the data shown in the heatmaps Fig7C-H, Fig S10 and the dotplot in Fig7I. For example, Gata3 (G31054) and CgTFEB (G30997) illustrate the inconsistency. Fig S10C show GATA3 going down from cluster 4 to cluster 6 while Fig 7I show an increase level of expression in 6 compared to 4. CgTFEB (G30997) decrease from C4 to VC in Fig 7F while it increases according to Fig 7I. At last, Figure 7D: the umap show transition from C4 to C5 while the heatmap mention C4 to C6 (I believe there is a mix up with Figure 7E.

      We sincerely apologize for the inconsistencies noted between the different panels of Figure 7. These discrepancies resulted from using an incorrect matrix dataset during the initial representation. To address this issue, we have fully reprocessed the data and now provide a corrected and improved depiction of gene expression dynamics along the pseudotime trajectory. We are grateful to the reviewer for having help us to correct theses mistakes.

      In the revised version, we offer a comprehensive and consistent representation of expression level variations for key genes identified by the Monocle3 algorithm. Supplementary Figure S10 now presents the average expression variation of these significant genes as a function of pseudotime. Based on this dataset, we carefully selected representative genes to construct panels C to H of Figure 7, ensuring coherence across all figures. These updated panels show both average expression levels and the percentage of expressing cells along the pseudotime trajectory, providing a clearer interpretation of transcriptomic dynamics.

      We appreciate the reviewer’s helpful feedback regarding our lineage analysis and the suggestion that cluster 6 might be a more appropriate progenitor based on the expression of mammalian-like transcription factors such as PU-1, ELF2, and GATA3. Below, we clarify our rationale for choosing cluster 4 as the root of the pseudotime and discuss the functional implications of the identified transcription factors.

      We can hypothesize that clusters 4, 5, or 6 could each potentially represent early progenitor-like states, as these three clusters are transcriptionally close (Lines 539-541). These clusters have not yet been conclusively identified in terms of classical hemocyte morphology, and they appear to arise from ABL- or BBL-type cells. Our decision to root the pseudotime at cluster 4 was motivated by its strong expression of core transcription and translation genes, suggesting a particular stage of translation activity that was not observed for cluster 5 or cluster 6. Cluster 5 and 6 may correspond to a similar population of cells, most probably Blast-Like cells at different stages of cell cycle or differentiation engagement.

      Although cluster 6 expresses PU-1, ELF2, and GATA3, which are known regulators of haematopoietic progenitor differentiation in vertebrates, it is essential to highlight that structural homology does not necessarily imply functional equivalence. Moreover, the expression of PU-1, ELF2, and GATA3 does not strictly characterize a population as “undifferentiated” or progenitor-like. Studies such as those by Buenrostro et al. (Cell, 2018) have demonstrated that these transcription factors can remain active in or reemerge during more lineage-committed stages. For instance, PU-1 is essential for myeloid and B-cell differentiation, GATA3 is involved in T-lymphocyte lineage commitment (though transiently expressed in early progenitors), and ELF2 participates in lineage-specific pathways. Thus, their presence does not imply a primitive state but rather highlights their broader functional roles in guiding and refining lineage decisions. Functional annotation of these transcription factors in invertebrate systems remains speculative, particularly as morphological or molecular markers specific to these early hemocyte lineages are not yet fully established. Further functional assays (e.g., knockdown/overexpression or lineage tracing using cells (ABL and BBL) from clusters 4, 5 and 6) will be necessary to determine which hemocyte population harbor progenitor properties and differentiation potential.

      To further address the reviewer’s concern, we performed complementary pseudotime analyses by initiating Monocle 3 trajectories from clusters 4, 5, and 6 individually, as well as collectively (4/5/6). These analyses (see attached figure) confirm that the overall differentiation topology remains unchanged regardless of the selected root, consistently revealing two main pathways: one leading to hyalinocytes and the other to the granular lineage (ML, SGC, and VC). This consistency strongly suggests that clusters 4, 5, and 6 represent related pools of progenitor-like cells. Therefore, choosing cluster 4 based on its transcription/translation readiness does not alter the inferred branching architecture of hemocyte differentiation.

      We appreciate the reviewer’s suggestions, which have helped us improve our manuscript and clarify our rationale.

      Author response image 1.

      Representation of the trajectories obtained from Monocle3 analysis using different pseudotime origins, showing that changing the rooting did not alter the overall differentiation topology. (A) Pathways identified with cluster 4, (B) cluster 5, (C) cluster 6, and (D) cluster 4/5/6 origins.

      (3) Concerning the AMP expression analysis in Figure 6: the qPCR data show that Cg-BPI and Cg-Defh are expressed broadly in all fractions including 6 and 7, which is in conflict with the statement Line 473 indicating that SGC (fractions 6 and 7) is not expressing AMP. In addition, this analysis should be combined with the expression profile of all AMP in the scRNAseq data (list available in 10.1016/j.fsi.2015.02.040).

      We thank the reviewer for highlighting this point. We acknowledge that the qPCR data show expression of Cg-BPI and Cg-Defh across all fractions, including fractions 6 and 7 corresponding to SGC. However, our conclusion that SGCs do not express antimicrobial peptides (AMPs) was based on a correlation analysis rather than direct detection of AMPs in granular cells. Specifically, the qPCR experiments were designed to measure AMP expression levels in fractionated hemocyte populations relative to a control sample of whole hemolymph. We then performed a correlation analysis between AMP expression levels and the proportion of each hemocyte type in the fractions. This approach allowed us to infer a lower expression of AMP in granular cells, as reflected in the heatmap presented in Figure 6.

      Regarding the suggestion to integrate AMP expression profiles from scRNA-seq data, we wrote that the limited sequencing depth of our scRNA-seq analysis was insufficient to accurately detect AMP expression (Ligne 472-473 → “However, due to the limited sequencing depth, the scRNA-seq analysis was not sensitive enough to reveal AMP expression.”.  Additionally, many of the known AMPs of Crassostrea gigas are not annotated in the genome, further complicating their identification within the scRNA-seq dataset. As a result, we were unable to perform the requested integration of AMP expression profiles from scRNA-seq data.

      (4) The transcription factor expression analysis is descriptive and the interpretation too partial. These data should be compared with other systems. Most transcription factors show functional conservation, notably in the inflammatory pathways, which can provide valuable information to understand the function of the clusters 5 and 6 for which limited data are available.

      We appreciate the reviewer’s suggestion to compare the identified transcription factors with other systems. However, since we did not perform a detailed phylogenetic analysis of the transcription factors identified in our dataset, we refrain from making assumptions about their functional conservation across species. Our analysis aims to provide a descriptive overview of transcription factor expression patterns in hemocyte clusters, which serves as a foundation for future functional studies. While transcription factor profiles may provide insights into the potential roles of clusters 5 and 6, assigning precise functions based solely on bioinformatic predictions remains speculative. Further experimental validation, including functional assays and evolutionary analyses, would be necessary to confirm the roles of these transcription factors, which is beyond the scope of the present study.

      Minor comments

      Line 212-213: the text should be reformulated. In the result part, it is more important to mention that the reannotation is based on conserved proteins functions than to mention the tool Orson.

      We have reworded this section to emphasize that the updated annotation is function-based, using Orson primarily as the bioinformatics tool for improved GO annotation. We now place the emphasis on the conserved protein functions underlying the reannotation. Lines 212-215 : “Using the Orson pipeline (see Materials and Methods), these files were used to extract and process the longest CDSs for GO-term annotation, and we then reannotated each predicted protein by sequence homology, assigning putative functions and improving downstream GO-term analyses.”

      Figure 2: I would recommend homogenizing the two Dotplot representation with the same color gradient and representing the gene numbers in both case.

      We appreciate the reviewer’s suggestion to improve the clarity and consistency of Figure 2. In response, we have homogenized the color gradients across the two DotPlot representations and have included gene numbers in both cases to ensure a more uniform and informative visualization.

      Table 2: pct1 and pct2 should be presented individually like in table 1

      We now present these columns separately (pct1, pct2) as in Table 1, so readers can compare the fraction of expressing cells in each cluster more transparently.

      Line 403-414: how many cells were quantified for the phagocytic experiments ?

      We have added the exact number of cells that were counted to determine phagocytic indices and the number of technical/biological replicates. Line 411, the text was modified : “Macrophage-like cells and small granule cells showed a phagocytic activity of 49 % and 55 %, respectively, and a phagocytosis index of 3.5 and 5.2 particles per cell respectively (Fig. 5B and Supp. Fig. 7B), as confirmed in 3 independent experiments examining a total of 2,807 cells.”

      Line 458: for copper staining, how many cells and how many replicates were done for the quantification ?

      We have specified the number of hemocytes and number of independent replicates used when quantifying rhodanine-stained (copper-accumulating) cells. Line 458 the following text was added : “and a total of 1,562 cells were examined across three independent experiments.”

      Line 461: what are the authors referring to when mentioning the link between copper homeostasis and scRNAseq?

      Single-cell RNA sequencing (scRNA-seq) analysis revealed an upregulation of several copper transport– related genes, including G4790 (a copper transporter) with a 2.7 log2FC and a pct ratio of 42, as well as the divalent cation transporters G5864 (zinc transporter ZIP10) and G4920 (zinc transporter 8), specifically in cluster 3 cells identified as small granule cells. These findings reinforce a potential role for this cluster in metal homeostasis.

      We modified lines 462-467 as : “ These results provide functional evidence that small granule cells (SGCs) are specialized in metal homeostasis in addition to phagocytosis, as suggested by the scRNA-seq data identifying cluster 3. Specifically, single-cell RNA sequencing revealed an upregulation of copper transport– related genes, including G4790 (a copper transporter) with a 2.7 log2FC and a pct ratio of 42, reinforcing the role of SGCs in copper homeostasis (see Supp. File S1).”

      Line 611: it would be nice to display the enrichment of the phagocytic receptor in cluster 3 (dotplot or feature plot) to illustrate the comment.

      We appreciate the reviewer’s insightful suggestion regarding a more comprehensive analysis of phagocytic receptors. While a full inventory is beyond the scope of this study, we acknowledge the value of such an approach and hope that our findings will serve as a foundation for future investigations in this direction.

      Although we have highlighted certain phagocytic receptors (e.g., a scavenger receptor domain-containing gene) in our scRNA-seq dataset, it is beyond the scope of the current study to inventory all phagocytosisrelated receptors in the C. gigas genome, which itself would be a substantial undertaking. Moreover, singlecell RNA sequencing captures only about 15–20% of each cell’s mRNA, so we inherently lose a significant portion of the transcriptome, further limiting our ability to pinpoint all relevant phagocytic receptor genes. Adding more figures to cover every candidate receptor would risk overloading this paper, thus we focus on the most prominent examples. A promising approach for more exhaustive analysis would involve efficiently isolating granulocytes (e.g., via Percoll gradient) and performing targeted RNA-seq on this cell population to thoroughly explore genes involved in phagocytosis.

      Line 640-644: the authors mentioned that ML may be able to perform ETosis based on the oxidative burst.

      This hypothesis requires further evidences. Are other markers of ETosis expressed in this cell type?

      We agree that additional experimental evidence (e.g., detection of histone citrullination, extracellular DNA networks) is necessary to confirm ETosis in molluscan immune cells. We present ML-mediated ETosis only as a speculative possibility based on oxidative burst capacity as it was shown in different pieces of work that ETosis is inhibited by NADPH inhibitors (Poirier et al. 2014). Nevertheless, the expression of histones in the macrophage-like cluster (cluster 1) reinforces this possibility, as histone modifications play a key role in chromatin decondensation during ETosis.

      Reviewer #2 (Recommendations for the authors):

      Figure 1: In Figure 1B, the cell clusters are named 1 to 7, whereas in Figure 1C they are displayed as clusters 0 to 6. There is a mismatch between the identification of the clusters.

      We thank the reviewer for identifying this inconsistency. The cluster numbering has been corrected to ensure consistency between Figures 1B and 1C.

      Figure 2B: the font size could be increased for greater clarity.

      We thank the reviewer for this suggestion. The font size in Figure 2B has been increased to improve clarity and readability.

      Line 221: "Figures 2B, C and D" appears to refer to Figure S2 rather than the main Figure 2.

      The text has been corrected to properly reference the figure.

      Line 754: "Anopheles gambiae" should be italicised

      We thank the reviewer for pointing this out. "Anopheles gambiae" has been italicized accordingly.

      Bibliography

      Integrated Single-Cell Analysis Maps the Continuous Regulatory Landscape of Human Hematopoietic

      Differentiation. Buenrostro, Jason D. et al. Cell, Volume 173, Issue 6, 1535 - 1548.e16

      Antimicrobial Histones and DNA Traps in Invertebrate Immunity

      Poirier, Aurore C. et al. Journal of Biological Chemistry, Volume 289, Issue 36, 24821 - 24831

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Shen et al. conducted three experiments to study the cortical tracking of the natural rhythms involved in biological motion (BM), and whether these involve audiovisual integration (AVI). They presented participants with visual (dot) motion and/or the sound of a walking person. They found that EEG activity tracks the step rhythm, as well as the gait (2-step cycle) rhythm. The gait rhythm specifically is tracked superadditively (power for A+V condition is higher than the sum of the A-only and V-only condition, Experiments 1a/b), which is independent of the specific step frequency (Experiment 1b). Furthermore, audiovisual integration during tracking of gait was specific to BM, as it was absent (that is, the audiovisual congruency effect) when the walking dot motion was vertically inverted (Experiment 2). Finally, the study shows that an individual's autistic traits are negatively correlated with the BM-AVI congruency effect.

      Strengths:

      The three experiments are well designed and the various conditions are well controlled. The rationale of the study is clear, and the manuscript is pleasant to read. The analysis choices are easy to follow, and mostly appropriate.

      Weaknesses:

      There is a concern of double-dipping in one of the tests (Experiment 2, Figure 3: interaction of Upright/Inverted X Congruent/Incongruent). I raised this concern on the original submission, and it has not been resolved properly. The follow-up statistical test (after channel selection using the interaction contrast permutation test) still is geared towards that same contrast, even though the latter is now being tested differently. (Perhaps not explicitly testing the interaction, but in essence still testing the same.) A very simple solution would be to remove the post-hoc statistical tests and simply acknowledge that you're comparing simple means, while the statistical assessment was already taken care of using the permutation test. (In other words: the data appear compelling because of the cluster test, but NOT because of the subsequent t-tests.)

      We are sorry that we did not explain this issue clearly before, which might have caused some misunderstanding. When performing the cluster-based permutation test, we only tested whether the audiovisual congruency effect (congruent vs. incongruent) between the upright and inverted conditions was significantly different [i.e., (UprCon – UprInc) vs. (InvCon – InvInc)], without conducting extra statistical analyses on whether the congruency effect was significant in each orientation condition. Such an analysis yielded a cluster with a significant interaction between audiovisual integration and BM orientation for the cortical tracking effect at 1Hz (but not at 2Hz). However, this does not provide valid information about whether the audiovisual congruency effect at this cluster is significant in each orientation condition, given that a significant interaction effect may result from various patterns of data across conditions: such as significant congruency effects in both orientation conditions (Author response image 1a), a significant congruency effect in the upright condition and a non-significant effect in the inverted condition (Author response image 1b), or even non-significant yet opposite effects in the two conditions (Author response image 1c). Here, our results conform to the second pattern, indicating that cortical tracking of the high-order gait cycles involves a domain-specific process exclusively engaged in the AVI of BM. In a similar vein, the non-significant interaction found at 2Hz does not necessarily indicate that the congruency effect is non-significant in each orientation condition (Author response image 1f&e). Indeed, the congruency effect was significant in both the upright and inverted conditions at 2Hz in our study despite the non-significant interaction, suggesting that neural tracking of the lower-order step cycles is associated with a domain-general AVI process mostly driven by temporal correspondence in physical stimuli.

      Therefore, we need to perform subsequent t-tests to examine the significance of the simple effects in the two orientation conditions, which do not duplicate the clusterbased permutation test (for interaction only) and cause no double-dipping. Results from interaction and simple effects, put together, provide solid evidence that the cortical tracking of higher-order and lower-order rhythms involves BM-specific and domaingeneral audiovisual processing, respectively.

      To avoid ambiguity, we have removed the sentence “We calculated the audiovisual congruency effect for the upright and the inverted conditions” (line 194, which referred to the calculation of the indices rather than any statistical tests) from the manuscript. We have also clarified the meanings of the findings based on the interaction and simple effects together at the two temporal scales, respectively (Lines 205-207; Lines 213-215).

      Author response image 1.

      Examples of different patterns of data yielding a significant or nonsignificant interaction effect.

      Reviewer #2 (Public review):

      Summary:

      The authors evaluate spectral changes in electroencephalography (EEG) data as a function of the congruency of audio and visual information associated with biological motion (BM) or non-biological motion. The results show supra-additive power gains in the neural response to gait dynamics, with trials in which audio and visual information was presented simultaneously producing higher average amplitude than the combined average power for auditory and visual conditions alone. Further analyses suggest that such supra-additivity is specific to BM and emerges from temporoparietal areas. The authors also find that the BM-specific supra-additivity is negatively correlated with autism traits.

      Strengths:

      The manuscript is well-written, with a concise and clear writing style. The visual presentation is largely clear. The study involves multiple experiments with different participant groups. Each experiment involves specific considered changes to the experimental paradigm that both replicate the previous experiment's finding yet extend it in a relevant manner.

      Weaknesses:

      In the revised version of the paper, the manuscript better relays the results and anticipates analyses, and this version adequately resolves some concerns I had about analysis details. Still, it is my view that the findings of the study are basic neural correlate results that do not provide insights into neural mechanisms or the causal relevance of neural effects towards behavior and cognition. The presence of an inversion effect suggests that the supra-additivity is related to cognition, but that leaves open whether any detected neural pattern is actually consequential for multi-sensory integration (i.e., correlation is not causation). In other words, the fact that frequency-specific neural responses to the [audio & visual] condition are stronger than those to [audio] and [visual] combined does not mean this has implications for behavioral performance. While the correlation to autism traits could suggest some relation to behavior and is interesting in its own right, this correlation is a highly indirect way of assessing behavioral relevance. It would be helpful to test the relevance of supra-additive cortical tracking on a behavioral task directly related to the processing of biological motion to justify the claim that inputs are being integrated in the service of behavior. Under either framework, cortical tracking or entrainment, the causal relevance of neural findings toward cognition is lacking.

      Overall, I believe this study finds neural correlates of biological motion, and it is possible that such neural correlates relate to behaviorally relevant neural mechanisms, but based on the current task and associated analyses this has not been shown.

      Thank you for providing these thoughtful comments regarding the theoretical implications of our neural findings. Previous behavioral evidence highlights the specificity of the audiovisual integration (AVI) of biological motion (BM) and reveals the impairment of such ability in individuals with autism spectrum disorder. However, the neural implementation underlying the AVI of BM, its specificity, and its association with autistic traits remain largely unknown. The current study aimed to address these issues.

      It is noteworthy that the operation of multisensory integration does not always depend on specific tasks, as our brains tend to integrate signals from different sensory modalities even when there is no explicit task. Hence, many studies have investigated multisensory integration at the neural level without examining its correlation with behavioral performance. For example, the widely known super-additivity mode for multisensory integration proposed by Perrault and colleagues was based on single-cell recording findings without behavioral tasks (Perrault et al., 2003, 2005). As we mentioned in the manuscript, the super-additive and sub-additive modes indicate non-linear interaction processing, either with potentiated neural activation to facilitate the perception or detection of near-threshold signals (super-additive) or a deactivation mechanism to minimize the processing of redundant information cross-modally (subadditive) (Laurienti et al., 2005; Metzger et al., 2020; Stanford et al., 2005; Wright et al., 2003). Meanwhile, the additive integration mode represents a linear combination between two modalities. Distinguishing among these integration modes helps elucidate the neural mechanism underlying AVI in specific contexts, even though sometimes, the neural-level AVI effects do not directly correspond to a significant behavioral-level AVI effect (Ahmed et al., 2023; Metzger et al., 2020). In the current study, we unveiled the dissociation of multisensory integration modes between neural responses at two temporal scales (Exps. 1a & 1b), which may involve the cooperation of a domain-specific and a domain-general AVI processes (Exp. 2). While these findings were not expected to be captured by a single behavioral index, they revealed the multifaceted mechanism whereby hierarchical cortical activity supports audiovisual BM integration. They also advance our understanding of the emerging view that multi-timescale neural dynamics coordinate multisensory integration (Senkowski & Engel, 2024), especially from the perspective of natural stimuli processing.

      Meanwhile, our finding that the cortical tracking of higher-order rhythmic structure in audiovisual BM specifically correlated with individual autistic traits extends previous behavioral evidence that ASD children exhibited reduced orienting to audiovisual synchrony in BM (Falck-Ytter et al., 2018), offering new evidence that individual differences in audiovisual BM processing are present at the neural level and associated with autistic traits. This finding opens the possibility of utilizing the cortical tracking of BM as a potential neural maker to assist the diagnosis of autism spectrum disorder (see more details in our Discussion Lines 334-346).

      However, despite the main objective of the current study focusing on the neural processing of BM, we agree with the reviewer that it would be helpful to test the relevance of supra-additive cortical tracking on a behavioral task directly related to BM perception, for further justifying that inputs are being integrated in the service of behavior. In the current study, we adopted a color-change detection task entirely unrelated to audiovisual correspondence but only for maintaining participants’ attention. The advantage of this design is that it allows us to investigate whether and how the human brain integrates audiovisual BM information under task-irrelevant settings, as people in daily life can integrate such information even without a relevant task. However, this advantage is accompanied by a limitation: the task does not facilitate the direct examination of the correlation between neural responses and behavioral performance, since the task performance was generally high (mean accuracy >98% in all experiments). Future research could investigate this issue by introducing behavioral tasks more relevant to BM perception (e.g., Shen et al., 2023). They could also apply advanced neuromodulation techniques to elucidate the causal relevance of the cortical tracking effect to behavior (e.g., Ko sem et al., 2018, 2020).

      We have discussed the abovementioned points as a separate paragraph in the revised manuscript (Lines 322-333). In addition, since the scope of the current study does not involve a causal correlation with behavioral performance, we have removed or modified the descriptions related to "functional relevance" in the manuscript (Abstract; Introduction, lines 101-103; Results, lines 239; Discussion, line 336; Supplementary Information, line 794、803). Moreover, we have strengthened the descriptions of the theoretical implications of the current findings in the abstract.

      We hope these changes adequately address your concern.

      References

      Ahmed, F., Nidiffer, A. R., O’Sullivan, A. E., Zuk, N. J., & Lalor, E. C. (2023). The integration of continuous audio and visual speech in a cocktail-party environment depends on attention. Neuroimage, 274, 120143. https://doi.org/10.1016/j.neuroimage.2023.120143

      Falck-Ytter, T., Nystro m, P., Gredeba ck, G., Gliga, T., Bo lte, S., & the EASE team. (2018). Reduced orienting to audiovisual synchrony in infancy predicts autism diagnosis at 3 years of age. Journal of Child Psychology and Psychiatry, 59(8), 872–880. https://doi.org/10.1111/jcpp.12863

      Ko sem, A., Bosker, H., Jensen, O., Hagoort, P., & Riecke, L. (2020). Biasing the Perception of Spoken Words with Transcranial Alternating Current Stimulation. Journal of Cognitive Neuroscience, 32, 1–10. https://doi.org/10.1162/jocn_a_01579

      Ko sem, A., Bosker, H. R., Takashima, A., Meyer, A., Jensen, O., & Hagoort, P. (2018). Neural Entrainment Determines the Words We Hear. Current Biology, 28(18), 2867-2875.e3. https://doi.org/10.1016/j.cub.2018.07.023

      Laurienti, P. J., Perrault, T. J., Stanford, T. R., Wallace, M. T., & Stein, B. E. (2005). On the use of superadditivity as a metric for characterizing multisensory integration in functional neuroimaging studies. Experimental Brain Research, 166(3), 289–297. https://doi.org/10.1007/s00221-005-2370-2

      Metzger, B. A., Magnotti, J. F., Wang, Z., Nesbitt, E., Karas, P. J., Yoshor, D., & Beauchamp, M. S. (2020). Responses to Visual Speech in Human Posterior Superior Temporal Gyrus Examined with iEEG Deconvolution. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 40(36), 6938–6948. https://doi.org/10.1523/JNEUROSCI.0279-20.2020

      Perrault, T. J., Vaughan, J. W., Stein, B. E., & Wallace, M. T. (2003). Neuron-Specific Response Characteristics Predict the Magnitude of Multisensory Integration. Journal of Neurophysiology, 90(6), 4022–4026. https://doi.org/10.1152/jn.00494.2003

      Perrault, T. J., Vaughan, J. W., Stein, B. E., & Wallace, M. T. (2005). Superior Colliculus Neurons Use Distinct Operational Modes in the Integration of Multisensory Stimuli. Journal of Neurophysiology, 93(5), 2575–2586. https://doi.org/10.1152/jn.00926.2004

      Senkowski, D., & Engel, A. K. (2024). Multi-timescale neural dynamics for multisensory integration. Nature Reviews Neuroscience, 25(9), 625–642. https://doi.org/10.1038/s41583-024-00845-7

      Shen, L., Lu, X., Wang, Y., & Jiang, Y. (2023). Audiovisual correspondence facilitates the visual search for biological motion. Psychonomic Bulletin & Review, 30(6), 2272–2281. https://doi.org/10.3758/s13423-023-02308-z

      Stanford, T. R., Quessy, S., & Stein, B. E. (2005). Evaluating the Operations Underlying Multisensory Integration in the Cat Superior Colliculus. Journal of Neuroscience, 25(28), 6499–6508. https://doi.org/10.1523/JNEUROSCI.5095-04.2005

      Wright, T. M., Pelphrey, K. A., Allison, T., McKeown, M. J., & McCarthy, G. (2003). Polysensory Interactions along Lateral Temporal Regions Evoked by Audiovisual Speech. Cerebral Cortex, 13(10), 1034–1043. https://doi.org/10.1093/cercor/13.10.1034

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review):

      Summary:

      In this work, Qiu and colleagues examined the effects of preovulatory (i.e., proestrous or late follicular phase) levels of circulating estradiol on multiple calcium and potassium channel conductances in arcuate nucleus kisspeptin neurons. Although these cells are strongly linked to a role as the "GnRH pulse generator," the goal here was to examine the physiological properties of these cells in a hormonal milieu mimicking late proestrus, the time of the preovulatory GnRH-LH surge. Computational modeling is used to manipulate multiple conductances simultaneously and support a role for certain calcium channels in facilitating a switch in firing mode from tonic to bursting. CRISPR knockdown of the TRPC5 channel reduced overall excitability, but this was only examined in cells from ovariectomized mice without estradiol treatment.

      Comments to address most recent author response:

      The concern regarding the CRISPR experiments being confined to OVX mice is that the results can only suggest that CRISPR-mediated knockdown of TRPC5 can, at best, phenocopy the OVX+E condition. A reciprocal experiment in the opposite direction (for example, that returning TRPC5 to OVX levels in OVX+E mice prevents the changes in firing activity and pattern typical of the OVX+E2 condition) would strengthen the indication that E2-sensitive changes in TRPC5 expression and function are critically important to surge function. Acknowledging this as a limitation of the studies would help to better contextualize the value of the CRISPR experiments to an understanding of surge mechanisms when done only in OVX conditions.

      We have noted in the manuscript that “It would be of interest in future experiments to do the reciprocal experiment to see if overexpressing Trpc5 channels in Kiss1ARH neurons from OVX + E2 females restores the RMP and  “rescues” the synchronization phenotype.”

      The nature of the confusion regarding the consideration of OVX+E2 conditions in the computational model primarily arises from the methods description in the supplemental file: "The effect of E2 on ionic currents is modelled as a change in the maximum conductance parameter. For currents IM,IT, ICa and ITRPC5 this change is inferred from the qPCR data assuming that the conductance is directly proportional to the mRNA expression." If these were instead based on the whole-cell recordings as the authors now indicate in their response, then this description needs to be edited and clarified accordingly. Furthermore, the section states, "For ISK, IBK, Ileak, the OVX and OVX+E2 conductances are obtained from current-voltage relationships recorded from Kiss1ARH neurons in the absence/presence of iberiotoxin (BK blocker) and apamin (SK blocker). All other currents were assumed to be unaffected by E2." This section thus does not directly indicate that the recordings in the stated figures were used in the model, and moreover suggests that currents besides ISK, IBK, and Ileak were not different in OVX+E2 conditions.

      The prior evidence stated for correlation of mRNA and channel conductance is not explicitly cited in the manuscript. It is well known that post-translational modifications, physiological modulation of individual channel biophysical properties, and many other factors can influence the end output of a membrane conductance. Therefore, the authors should, at minimum, provide a literature citation supporting the assumption used here.

      We have re-written the paragraph on “Modelling the effects of E2” in the Supplemental Information (now Appendix 1)  to clarify the that the modeling was based on a combination of electrophysiological recordings and the qPCR data presented in this and previous publications. The statement that “all other currents were assumed to be unaffected by E2” was a misstatement and has been deleted. As per the reviewer’s request, we have listed seven publications that document the correlation between the mRNA expression and channel conductance for the various channels. We thank the reviewer for the suggestion.

      Reviewer #2 (Public review):

      Summary:

      Kisspeptin neurons of the arcuate nucleus (ARC) are thought to be responsible for the pulsatile GnRH secretory pattern and to mediate feedback regulation of GnRH secretion by estradiol (E2). Evidence in the literature, including the work of the authors, indicates that ARC kisspeptin coordinate their activity through reciprocal synaptic interactions and the release of glutamate and of neuropeptide neurokinin B (NKB), which they co-express. The authors show here that E2 regulates the expression of genes encoding different voltage-dependent calcium channels, calcium-dependent potassium channels and canonical transient receptor potential (TRPC5) channels and of the corresponding ionic currents in ARC kisspeptin neurons. Using computer simulations of the electrical activity of ARC kisspeptin neurons, the authors also provide evidence of what these changes translate into in terms of these cells' firing patterns. The experiments reveal that E2 upregulates various voltage-gated calcium currents as well as 2 subtypes of calcium-dependent potassium currents, while decreasing TRPC5 expression (an ion channel downstream of NKB receptor activation), the slow excitatory synaptic potentials (slow EPSP) elicited in ARC kisspeptin neurons by NKB release and expression of the G protein-associated inward-rectifying potassium channel (GIRK). Based on these results, and on those of computer simulations, the authors propose that E2 promotes a functional transition of ARC kisspeptin neurons from neuropeptide-mediated sustained firing that supports coordinated activity for pulsatile GnRH secretion to a less intense burst-like firing pattern that could favor glutamate release from ARC kisspeptin. The authors suggest that the latter might be important for the generation of the preovulatory surge in females.

      Strengths:

      The authors combined multiple approaches in vitro and in silico to gain insights into the impact of E2 on the electrical activity of ARC kisspeptin neurons. These include patch-clamp electrophysiology combined with selective optogenetic stimulation of ARC kisspeptin neurons, reverse transcriptase quantitative PCR, pharmacology and CRISPR-Cas9-mediated knockdown of the Trpc5 gene. The addition of computer simulations for understanding the impact of E2 on the electrical activity of ARC kisspeptin cells is also a strength.

      The authors add interesting information on the complement of ionic currents in ARC kisspeptin neurons and on their regulation by E2 to what was already known in the literature. Pharmacological and electrophysiological experiments appear of the highest standards and robust statistical analyses are provided throughout. The impact of E2 replacement on calcium and potassium currents is compelling. Likewise, the results of Trpc5 gene knockdown do provide good evidence that the TRPC5 channel plays a key role in mediating the NKB-mediated slow EPSP. Surprisingly, this also revealed an unsuspected role for this channel in regulating the membrane potential and excitability of ARC kisspeptin neurons.

      Weaknesses:

      The manuscript also has weaknesses that obscure some of the conclusions drawn by the authors.

      One is that the authors compare here two conditions, OVX versus OVX replaced with high E2, that may not reflect the physiological conditions under which the proposed transition between neuropeptide-dependent sustained firing and less intense burst firing might take place (i.e. the diestrous [low E2] and proestrous [high E2] stages of the estrous cycle). This is an important caveat to keep in mind when interpreting the authors' findings. Indeed, that E2 alters certain ionic currents when added back to OVX females, does not mean that the magnitude of all of these ionic currents will vary during the estrous cycle.

      We do know that the slow EPSP, which is generated by TRPC5 channels, tracks beautifully with the steroid state of female mice.  Using our E2 treatment paradigm that generates a LH surge in OVX females (left panel in Author response image 1), there is no difference in the amplitude of the slow EPSP in proestrous versus OVX + E2 females (right panel in Author response image 1).    

      Author response image 1.

      In addition, although the computational modeling indicates a role of the various E2-modulated conductances in causing a transition in ARC kisspeptin neuron firing pattern, their role is not directly tested in physiological recordings, weakening the link between these changes and the shift in firing patterns.

      In future experiments we will test directly the physiological contribution of the other E2-modulated conductances in causing the transition in the firing pattern of arcuate Kiss1 neurons using CRISPR/SaCas9 technology as we have documented for the TRPC5 channel (e.g., Figures 11 and 12).

      Overall, the manuscript provides interesting information about the effects of E2 on specific ionic currents in ARC kisspeptin neurons and some insights into the functional impact of these changes. However, some of the conclusions of the work, with regard, in particular, to the role of these changes in ion channels and to their implications for the LH surge, are not fully supported by the findings.

      ---------

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this work, Qiu and colleagues examined the effects of preovulatory (i.e., proestrous or late follicular phase) levels of circulating estradiol on multiple calcium and potassium channel conductances in arcuate nucleus kisspeptin neurons. Although these cells are strongly linked to a role as the "GnRH pulse generator," the goal here was to examine the physiological properties of these cells in a hormonal milieu mimicking late proestrus, the time of the preovulatory GnRH-LH surge. Computational modeling is used to manipulate multiple conductances simultaneously and support a role for certain calcium channels in facilitating a switch in firing mode from tonic to bursting. CRISPR knockdown of the TRPC5 channel reduced overall excitability, but this was only examined in cells from ovariectomized mice without estradiol treatment. The manuscript has been substantially improved from the initial version by the addition of new experiments and clarification of important figures. Importantly, the overlap of data with previous reports from the same group has been corrected.

      Strengths:

      (1) Examination of multiple types of calcium and potassium currents, both through electrophysiology and molecular biology.

      (2) Focus on arcuate kisspeptin neurons during the surge is relatively conceptually novel as the anteroventral periventricular nucleus (AVPV) kisspeptin neurons have received much more attention as the "surge generator" population.

      (3) The modeling studies allow for direct examination of manipulation of single and multiple conductances, whereas the electrophysiology studies necessarily require examination of each current in isolation. Construction of an arcuate kisspeptin neuron model promises to be of value to the reproductive neuroendocrinology field.

      Weaknesses:

      A remaining weakness in this revised version of the manuscript is that the relevance of the CRISPR experiments is still rather tenuous given that the goal is to understand what happens in the estrogen-treatment condition, and these experiments were performed only in OVX mice. Similar concerns reflect that the computational model examining the effect of E2 infers multiple conductances based on qPCR data and an assumption that the conductances are directionally proportional to the level of gene expression, and then tunes these to the current recordings obtained from OVX mice, without a direct confirmation in OVX+E2 conditions that the model parameters accurately reflect the properties of these currents in the presence of estrogen.

      We are still puzzled by Reviewer’s concerns about doing the CRISPRing of Trpc5 in the OVX+E2 females.  The Trpc5 channel expression is significantly reduced with the E2 treatment (Figure 10E) which we know translates into a minimal slow EPSP (Figure 2, Qiu eLife 2016) and is essentially equivalent to the slow EPSP amplitude in the Trpc5 mutagenesis in the ovariectomized females (Figure 12).  TRPC5 channel conductance is already at “rock bottom.”  The modeling informs us that such a low TRPC5 conductance will not support a long lasting slow EPSP and sustained firing (Figure 13A).

      Also, we respectively point out that we have published a score of papers over the past 20 years showing that the channel conductance does correlate with the mRNA expression (e.g., Qiu et al., eLife 2018).  Secondly, the model does take into consideration the OVX + E2 conditions (Figure 13B,C) which is based on the extensive whole-cell recordings presented in Figures 4,5,6,7,8 and 9.

      Reviewer #2 (Public Review):

      Summary:

      Kisspeptin neurons of the arcuate nucleus (ARC) are thought to be responsible for the pulsatile GnRH secretory pattern and to mediate feedback regulation of GnRH secretion by estradiol (E2). Evidence in the literature, including the work of the authors, indicates that ARC kisspeptin coordinate their activity through reciprocal synaptic interactions and the release of glutamate and of neuropeptide neurokinin B (NKB), which they co-express. The authors show here that E2 regulates the expression of genes encoding different voltage-dependent calcium channels, calcium-dependent potassium channels and canonical transient receptor potential (TRPC5) channels and of the corresponding ionic currents in ARC kisspeptin neurons. Using computer simulations of the electrical activity of ARC kisspeptin neurons, the authors also provide evidence of what these changes translate into in terms of these cells' firing patterns. The experiments reveal that E2 upregulates various voltage-gated calcium currents as well as 2 subtypes of calcium-dependent potassium currents while decreasing TRPC5 expression (an ion channel downstream of NKB receptor activation), the slow excitatory synaptic potentials (slow EPSP) elicited in ARC kisspeptin neurons by NKB release and expression of the G protein-associated inward-rectifying potassium channel (GIRK). Based on these results, and on those of computer simulations, the authors propose that E2 promotes a functional transition of ARC kisspeptin neurons from neuropeptide-mediated sustained firing that supports coordinated activity for pulsatile GnRH secretion to a less intense burst-like firing pattern that could favor glutamate release from ARC kisspeptin. The authors suggest that the latter might be important for the generation of the preovulatory surge in females.

      Strengths:

      The authors combined multiple approaches in vitro and in silico to gain insights into the impact of E2 on the electrical activity of ARC kisspeptin neurons. These include patch-clamp electrophysiology combined with selective optogenetic stimulation of ARC kisspeptin neurons, reverse transcriptase quantitative PCR, pharmacology and CRISPR-Cas9-mediated knockdown of the Trpc5 gene. The addition of computer simulations for understanding the impact of E2 on the electrical activity of ARC kisspeptin cells is also a strength.

      The authors add interesting information on the complement of ionic currents in ARC kisspeptin neurons and on their regulation by E2 to what was already known in the literature. Pharmacological and electrophysiological experiments appear of the highest standards and robust statistical analyses are provided throughout. The impact of E2 replacement on calcium and potassium currents is compelling. Likewise, the results of Trpc5 gene knockdown do provide good evidence that the TRPC5 channel plays a key role in mediating the NKB-mediated slow EPSP. Surprisingly, this also revealed an unsuspected role for this channel in regulating the membrane potential and excitability of ARC kisspeptin neurons.

      Weaknesses:

      The manuscript also has weaknesses that obscure some of the conclusions drawn by the authors.

      One is that the authors compare here two conditions, OVX versus OVX replaced with high E2, that may not reflect the physiological conditions under which the proposed transition between neuropeptide-dependent sustained firing and less intense burst firing might take place (i.e. the diestrous [low E2] and proestrous [high E2] stages of the estrous cycle). This is an important caveat to keep in mind when interpreting the authors' findings. Indeed, that E2 alters certain ionic currents when added back to OVX females, does not mean that the magnitude of all of these ionic currents will vary during the estrous cycle.

      Unfortunately, mice are a poor reproductive model since female mice do not have a clear follicular (estradiol-driven) phase distinctive from the luteal (progesterone-driven) phase.  Had we utilized a “proestrous” female, we could not with certainty distinguish between the effects of estradiol versus progesterone on the expression of the calcium and potassium channels that were the focus of this study.  Therefore, using our physiological model we can state with confidence that “estradiol elicits distinct firing patterns in arcuate nucleus kisspeptin neurons….”

      Overall, the manuscript provides interesting information about the effects of E2 on specific ionic currents in ARC kisspeptin neurons and some insights into the functional impact of these changes. However, some of the conclusions of the work, with regard, in particular, to the role of these changes in ion channels and their implications for the LH surge, are not fully supported by the findings.

      As we pointed out in the Discussion, the O’Byrne lab has clearly shown the relevance of Kiss1ARH neuronal burst firing and the release of glutamate to its effects on the LH surge:

      “Rather, we postulate that glutamate neurotransmission is more important for excitation of Kiss1AVPV/PeN neurons and facilitating the GnRH (LH) surge with high circulating levels of E2 when peptide neurotransmitters are at a nadir and glutamate levels are high in female Kiss1ARH neurons. Indeed, low frequency (5 Hz) optogenetic stimulation of Kiss1ARH neurons, which only releases glutamate in E2-treated, ovariectomized females (Qiu J. et al., 2016), generates a surge-like increase in LH release during periods of optical stimulation (Lin et al., 2021; Voliotis et al., 2021).  In a subsequent study optical stimulation of Kiss1ARH neuron terminals in the AVPV at 20 Hz, a frequency commonly used for terminal stimulation in vivo, generated a similar surge of LH (Shen et al., 2022).  Additionally, intra-AVPV infusion of glutamate antagonists, AP5+CNQX, completely blocked the LH surge induced by Kiss1ARH terminal photostimulation in the AVPV (Shen et al., 2022).”

      Recommendations for the authors:

      Reviewer #2 (Recommendations for The Authors):

      The reviewer noted the following in the revised manuscript:

      - page 6, the authors may consider adding that presynaptic effects of blocking calcium channels on the slow EPSP cannot be fully ruled out. Indeed, the added experiments do indicate that some of the effects can be explained by impaired regulation of TRPC5 channels by calcium influx through calcium channels; however, the senktide-induced current is not fully blocked by the broad-spectrum calcium channel inhibitor cadmium, suggesting that the effect of blocking these channels on the slow EPSP may involve other mechanisms, such as presynaptic effects.

      Optogenetic stimulation of all Kiss1ARH neurons induces the release of NKB at “physiological” concentrations, which in turn generates a slow EPSP in the recorded Kiss1ARH neuron. Blocking voltage-gated calcium channels can inhibit the NKB release from presynaptic  Kiss1ARH neurons, thereby reducing the amplitude of the slow EPSP. However, in whole-cell recordings of synaptically isolated Kiss1ARH neurons,  senktide directly induces a large inward current (Figure 3F), which is generated by the opening of TRPC5 channels (Qiu et al. J. Neurosci 2021). Voltage-gated calcium channels are coupled to the activation of TRPC5 channels (Blair, Kaczmarek and Clapham, J. Gen Physiol 2009), so by blocking voltage-gated calcium channels, cadmium effectively abrogates the facilitating effects of these channels on TRPC5 channel activation and significantly reduces but does not abolish the inward (excitatory) current (Figures 3F-H). We have clarified in the Results (page 6) that the Kiss1ARH neurons were synaptically isolated as depicted in Figures 3F,G.

      - page 8, bottom, the mean value given for the apamin-sensitive current amplitude in E2 treated females does not match that plotted on the I/V graph in Figure 7F.

      Thank you for pointing out this typographical error, which we have corrected.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This is an interesting and somewhat unusual paper supporting the idea that creatine is a neurotransmitter in the central nervous system of vertebrates. The idea is not entirely new, and the authors carefully weigh the evidence, both past and newly acquired, to make their case. The strength of the paper lies in the importance of the potential discovery - as the authors point out, creatine ticks more boxes on criteria of neurotransmitters than some of the ones listed in textbooks - and the list of known transmitters (currently 16) certainly is textbook material. A further strength of the manuscript is the careful consideration of a list of criteria for transmitters and newly acquired evidence for four of these criteria: 1. evidence that creatine is stored in synaptic vesicles, 2. mutants for creatine synthesis and a vesicular transporter show reduced storage and release of creatine, 3. functional measurement that creatine release has an excitatory or inhibitory (here inhibitory) effect in vivo, and 4. ATP-dependence. The key weakness of the paper is that there is no single clear 'smoking gun', like a postsynaptic creatine receptor, that would really demonstrate the function as a transmitter. Instead, the evidence is of a cumulative nature, and not all bits of evidence are equally strong. On balance, I found the path to discovery and the evidence assembled in this manuscript to establish a clear possibility, positive evidence, and to provide a foundation for further work in this direction.

      it is notable that, historically, no neurotransmitter has ever been established in a single paper. While creatine will not be an exception, data presented in this paper are more than any previous paper in demonstrating the possibility of a new neurotransmitter. However, we added an entire paragraph in the Discussion part about differences between Cr and classic neurotransmitters such as Glu, beginning with the absence of a molecularly defined receptor at this point and the Ca2+ independent component of Cr release induced by extracellular K+.

      We appreciate the reviewer for noting that evidence obtained by us now support that creatine satisfies all 4 criteria of transmitters.

      We respectively disagree the point about a smoking gun: any of these four is a smoking gun, while the satisfication of all 4 is quite strong, more than a smoking gun.

      We find it disagreeable that a receptor “would really demonstrate the function of a transmitter”. Textbook criteria for a transmitter usually require postsynaptic responses, not a molecularly defined receptor. A molecularly defined receptor for many of the known transmitters required many years of work, while they were accepted as transmitters before their receptors were finally molecularly defined. As long as there is a postsynaptic response, there is of course a receptor, though its molecular properties should be further studied. For examples, responses to choline were discovered in 1900 (Hunt, Am J Physiol 3, xviii-xix, 1900), those to acetylcholine in 1906 (Hunt and Taveau, Br Med J 2:1788-1789, 1906), those to supradrenal glands before 1894 (Oliver and Schäfer, J Physiol 18:230-276 1895). Henry Dale was awarded a Nobel prize in 1936 partly for his work on acetylcholine. Receptors for acetylcholine and noradrenaline were not molecularly defined until the 1970s and 1980s. Before then, they were only known by mediating responses to natural transmitters and synthesized chemicals.

      There were two previous reports that creatine could be taken into brain slices (Almeida et al., 2006) or synaptosomes (Peral, Vázquez-Carretero and Ilundain, 2010). These were used by the reviewer to argue that the idea of creatine as a neurotransmitter “is not entirely new”. However, no one has followed up these studies for 10 years, thus they would not be considered as good smoking guns. While we have reproduced the synaptosome uptake result (together with our new finding that this uptake was dependent on SLC6A8), it should be noted that uptake of molecules into synaptosomes is not absolutely required for a neurotransmitter because degradation of a transmitter is equally valid. Furthermore, molecules required synaptically but not as a transmitter can also be transported into the synaptic terminal.

      Our detection of Cr in the synaptic vesicles provides much stronger evidence supporting its importance. If a smoking gun is important, the detection of creatine in the SVs is the best smoking gun, whose discovery in fact was the reason leading us to study its release, postsynaptic responses as well as repeating the uptake experiment with genetic mutants.

      Reviewer #2 (Public Review):

      Summary:

      Bian et al studied creatine (Cr) in the context of central nervous system (CNS) function. They detected Cr in synaptic vesicles purified from mouse brains with anti-Synaptophysin using capillary electrophoresis-mass spectrometry. Cr levels in the synaptic vesicle fraction were reduced in mice lacking the Cr synthetase AGAT, or the Cr transporter SLC6A8. They provide evidence for Cr release within several minutes after treating brain slices with KCl. This KCl-induced Cr release was partially calcium-dependent and was attenuated in slices obtained from AGAT and SLC6A8 mutant mice. Cr application also decreased the excitability of cortical pyramidal cells in one third of the cells tested. Finally, they provide evidence for SLC6A8-dependent Cr uptake into synaptosomes, and ATP-dependent Cr loading into synaptic vesicles. Based on these data, the authors propose that Cr may act as a neurotransmitter in the CNS.

      Strengths:

      1) A major strength of the paper is the broad spectrum of tools used to investigate Cr.

      2) The study provides strong evidence that Cr is present in/loaded into synaptic vesicles.

      Weaknesses:

      (in sequential order)

      1) Are Cr levels indeed reduced in Agat-/-? The decrease in Cr IgG in Agat-/- (and Agat+/-) is similar to the corresponding decrease in Syp (Fig. 3B). What is the explanation for this? Is the decrease in Cr in Agat-/- significant when considering the drop in IgG? The data should be normalized to the respective IgG control.

      We measured the Cr concentration in the whole brain lysates using Creatine Assay Kit (Sigma, MAK079). Cr levels in the brain were reduced in Agat-/- mice. The Cr concentration in AGAT-/- mice was reduced to about 1/10 of AGAT+/+ and AGAT+/- mice (Author response image 1).

      Author response image 1.

      Cr concentration in brain from AGAT+/+, AGAT+/- and AGAT-/- mice (n=5 male mice for each group). , p<0.05, **, p<0.001, one-way ANOVA with Tukey’s correction.

      As pointed by the reviewer, the decrease in Cr IgG in Agat-/- seems similar to the corresponding decrease in Syp (Fig. 3B in the paper). Cr pulled down by IgG was 0.46 ± 0.04, 0.37 ± 0.06 and 0.17 ±0.03 pmol/μg anti-syp antibody for Agat+/+, Agat+/-, and Agat-/- mice respectively. There was a trend of reduction Cr IgG in Agat-/-, however, there were no statistically significant differences between Agat-/- and Agat+/+, or between Agat-/- and Agat+/-, as determined by one-way ANOVA (Fig. 3B in the paper). Due to the fact that Agat-/- reduced Cr concentration in the brain, we speculate that the apparent drop in Cr pulled down by IgG may have partially resulted from the overall reduction of Cr content in the brain.

      The absolute content of Cr pulled down by Syp in Agat-/- mice was reduced to 21.6% of Agat+/+ mice and 23.6% of Agat+/- mice (Fig. 3B in the paper). As suggested by the reviewer, we normalized the Cr pulled down by Syp to the respective IgG control (Author response image 2). The normalized Cr content in AGAT-/- mice has a tendency to decrease, but not statistically significant, as compared to Agat+/+ and Agat+/- mice (n=10 for each group, one-way ANOVA).

      Author response image 2.

      Normalized Cr content in brain from AGAT+/+, AGAT+/- and AGAT-/- mice (n=10 for each group). Cr pulled down by anti-Syp antibody was normalized to that of IgG.

      2) The data supporting that depolarization-induced Cr release is SLC6A8 dependent is not convincing because the relative increase in KCl-induced Cr release is similar between SLC6A8-/Y and SLC6A8+/Y (Fig. 5D). The data should be also normalized to the respective controls.

      As suggested by the reviewer, we normalized the Cr release during KCl stimulation to the baseline (Author response image 3). The ratio of Cr release evoked by high KCl stimulation to the baseline was similar in WT and Slc6a8 knockouts. This suggests that Cr is not released through SLC6A8 transporter.

      Author response image 3.

      Normalized Cr release from slices from Slc6a8+/Y and Slc6a8-/Y mice (n=7 slices for each group). Cr released evoked by high KCl stimulation was normalized to baseline.

      However, without Slc6a8, KCl-induced release of Cr was significantly reduced (Figure 5D in the paper). This is because Slc6a8 is a transporter to Cr uptake into synaptic terminals (Figure 5D and 8C in the paper). Therefore, Cr content in SVs (Figure 2C in the paper) indirectly reduced Cr release.

      3) The majority (almost 3/4) of depolarization-induced Cr release is Ca2+ independent (Fig. 5G). Furthermore, KCl-induced, Ca2+-independent release persists in SLC6A8-/Y (Fig. 5G). What is the model for Ca2+-independent Cr release? Why is there Ca2+-independent Cr release from SLC6A8 KO neurons? How does this relate to the prominent decrease in Ca2+-dependent Cr release in SLC6A8-/Y (Fig. 5G)? They show a prominent decrease in Cr control levels in SLC6A8-/Y in Fig. 5D. Were the data shown in Fig. 5D obtained in the presence or absence of Ca2+? Could the decrease in Ca2+-dependent Cr release in SLC6A8-/Y (Fig. 5G) be due to decreased Cr baseline levels in the presence of Ca2+ (Fig. 5D)?

      These are interesting questions that, at this point, could only be answered by references to literature. For example, one possibility was that Ca2+-independent Cr release might occurs in glia, since as pointed by the reviewer in Point 6, high GAMT levels were reported for astrocytes and oligodendrites (Schmidt et al. 2004; Rosko et al. 2023). As reported, other neuromodulators such as taurine can be released from astrocytes (Philibert, Rogers, and Dutton 1989) or slices (Saransaari and Oja 2006) in Ca2+ independent manner. In addition, in the absence of potassium stimulation, Ca2+ depletion lead to increased release of taurine in cultured astrocytes (Takuma et al. 1996) or in striatum in vivo (Molchanova, Oja, and Saransaari 2005). Similarly, in SLC6A8 KO slices, Ca2+ depletion (Figure 5G) also increased creatine baseline levels as compared to that in normal ACSF (Figure 5D). Another possibility was that Ca2+-independent Cr release might occurs in neurons lacking SLC6a8 expression.

      As mentioned in the paper, data shown in Figure 5D was obtained in the presence Ca2+. Reduction of Ca2+-dependent Cr release evoked by potassium in SLC6A8-/Y (Figure 5G) may be due to decreased Cr baseline levels in the presence of Ca2+ and reduced Cr in synaptic vesicles (Figure 5D).

      4) Cr levels are strongly reduced in Agat-/- (Figure 6B). However, KCl-induced Cr release persists after loss of AGAT (Figure 6B). These data do not support that Cr release is Agat dependent.

      Although KCl-induced Cr release persisted in AGAT-/- mutants, it was dropped to 11.6% of WT mice (Figure 6B). AGAT is not directly involved in the release, but required for providing sufficient Cr.

      5) The authors show that Cr application decreases excitability in ~1/3 of the tested neurons (Figure 7). How were responders and non-responders defined? What justifies this classification? The data for all Cr-treated cells should be pooled. Are there indeed two distributions (responders/non-responders)? Running statistics on pre-selected groups (Figure 7H-J) is meaningless. Given that the effects could be seen 2-8 minutes after Cr application - at what time points were the data shown in Figure 7E-J collected? Is the Cr group shown in Figure 7F significantly different from the control group/wash?

      The responders were defined by three criteria: (1) When Cr was applied, the rheobase was increased as compared to both control and wash conditions. (2) The number of total evoked spikes was decreased during Cr application than both control and wash. (3) The number of total evoked spikes was decreased at least by 10% than control or wash.

      For all the individual responders, when Cr was applied, the rheobase was increased (Figure 7E and 7F). While in individual non-responders, the rheobase was either identical to both control and wash (n=19/35), identical to either control or wash (n=11/35), between control and wash (n=2/35) or smaller than both control and wash (n=3/35) following Cr application. Thus, the responders and non-responders were separatable. When the rheobase data were pulled together, many points were overlapped, so we did not pull the data here.

      As suggested, we pulled the data of the ratio of spike changes in response to 100 μM Cr application for all neurons together (Author response image 4). Evoked spikes of non-responders were typically (34/35) changed in the range of -10% to 10%.

      Author response image 4.

      Relative changes of total evoked spikes in response to 100 μM Cr. Responders are represented by red dots and non-responders by black dots. Dashed black line indicates 10%. Relative change = (Cr-(Control +wash)/2)/((Control +wash)/2)*100%.

      In Figure 7E-J, we collected data at time points when the maximal response was reached. The Cr group shown in Figure 7F was indeed significantly different from the control group/wash (p<0.05, paired t test, for data points collected under 75-500 pA current injection).

      6) Indirect effects: The phenotypes could be partially caused by indirect effects of perturbing the Cr/PCr/CK system, which is known to play essential roles in ATP regeneration, Ca2+ homeostasis, neurotransmission, intracellular signaling systems, axonal and dendritic transport... Similarly, high GAMT levels were reported for astrocytes (e.g., Schmidt et al. 2004; doi: 10.1093/hmg/ddh112), and changes in astrocytic Cr may underlie the phenotypes. Cr has been also reported to be an osmolyte: a hyperosmotic shock of astrocytes induced an increase in Cr uptake, suggesting that Cr can work as a compensatory osmolyte (Alfieri et al. 2006; doi: 10.1113/jphysiol.2006.115006). Potential indirect effects are also consistent with a trend towards decreased KCl-induced GABA (and Glutamate) release in SLC6A8-/Y (Figure 5C). These indirect effects may in part explain the phenotypes seen after perturbing Agat, SLC6A8, and should be thoroughly discussed.

      We discussed the possibility of creatine/phosphocreatine as non-transmitters in discussion part. We added the possibility of astrocytic Cr in discussion part. KCl-induced GABA (and Glutamate) release in SLC6A8-/Y (Figure 5C) was not significant.

      7) As stated by the authors, there is some evidence that Cr may act as a co-transmitter for GABAA receptors (although only at high concentrations). Would a GABAA blocker decrease the fraction of cells with decreased excitability after Cr exposure?

      We performed another experiment in CA1 pyramidal neurons in hippocampus showing that Cr at 100 μM did not change GABAergic neurotransmission (n=8, Author response image 5). Inhibitory postsynaptic currents (IPSCs) recorded in the presence of glutamate receptor blockers (10 μM APV and 10 μM CNQX) were not changed by 100 μM creatine in hippocampal CA1 pyramidal neurons (Bgroup data of IPSC frequency (B) and amplitude (C) averaged in 1 min duration). These did not support Cr activation of GABAA receptors.

      Author response image 5.

      IPSCs recorded in in hippocampal CA1 pyramidal neurons. (A) representative raw traces before (Control), during (Creatine) and after (Wash) the application of 100 μM creatine. (B&C) group data of IPSC frequency (B) and amplitude (C) averaged in 1 min duration.

      8) The statement "Our results have also satisfied the criteria of Purves et al. 67,68, because the presence of postsynaptic receptors can be inferred by postsynaptic responses." (l.568) is not supported by the data and should be removed.

      We have deleted this sentence, though what could mediate postsynaptic responses other than receptors?

      Reviewer #3 (Public Review):

      SUMMARY:

      The manuscript by Bian et al. promotes the idea that creatine is a new neurotransmitter. The authors conduct an impressive combination of mass spectrometry (Fig. 1), genetics (Figs. 2, 3, 6), biochemistry (Figs. 2, 3, 8), immunostaining (Fig. 4), electrophysiology (Figs. 5, 6, 7), and EM (Fig. 8) in order to offer support for the hypothesis that creatine is a CNS neurotransmitter.

      We thank the reviewer for the summary.

      STRENGTHS:

      There are many strengths to this study.

      • The combinatorial approach is a strength. There is no shortage of data in this study.

      • The careful consideration of specific criteria that creatine would need to meet in order to be considered a neurotransmitter is a strength.

      • The comparison studies that the authors have done in parallel with classical neurotransmitters are helpful.

      • Demonstration that creatine has inhibitory effects is another strength.

      • The new genetic mutations for Slc6a8 and AGAT are strengths and potentially incredibly helpful for downstream work.

      WEAKNESSES:

      • Some data are indirect. Even though Slc6a8 and AGAT are helpful sentinels for the presence of creatine, they are not creatine themselves. Therefore, the conclusions that are drawn should be circumspect.

      SLC6A8 and AGAT mutants are not essential for Cr’s role as a neurotransmitter.

      • Regarding Slc6a8, it seems to work only as a reuptake transporter - not as a transporter into SVs. Therefore, we do not know what the transporter is.

      Indeed, SLC6A8 is only a transporter on the cytoplasmic membrane, not a transporter on synaptic vesicles. We have shown biochemistry here, and we have unpublished data that showed other SLCs on SVs, which did not include SLC6A8.

      • Puzzlingly, Slc6a8 and AGAT are in different cells, setting up the complicated model that creatine is created in one cell type and then processed as a neurotransmitter in another.

      • No candidate receptor for creatine has been identified postsynaptically.

      • Because no candidate receptor has been identified, is it possible that creatine is exerting its effects indirectly through other inhibitory receptors (e.g., GABAergic Rs)?

      As shown in our response to Question 7 of Reviewer 2, Cr did not exert its effects through inhibitory GABAA receptors.

      • More broadly, what are the other possibilities for roles of creatine that would explain these observations other than it being a neurotransmitter? Could it simply be a modifier that exists in the SVs (lots of molecules exist in SVs)?

      We discussed the possibility of a non-transmitter role for creatine/phosphocreatine in discussion part.

      • The biochemical studies are helpful in terms of comparing relevant molecules (e.g., Figs. 8 and S1), but the images of the westerns are all so fuzzy that there are questions about processing and the accuracy of the quantification.

      Multiple members (>4) have carried out SV purifications repeatedly over the last decade in our group, we are highly confident of SV purifications presented in Figs. 8 and S1.

      There are several criteria that define a neurotransmitter. The authors nicely delineated many criteria in their discussion, but it is worth it for readers to do the same with their own understanding of the data.

      By this reviewer's understanding (and the Purves' textbook definition) a neurotransmitter: 1) must be present within the presynaptic neuron and stored in vesicles; 2) must be released by depolarization of the presynaptic terminal; 3) must require Ca2+ influx upon depolarization prior to release; 4) must bind specific receptors present on the postsynaptic cell; 5) exogenous transmitter can mimic presynaptic release; 6) there exists a mechanism of removal of the neurotransmitter from the synaptic cleft.

      6 criteria seem to be only required by the reviewer. As discussed in our Discussion part, Purves’ textbook did not list 6 criteria but only three criteria, “the substance must be present within the presynaptic neuron; the substance must be released in response to presynaptic depolarization, and the release must be Ca2+ dependent; specific receptors for the substance be present on the postsynaptic cell” (Purves et al., 2001, 2016).

      Kandel et al. (2013, 2021) listed 4 criteria for a neurotransmitter: “it is synthesized in the presynaptic neuron; it is present within vesicles and is released in amounts sufficient to exert a defined action on the postsynaptic neuron or effector organ; when administered exogenously in reasonable concentrations it mimics the action of the endogenous transmitter; a specific mechanism usually exists for removing the substance from the synaptic cleft”.

      While we agree that any neuroscientist can have his/her own criteria, it is more reasonable to accept the textbooks that have been widely read for decades.

      For a paper to claim that the work has identified a new neurotransmitter, several of these criteria would be met - and the paper would acknowledge in the discussion which ones have not been met. For this particular paper, this reviewer finds that condition 1 is clearly met.

      Conditions 2 and 3 seem to be met by electrophysiology, but there are caveats here. High KCl stimulation is a blunt instrument that will depolarize absolutely everything in the prep all at once and could result in any number of non-specific biological reactions as a result of K+ rushing into all neurons in the prep. Moreover, the results in 0 Ca2+ are puzzling. For creatine (and for the other neurotransmitters), why is there such a massive uptick in release, even when the extracellular saline is devoid of calcium?

      To avoid the disadvantage of high KCl stimulation, we performed optogenetic experiments recently, with encouraging preliminary data. We do not know the source of Ca2+-independent release of Cr and neurotransmitters, though astrocytes are a possibility.

      Condition 4 is not discussed in detail at all. In the discussion, the authors elide the criterion of receptors specified by Purves by inferring that the existence of postsynaptic responses implies the existence of receptors. True, but does it specifically imply the existence of creatinergic receptors? This reviewer does not think that is necessarily the case. The authors should be appropriately circumspect and consider other modes of inhibition that are induced by activation or potentiation of other receptors (e.g., GABAergic or glycinergic).

      Our results did not support Cr stimulation of inhibitory GABAA receptors (see our answer to Point 7 in of Reviewer 2).

      Condition 5 may be met, because the authors applied exogenous creatine and observed inhibition (Fig. 7). However, this is tough to know without understanding the effects of endogenous release of creatine. if they were to test if the absence of creatine caused excess excitation (at putative creatinergic synapses), then that would be supportive of the same.

      After the submission of our manuscript, we found a recent paper showing that slc6a8 knockout led to increased excitation in pyramidal neurons in the prefrontal cortex (PFC), with increased firing frequency (Ghirardini et al., 2023). Because we have shown that slc6a8 knockout would cause decrease of Cr in SVs (Figure 2 in our paper), this result provide the evidence described as Condition 5 of this reviewer: that decrease of Cr in SVs led to excess excitation.

      For condition 6, the authors made a great effort with Slc6a8. This is a very tough criterion to understand for many synapses and neurotransmitters.

      In terms of fundamental neuroscience, the story would be impactful if proven correct. There are certainly more neurotransmitters out there than currently identified.

      The impact as framed by the authors in the abstract and introduction for intellectual disability is uncertain (forming a "new basis for ID pathogenesis") and it seems quite speculative beyond the data in this paper.

      We deleted this sentence.

      Reviewer #1 (Recommendations For The Authors):

      To strengthen the manuscript, I suggest the following considerations:

      1) The key missing evidence to my mind is a receptor - but this is clearly outside the scope of this paper. Yet, I am surprised that in the list of criteria for neurotransmitters in general there is no mention of a receptor. Furthermore, many receptors have been identified through receptor agonists or antagonists, like neurotoxins or drugs. The authors do not talk about putative receptors except for a sentence in the discussion where they speculate on a GPCR. There are numerous GPCR agonists and antagonists, which may be a long-shot, or something even a bit more designed based on knowledge about creatine? I do not think the publication of this manuscript should have been made dependent on finding an agonist or antagonist of this specific unknown receptor (if it exists), but it would be good to have at least some leads on this from the authors what has been tried or what could be done? How about a manipulation of G-protein-coupled signal transduction to support the idea that there IS such a GPCR? There may be a real opportunity here to test existing compounds in wild type, the slc6a8 and agat mutants.

      We will keep trying, but accept the reality that Rome was not built in a single day and that no transmitter was proven by one single paper.

      A key new puzzle piece of evidence is the identification of creatine in synaptic vesicles. The experiment relies heavily on the purity of the SV fraction using the anti-synaptophysin antibody. I am quite sure that these preparations contain many other compartments - and of course a big mix of synaptic (and other) vesicles. Would it be possible to purify with an anti slc6a8 antibody?

      Sl6a8 is expressed in on the plasma membrane of neurons7-9, instead of synaptic vesicles. Consistent with this, we could not detect obvious Slc6a8-HA signal in our starting material (Lane S in Author response image 6) that was used for SV purification. We have tried to purify SVs by HA antibody in Slc6a8 mice and SV markers could not be detected.

      Author response image 6.

      Lack of Slc6a8-HA in our starting material. In Slc6a8-HA knock-in mice, the HA signal was present in whole brain homogenate (H), but not obvious in supernatants (S) following 35000 × centrifugation. In contrast, SV marker Syp was present in supernatants.

      The K stimulation protocol in slices is relatively crude, as all neurons in the slice get simultaneously overactivated - and some of the effects on Ca-dependent release are not very strong (e.g. the 35 neurons that were not responsive to creatine at all). A primary neuronal culture of neurons that respond to creatine would strengthen this section.

      To avoid the disadvantage of K stimulation, we also performed optogenetic experiments recently and obtained encouraging preliminary results.

      Reviewer #2 (Recommendations For The Authors):

      1) The different sections of the manuscript are not separated by headers.

      2) The beginning of the results section either does not reference the underlying literature or refers to unpublished data.

      We have kept a bit background in the beginning of the Results section.

      3) The text contains many opinions and historical information that are not required (e.g., "It has never been easy to discover a new neurotransmitter, especially one in the central nervous system (CNS). We have been searching for new neurotransmitters for 12 years."; l. 17).

      This is a field that has been dormant for decades and such background introductions are helpful for at least some readers.

      4) Almeida et al. (2008; doi: 10.1002/syn.20280) provided evidence for electrical activity-, and Ca2+-dependent Cr release from rat brain slices. This paper should be introduced in the introduction.

      Those were stand-alone papers which have not been reproduced or paid attention to. Our introduction part did not mention them because our research did not begin with those papers. We had no idea that those papers existed when we began. We started with SV purification and only read those papers afterwards. Thus, they were not necessary background to our paper but can be discussed after we discovered Cr in SVs.

      5) Fig. 7: A Y-scale for the stimulation protocol is missing.

      Revised.

      Reviewer #3 (Recommendations For The Authors):

      The main suggestion by this reviewer (beyond the details in the public review) is to consider the full spectrum of biology that is consistent with these results. By my reading, creatine could be a neurotransmitter, but other possibilities also exist, and the authors need to highlight those too.

      We have discussed non-transmitter role in the discussion.

      References

      Ghirardini, E., G. Sagona, A. Marquez-Galera, F. Calugi, C. M. Navarron, F. Cacciante, S. Chen, F. Di Vetta, L. Dada, R. Mazziotti, L. Lupori, E. Putignano, P. Baldi, J. P. Lopez-Atalaya, T. Pizzorusso, and L. Baroncelli. 2023. Cell-specific vulnerability to metabolic failure: the crucial role of parvalbumin expressing neurons in creatine transporter deficiency. Acta Neuropathol Commun, 11: 34. doi: 10.1186/s40478-023-01533-w.

      Lowe, M. T., Faull, R. L., Christie, D. L. & Waldvogel, H. J. Distribution of the creatine transporter throughout the human brain reveals a spectrum of creatine transporter immunoreactivity. J Comp Neurol 523, 699-725 (2015). https://doi.org:10.1002/cne.23667

      Mak, C. S. et al. Immunohistochemical localisation of the creatine transporter in the rat brain. Neuroscience 163, 571-585 (2009). https://doi.org:10.1016/j.neuroscience.2009.06.065.

      Molchanova, S. M., Oja, S. S. & Saransaari, P. Mechanisms of enhanced taurine release under Ca2+ depletion. Neurochem Int 47, 343-349 (2005). https://doi.org:10.1016/j.neuint.2005.04.027

      Philibert, R. A., Rogers, K. L. & Dutton, G. R. K+-evoked taurine efflux from cerebellar astrocytes: on the roles of Ca2+ and Na+. Neurochem Res 14, 43-48 (1989). https://doi.org:10.1007/BF00969756

      Rosko, L. M. et al. Cerebral Creatine Deficiency Affects the Timing of Oligodendrocyte Myelination. J Neurosci 43, 1143-1153 (2023). https://doi.org:10.1523/JNEUROSCI.2120-21.2022

      Saransaari, P. & Oja, S. S. Characteristics of taurine release in slices from adult and developing mouse brain stem. Amino Acids 31, 35-43 (2006). https://doi.org:10.1007/s00726-006-0290-5

      Schmidt, A. et al. Severely altered guanidino compound levels, disturbed body weight homeostasis and impaired fertility in a mouse model of guanidinoacetate N-methyltransferase (GAMT) deficiency. Hum Mol Genet 13, 905-921 (2004). https://doi.org:10.1093/hmg/ddh112

      Speer, O. et al. Creatine transporters: a reappraisal. Mol Cell Biochem 256-257, 407-424 (2004). https://doi.org:10.1023/b:mcbi.0000009886.98508.e7

      Takuma, K. et al. Ca2+ depletion facilitates taurine release in cultured rat astrocytes. Jpn J Pharmacol 72, 75-78 (1996). https://doi.org:10.1254/jjp.72.75

    1. Author Response

      The following is the authors’ response to the previous reviews.

      eLife assessment

      This valuable paper examines gene expression differences between male and female individuals over the course of flower development in the dioecious angiosperm Trichosantes pilosa. The authors show that male-biased genes evolve faster than female-biased and unbiased genes. This is frequently observed in animals, but this is the first report of such a pattern in plants. In spite of the limited sample size, the evidence is mostly solid and the methods appropriate for a non-model organism. The resources produced will be used by researchers working in the Cucurbitaceae, and the results obtained advance our understanding of the mechanisms of plant sexual reproduction and its evolutionary implications: as such they will broadly appeal to evolutionary biologists and plant biologists.

      Public Reviews:

      Reviewer #1 (Public Review):

      The evolution of dioecy in angiosperms has significant implications for plant reproductive efficiency, adaptation, evolutionary potential, and resilience to environmental changes. Dioecy allows for the specialization and division of labor between male and female plants, where each sex can focus on specific aspects of reproduction and allocate resources accordingly. This division of labor creates an opportunity for sexual selection to act and can drive the evolution of sexual dimorphism.

      In the present study, the authors investigate sex-biased gene expression patterns in juvenile and mature dioecious flowers to gain insights into the molecular basis of sexual dimorphism. They find that a large proportion of the plant transcriptome is differentially regulated between males and females with the number of sex-biased genes in floral buds being approximately 15 times higher than in mature flowers. The functional analysis of sex-biased genes reveals that chemical defense pathways against herbivores are up-regulated in the female buds along with genes involved in the acquisition of resources such as carbon for fruit and seed production, whereas male buds are enriched in genes related to signaling, inflorescence development and senescence of male flowers. Furthermore, the authors implement sophisticated maximum likelihood methods to understand the forces driving the evolution of sex-biased genes. They highlight the influence of positive and relaxed purifying selection on the evolution of male-biased genes, which show significantly higher rates of non-synonymous to synonymous substitutions than female or unbiased genes. This is the first report (to my knowledge) highlighting the occurrence of this pattern in plants. Overall, this study provides important insights into the genetic basis of sexual dimorphism and the evolution of reproductive genes in Cucurbitaceae.

      Reviewer #2 (Public Review):

      Summary:

      This study uses transcriptome sequence from a dioecious plant to compare evolutionary rates between genes with male- and female-biased expression and distinguish between relaxed selection and positive selection as causes for more rapid evolution. These questions have been explored in animals and algae, but few studies have investigated this in dioecious angiosperms, and none have so far identified faster rates of evolution in male-biased genes (though see Hough et al. 2014 https://doi.org/10.1073/pnas.1319227111).

      Strengths:

      The methods are appropriate to the questions asked. Both the sample size and the depth of sequencing are sufficient, and the methods used to estimate evolutionary rates and the strength of selection are appropriate. The data presented are consistent with faster evolution of genes with male-biased expression, due to both positive and relaxed selection.

      This is a useful contribution to understanding the effect of sex-biased expression in genetic evolution in plants. It demonstrates the range of variation in evolutionary rates and selective mechanisms, and provides further context to connect these patterns to potential explanatory factors in plant diversity such as the age of sex chromosomes and the developmental trajectories of male and female flowers.

      Weaknesses:

      The presence of sex chromosomes is a potential confounding factor, since there are different evolutionary expectations for X-linked, Y-linked, and autosomal genes. Attempting to distinguish transcripts on the sex chromosomes from autosomal transcripts could provide additional insight into the relative contributions of positive and relaxed selection.

      Reviewer #3 (Public Review):

      The potential for sexual selection and the extent of sexual dimorphism in gene expression have been studied in great detail in animals, but hardly examined in plants so far. In this context, the study by Zhao, Zhou et al. al represents a welcome addition to the literature.

      Relative to the previous studies in Angiosperms, the dataset is interesting in that it focuses on reproductive rather than somatic tissues (which makes sense to investigate sexual selection), and includes more than a single developmental stage (buds + mature flowers).<br /> Some aspects of the presentation have been improved in this new version of the manuscript.

      Specifically:

      • the link between sex-biased and tissue-biased genes is now slightly clearer,

      • the limitation related to the de novo assembled transcriptome is now formally acknowledged,

      • the interpretation of functional categories of the genes identified is more precise,

      • the legends of supplementary figures have been improved - a large number of typos have been fixed.

      in response to this first round of reviews. As I detail below, many of the relevant and constructive suggestions by the previous reviewers were not taken into account in this revision.

      For instance:

      • Reviewer 2 made precise suggestions for trying to take into account the potential confounding factor of sex-chromosomes. This suggestion was not followed.

      For the question of reviewer 2:

      The presence of sex chromosomes is a potential confounding factor, since there are different evolutionary expectations for X-linked, Y-linked, and autosomal genes. Attempting to distinguish transcripts on the sex chromosomes from autosomal transcripts could provide additional insight into the relative contributions of positive and relaxed selection.

      Empirically, the analyses could be expanded by an attempt to distinguish between genes on the autosomes and the sex chromosomes. Genotypic patterns can be used to provisionally assign transcripts to XY or XX-like behavior when all males are heterozygous and all females are homozygous (fixed X-Y SNPs) and when all females are heterozygous and males are homozygous (lost or silenced Y genes). Comparing such genes to autosomal genes with sex-biased expression would sharpen the results because there are different expectations for the efficacy of selection on sex chromosomes. See this paper (Hough et al. 2014; https://www.pnas.org/doi/abs/10.1073/pnas.1319227111), which should be cited and does in fact identify faster substitution rates in Y-linked genes.

      Authors’ response: We have cited Hough et al. (2014) and Sandler et al. (2018) in the revised manuscript. We agree that the presence of sex chromosomes is potentially a confounding factor. By adopting methods in Hough et al. (2014) and Sandler et al. (2018), we tried to distinguish transcripts on sex chromosomes from autosomal chromosomes. For a total of 2,378 unbiased genes, we found that 36 genes were putatively sex chromosomal genes, 20 of which were exclusively heterozygous and homozygous for males and females, respectively; while the other 16 genes showing an opposite genotyping patterns between males and females. For 343 male-biased genes, only three ones exhibit a pattern of potentially sex-linked. For the 1,145 female-biased genes, we identified 19 genes which might located on the sex chromosomes. Among the 19 genes, five genes were exclusively heterozygous for males and exclusively homozygous for females, while reversed genotyping patterns presented in the other 14 genes. So, sex-linked genes may contribute relatively little to rapid evolution of male-biased genes. An alternative explanation is that the results could be unreliable due to small sample sizes. Thus, we did not describe them in the Results section. We will investigate the issue when whole genome sequences and population datasets become available in the near future.

      • Reviewer 1 & 3 indicated that results were mentioned in the discussion section without having been described before. This was not fixed in this new version.

      For the question of reviewer 1:

      2) Paragraph (407-416) describes the analysis of duplicated genes under relaxed selection but there is no mention of this in the results.

      Authors’ response: Following this suggestion, in the Results section, we have added a sentence, “We also found that most of them were members of different gene families generated by gene duplication (Table S13)” on line 310-311 in the revised manuscript (Rapid_evolution_of_malebiased_genes_Trichosanthes_pilosa_Tracked_change_2023_11_06.docx).

      For the question of reviewer 1:

      38- line 417-424. The discussion should not contain new results.

      Authors’ response: Thank you for pointing out this. In the Results section, we have added a few sentences as following: “Similarly, given that dN/dS values of sex-biased genes were higher due to codon usage bias, lower dS rates would be expected in sex-biased genes relative to unbiased genes (Ellegren & Parsch, 2007; Parvathy et al., 2022). However, in our results, the median of dS values in male-biased genes were much higher than those in female-biased and unbiased genes in the results of ‘free-ratio’ (Fig. S4A, female-biased versus male-biased genes, P = 6.444e-12 and malebiased versus unbiased genes, P = 4.564e-13) and ‘two-ratio’ branch model (Fig. S4B, femalebiased versus male-biased genes, P = 2.2e-16 and male-biased versus unbiased genes, P = 9.421e08, respectively). ” on line 323-331, and consequently, removed the following sentence, “femalebiased vs male-biased genes, P = 6.444e-12 and male-biased vs unbiased genes, P = 4.564e-13” and “female-biased versus male-biased genes, P = 2.2e-16 and male-biased versus unbiased genes, P = 9.421e-08, respectively” in the Discussion section.

      • Reviewer 1 asked for a comparison between the number of de novo assembled unigenes in this transcriptome and the number of genes in other Cucurbitaceae species. I could not see this comparison reported.

      Authors’ response: In the first revision, we described only percentages. We have now added the number of genes. We modify this part as follows: “The majority of unigenes were annotated by homologs in species of Cucurbitaceae (61.6%, 36,375), including Momordica charantia (16.3%, 9,625), Cucumis melo (11.9%, 7,027), Cucurbita pepo (11.9%, 7,027), Cucurbita moschata (11.5%, 6,791), Cucurbita maxima (10.1%, 5,964) and other species (38.4%, 22,676) (Fig. S1C).”.

      • Reviewer 1 pointed out that permutation tests were more appropriate, but no change was made to the manuscript.

      Authors’ response: Thank you for your suggestion. In the first revision, we have indirectly responded to the issues. Wilcoxon rank sum test is more commonly used for all comparisons between sex-biased and unbiased genes in many papers. Additionally, we tested datasets using permutation t-tests, which is consistent with the results of Wilcoxon rank sum test. For example, we found that only in floral buds, there are significant differences in ω values in the results of ‘free-ratio’ (female-biased versus male-biased genes, P = 0.04282 and male-biased versus unbiased genes, P = 0.01114) and ‘two-ratio’ model (female-biased versus male-biased genes, P = 0.01992 and male-biased versus unbiased genes, P = 0.02127, respectively). We also described these results in the Results section accordingly (line 278-284).

      • Reviewer 3 pointed out the small sample size (both for the RNA-seq and the phylogenetic analysis), but again this limitation is not acknowledged very clearly.

      Authors’ response: Sorry, we acknowledged that our sample size was relatively small. In the revised version, we have added a sentence as follows, “Additionally, our sample size is relatively small, and may provide low power to detect differential expression.” in the Discussion section.

      • Reviewer 1 & 3 pointed out that Fig 3 was hard to understand and asked for clarifications that I did not see in the text and the figure in unchanged.

      Authors’ response: Thank you for your suggestions. We have revised the manuscript to clarify the meaning of the acronym (F1TGs, F2TGs, M1TGs, M2TGs, F1BGs, F2BGs, M1BGs and M2BGs) and presented the number of genes. We have added two labels, indicating that panels A and B correspond to males and C and D to females in Fig. 3.

      • Reviewer 3 suggested to combine all genes with sex-bias expression when evaluating the evolutionary rate, in addition to the analyses already done. This suggestion was not followed.

      For the question of reviewer 3:line 196 and following: In these analyses, I could not understand the rationale for keeping buds vs mature flowers as separate analyses throughout. Why not combine both and use the full set of genes showing sex-bias in any tissue? This would increase the power and make the presentation of the results a lot more straightforward.

      Authors’ response: Thank you for your suggestions. In the first revision, we tried to respond to the issues. First, we observed strong sexual dimorphism in floral buds, such as racemose versus solitary, early-flowering versus late-flowering. Second, as you pointed out earlier, “the dataset is interesting in that it focuses on reproductive rather than somatic tissues (which makes sense to investigate sexual selection), and includes more than a single developmental stage (buds + mature flowers)”, we totally agree with you on this point. Third, according to your suggestions, we combined all genes with sex-bias expression to evaluate the evolutionary rates. We found significant differences (please see a Figure below) in ω values in the results of ‘free-ratio’ (female-biased versus male-biased genes, P =0.005622 and male-biased versus unbiased genes, P = 0.001961) and ‘two-ratio’ model (female-biased versus male-biased genes, P = 0.008546 and male-biased versus unbiased genes, P = 0.009831, respectively) using Wilcoxon rank sum test. However, the significance is lower than previous results in floral buds due to sex-biased genes of mature flower joined, especially compared to the results of “free-ratio model”. Additionally, we also test all combined genes with sex-bias expression using permutation t-test. Unfortunately, there are no significant differences in ω values expect for male-biased versus unbiased genes in the results of ‘free-ratio’ model (P = 0.03034) and ‘two-ratio’ model (P = 0.0376), respectively. To a certain extent, the combination of all genes with sex-bias expression may cover the signals of rapid evolution of sex-biased genes in floral buds. Therefore, these results are not described in our manuscript. In the near future, we would like to make further investigations through more development stages of flowers and new technologies (e.g. Single-Cell method, See Murat et al., 2023) in each sex to consolidate the conclusion, and it is hoped that we could find more meaningful results.

      Author response image 1.

      • Reviewer 3 pointed out that hand-picking specific categories of genes was not statistically valid, and in fact not necessary in the present context. This was not changed.

      For the question of reviewer3: removing genes on a post-hoc basis seems statistically suspicious to me. I don't think your analysis has enough power to hand-pick specific categories of genes, and it is not clear what this brings here. I suggest simply removing these analyses and paragraphs.

      Authors’ response: Thank you for your suggestions. We have changed them accordingly. We removed a part of the following paragraph, “To confirm the contributions of positive selection and relaxed selection to rapid rates of male-biased genes in floral buds, we generated three datasets of OGs by excluding different sets of genes. Specifically, we excluded 18 relaxed selective male-biased genes (5.23%), 98 positively selected male-biased genes (28.57%), and 112 male-biased genes (32.65%) under positive and relaxed selection from 343 OGs (Fig. S4). We observed that after excluding male-biased genes under relaxed purifying selection, the median (0.264) decreased by 0.34% compared to the median (0.265) of all OGs (Fig. S4A-B). However, after excluding positively selected male-biased genes, the median (0.236) was reduced by 11% (Fig. S4A, C) in the results of ‘free-ratio’ branch model. This pattern was consistent with the results of ‘two-ratio’ branch model as well (Fig. S4E-G).” on line 290 to 300.

      However, we kept the following paragraph, “We also analyzed female-biased and unbiased genes that underwent positive and relaxed selection in floral buds (Tables S6-S10). We identified 216 (18.86%) positively selected, and 69 (6.03%) relaxed selective female-biased genes from 1,145 OGs, respectively. Similarly, we found 436 (18.33%) positively selected, and 43 (1.81%) unbiased genes under relaxed selection from 2,378 OGs, respectively. Notably, male-biased genes have a higher proportion (10%) of positively selected genes compared to female-biased and unbiased genes. However, relaxed selective male-biased genes have a higher proportion (3.24%) than unbiased genes, but about 0.8% lower than that of female-biased genes.”. In this way, we can compare the proportion of sex-biased genes that have undergone positive selection and release selection among female-biased genes, unbiased genes and male-biased genes in floral buds in the Discussion section.

      • Reviewer 1 asked for all data to be public, but I could not find in the manuscript where the link to the data on ResearchGate was provided.

      Authors’ response: We have added a link in the Data Availability section.

      • Reviewers 1 & 3 pointed out that since only two tissues were compared, the claims on pleiotropy should have been toned down, but no change was made to the text.

      Authors’ response: Thank you for your suggestions. We revised “due to low pleiotropic constraints” to “due to low evolutionary constraints” and revised “low pleiotropy” to “low constraints”.

      • Reviewer 1 asked for a clarification on which genes are plotted on the heatmap of Fig3C and an explanation of the color scale. No change was made.

      Authors’ response: Sorry for the confusion. Actually, Reviewer 1 asked that “Fig. 2C, which genes are plotted on the heatmap and what is the color scale corresponding to?” In the previous revision, we have revised them (See Fig. 2 Sex-biased gene expression for floral buds and flowers at anthesis in males and females of Trichosanthes pilosa). Sex-biased genes (the union of sex-biased genes in F1, M1, F2 and M2) are plotted on the heatmap. The color gradient represents from high to low (from red to green) gene expression.

      • Reviewer 1 asked for panel B in Fig S5 and S6 to be removed. They are still there. They asked for abbreviations to be explained in the legend of Fig S8. This was not done. They asked for details about columns headers. Such detailed were not added. They asked for more recent references on line 53-56: this was not done.

      Authors’ response: We have removed panel B in Fig. S5 and S6. We explained abbreviations in text and Fig. S8. We added more details about the column headers in Supplementary Table S4, S5, S6, S7, S8, S9 and S10. We also added more recent references on line 53-56.

      Recommendations for the authors:

      Reviewer #3 (Recommendations For The Authors):

      Authors’ response: Thank you for your suggestions. We have revised/fixed these issues following your concerns and suggestions.

      Line 46-48 would be clearer as « Sexual dimorphism is the condition where sexes of the same species exhibit different morphological, ecological and physiological traits in gonochoristic animals and dioecious plants, despite male and female individuals sharing the same genome except for sex chromosomes or sex-determining loci »

      Authors’ response: Thanks. We have revised it accordingly.

      Line 50: replace «in both » by «between the two »

      Authors’ response: We have revised it.

      Line 51: « genes exclusively » -> « genes expressed exclusively »

      Authors’ response: We have revised it.

      Line 58: « in many animals » -> « in several animal species »

      Authors’ response: We have revised it to “in some animal species”.

      Line 58: « to which » -> « of this bias »

      Authors’ response: We have revised it.

      Line 64: « Most dioecious plants possess homomorphic sex-chromosomes that are roughly similar in size when viewed by light microscopy. » : a reference is missing

      Authors’ response: We have added the reference.

      Line 67: remove « that »

      Authors’ response: We have revised it.

      line 96: change to: « only the five above-mentioned studies »

      Authors’ response: We have revised it.

      Line 97: remove « the »

      Authors’ response: We have revised it.

      Line 111: « Drosophia » -> Drosophila

      Authors’ response: We have revised it.

      Line 114: exhibiting -> « exhibited »

      Authors’ response: We have revised it.

      Line 115: suggest -> « suggesting »

      Authors’ response: We have revised it.

      Line 117: « studies in plants have rarely reported elevated rates of sex-biased genes » : is it « rarely » or « never » ?

      Authors’ response: We have revised to “never”.

      Line 143: « It’s » -> « Its »

      Authors’ response: We have revised it.

      Line 143-146: say whether the male parts (e.g. anthers) are still present in females flowers, and the female parts (pistil+ ovaries) in the male flowers, or whether these respective organs are fully aborted.

      Authors’ response: We have added the following sentence, “The male parts (e. g., anthers) of female flowers, and the female parts (e. g., pistil and ovaries) of male flowers are fully aborted” in line 148150 of the Introduction section.

      Line 158: this is now clearer, but please specify whether you are talking about 12 floral buds in total, or 12 per individual (i.e. 72 buds in total).

      Authors’ response: We have revised it to “Using whole transcriptome shotgun sequencing, we sequenced floral buds and flowers at anthesis from female and male of dioecious T. pilosa. We set up three biological replicates from three female and three male plants, including 12 samples in total (six floral buds and six flowers at anthesis)”.

      Line 194-198: These sentences are unclear and hard to link to the figure. Consider changing for « In male plants, the number of tissue-biased genes in flowers at anthesis (M2TGs: n = 2795) was higher than that in floral buds (M1TGs: n = 1755, Fig. 3A and 3B). Figure 3 is also very hard to read. Adding a label on the side to indicate that panels A and B correspond to male-biased genes and C and D to female-biased genes could be useful.

      Authors’ response: Thank you for your suggestions. We have revised the text to clarify the meaning of the acronym (F1TGs, F2TGs, M1TGs, M2TGs, F1BGs, F2BGs, M1BGs and M2BGs) and presented the number of genes. We have added two labels, indicating that panels A and B correspond to males and C and D to females in Figure 3.

      Line 208: explain the approach: e.g. « We then compared rates of protein evolution among malebiased, female-biased and unbiased genes. To do this, we sequenced floral bud transcriptomes from the closely related T. anguina, as well as two more distant outgroups, T. kirilowii and Luffa cylindrica. T. kirilowii is a dioecious species like T. pilosa, and the other two are monoecious. We identified one-to-one orthologous groups (OGs) for 1,145 female-biased, 343 male-biased, and 2,378 unbiased genes. »

      Authors’ response: We have revised this paragraph to the following, “We compared rates of protein evolution among male-biased, female-biased and unbiased genes in four species with phylogenetic relationships (((T. anguina, T. pilosa), T. kirilowii), Luffa cylindrica), including dioecious T. pilosa, dioecious T. kirilowii, monoecious T. anguina in Trichosanthes, together with monoecious Luffa cylindrica. To do this, we sequenced transcriptomes of T. pilosa. We also collected transcriptomes of T. kirilowii, as well as genomes of T. anguina and Luffa cylindrica.”

      Line 220: « the same ω value was in all branches » -> « all branches are constrained to have the same ω value ».

      Authors’ response: We have revised it.

      Line 221: « results of the 'two-ratio' branch model ... »

      Authors’ response: We have revised it.

      Line 235: add a few words to explain why the effect size is bigger than for buds, but still is not significant: e.g. «possibly because of limited statistical power due to the low number of sex-biased genes in flowers at anthesis »

      Authors’ response: We have revised this to “However, there is no statistically significant difference in the distribution of ω values using Wilcoxon rank sum tests for female-biased versus male-biased genes (P = 0.0556), female-biased versus unbiased genes (P = 0.0796), and male-biased versus unbiased genes (P = 0.3296) possibly because of limited statistical power due to the low number of sex-biased genes in flowers at anthesis.” in line 260-261.

      Line 255: explain in plain English what the « A model » is. This was already requested in the previous version.

      Authors’ response: We have revised “A model” to “classical branch-site model A”.

      Line 258: explain in plain English what the « foreground 2b ω value » corresponds to

      Authors’ response: We have revised to as follows, “foreground 2b ω value” to “foreground ω >1”. Additionally, we also added the sentence “The classical branch-site model assumes four site classes (0, 1, 2a, 2b), with different ω values for the foreground and background branches. In site classes 2a and 2b, the foreground branch undergoes positive selection when there is ω > 1.” in line 624-627.

      Line 259: explain how these different approaches complement each other rather than being redundant. This was also already requested in the previous version.

      Authors’ response: Sorry. We have now revised it as follows, “As a complementary approach, we utilized the aBSREL and BUSTED methods that are implemented in HyPhy v.2.5 software, which avoids false positive results by classical branch-site models due to the presence of rate variation in background branches, and detected significant evidence of positive selection.” in line 292-295.

      Line 270: remove « dramatically », and also remove « or eliminated at both gene-wide and genomewide levels », as well as « relative to positive selection »

      Authors’ response: Thank you for your suggestions. We have revised it.

      Line 290-309: remove this section - this was already pointed out in the previous reviews as a « ad hoc » procedure, and this point has already been made clear with the RELAX analysis.

      Authors’ response: Thank you for your suggestions. We revised this section accordingly. We remove the following paragraph, “To confirm the contributions of positive selection and relaxed selection to rapid rates of male-biased genes in floral buds, we generated three datasets of OGs by excluding different sets of genes. Specifically, we excluded 18 relaxed selective male-biased genes (5.23%), 98 positively selected male-biased genes (28.57%), and 112 male-biased genes (32.65%) under positive and relaxed selection from 343 OGs (Fig. S4). We observed that after excluding malebiased genes under relaxed purifying selection, the median (0.264) decreased by 0.34% compared to the median (0.265) of all OGs (Fig. S4A-B). However, after excluding positively selected malebiased genes, the median (0.236) was reduced by 11% (Fig. S4A, C) in the results of ‘free-ratio’ branch model. This pattern was consistent with the results of ‘two-ratio’ branch model as well (Fig. S4E-G).” on line 334-344.

      However, we kept the other parts “We also analyzed female-biased and unbiased genes that underwent positive and relaxed selection in floral buds (Tables S6-S10). We identified 216 (18.86%) positively selected, and 69 (6.03%) relaxed selective female-biased genes from 1,145 OGs, respectively. Similarly, we found 436 (18.33%) positively selected, and 43 (1.81%) unbiased genes under relaxed selection from 2,378 OGs, respectively. Notably, male-biased genes have a higher proportion (10%) of positively selected genes compared to female-biased and unbiased genes. However, relaxed selective male-biased genes have a higher proportion (3.24%) than unbiased genes, but about 0.8% lower than that of female-biased genes.”. In this way, we can compare the proportion of sex-biased genes that have undergone positive selection and release selection among female-biased genes, unbiased genes and male-biased genes in floral buds in the Discussion sections.

      Line 348: Here you talk about « Numerous studies », but then only report three studies. Please clarify.

      Authors’ response: Thank you for your suggestions. We have revised it to “Several studies”.

      Line 352: Cut the sentence: « In contrast, the wind-pollinated dioecious plant Populus balsamifera ... »

      Authors’ response: Thank you for your suggestions. We have revised it.

      Line 357: « In contrast to the above studies... »: If I understand correctly, this is not in contrast to the observation in Populus balsamifera. Please clarify.

      Authors’ response: Thank you for your suggestions. We have revised to “Similar to the above study of Populus balsamifera.”.

      Line 420: « our results » -> « we »; « that underwent » -> « undergoing »

      Authors’ response: Thank you for your suggestions. We have revised it.

      Figure 3 is very hard to read and poorly labeled (see my comments on line 194 above). It is also hard to link to the text, since the numbers reported in the text are actually not present in the figure unless the readers makes some calculations themselves. This should be improved. Also, the use of acronyms (e.g. M1BG, F2TG etc.) contributes to making the text very difficult to read. The acronyms should at least be explained very clearly in the text when they are used.

      Authors’ response: Thank you for your suggestions. We have revised the text to clarify the meaning of the acronym (F1TGs, F2TGs, M1TGs, M2TGs, F1BGs, F2BGs, M1BGs and M2BGs) and give the number of genes. We have added two labels, indicating that panels A and B correspond to males and C and D to females in Figure 3.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Review:

      Reviewer #2 (Public Review): 

      Regarding reviewer #2 public review, we update here our answers to this public review with new analysis and modification done in the manuscript. 

      This manuscript is missing a direct phenotypic comparison of control cells to complement that of cells expressing RhoGEF2-DHPH at "low levels" (the cells that would respond to optogenetic stimulation by retracting); and cells expressing RhoGEF2-DHPH at "high levels" (the cells that would respond to optogenetic stimulation by protruding). In other words, the authors should examine cell area, the distribution of actin and myosin, etc in all three groups of cells (akin to the time zero data from figures 3 and 5, with a negative control). For example, does the basal expression meaningfully affect the PRG low-expressing cells before activation e.g. ectopic stress fibers? This need not be an optogenetic experiment, the authors could express RhoGEF2DHPH without SspB (as in Fig 4G). 

      Updated answer: We thank reviewer #2 for this suggestion. PRG-DHPH overexpression is known to affect the phenotype of the cell as shown in Valon et al., 2017. In our experiments, we could not identify any evidence of a particular phenotype before optogenetic activation apart from the area and spontaneous membrane speed that were already reported in our manuscript (Fig 2E and SuppFig 2). Regarding the distribution of actin and myosin, we did not observe an obvious pattern that will be predictive of the protruding/retracting phenotype. Trying to be more quantitative, we have classified (by eye, without knowing the expression level of PRG nor the future phenotype) the presence of stress fibers, the amount of cortical actin, the strength of focal adhesions, and the circularity of cells. As shown below, when these classes are binned by levels of expression of PRG (two levels below the threshold and two above) there is no clear determinant. Thus, we concluded that the main driver of the phenotype was the PRG basal expression rather than any particularity of the actin cytoskeleton/cell shape.

      Author response image 1.

      Author response image 2.

      Relatedly, the authors seem to assume ("recruitment of the same DH-PH domain of PRG at the membrane, in the same cell line, which means in the same biochemical environment." supplement) that the only difference between the high and low expressors are the level of expression. Given the chronic overexpression and the fact that the capacity for this phenotypic shift is not recruitmentdependent, this is not necessarily a safe assumption. The expression of this GEF could well induce e.g. gene expression changes. 

      Updated answer: We agree with reviewer #2 that there could be changes in gene expression. In the next point of this supplementary note, we had specified it, by saying « that overexpression has an influence on cell state, defined as protein basal activity or concentration before activation. »  We are sorry if it was not clear, and we changed this sentence in the revised manuscript (in red in the supp note). 

      One of the interests of the model is that it does not require any change in absolute concentrations, beside the GEF. The model is thought to be minimal and fits well and explains the data with very few parameters. We do not show that there is no change in concentration, but we show that it is not required to invoke it. We revised a sentence in the new version of the manuscript to include this point.

      Additional answer: During the revision process, we have been looking for an experimental demonstration of the independence of the phenotypic switch to any change in global gene expression pattern due to the chronic overexpression of PRG. Our idea was to be in a condition of high PRG overexpression such that cells protrude upon optogenetic activation, and then acutely deplete PRG to see if cells where then retracting. To deplete PRG in a timescale that prevent any change of gene expression, we considered the recently developed CATCHFIRE (PMID: 37640938) chemical dimerizer. We designed an experiment in which the PRG DH-PH domain was expressed in fusion with a FIRE-tag and co-expressing the FIRE-mate fused to TOM20 together with the optoPRG tool. Upon incubation with the MATCH small molecule, we should be able to recruit the overexpressed PRG to the mitochondria within minutes, hereby preventing it to form a complex with active RhoA in the vicinity of the plasma membrane. Unfortunately, despite of numerous trials we never achieved the required conditions: we could not have cells with high enough expression of PRGFIRE-tag (for protrusive response) and low enough expression of optoPRG (for retraction upon PRGFIRE-tag depletion). We still think this would be a nice experiment to perform, but it will require the establishment of a stable cell line with finely tuned expression levels of the CATCHFIRE system that goes beyond the timeline of our present work.      

      Concerning the overall model summarizing the authors' observations, they "hypothesized that the activity of RhoA was in competition with the activity of Cdc42"; "At low concentration of the GEF, both RhoA and Cdc42 are activated by optogenetic recruitment of optoPRG, but RhoA takes over. At high GEF concentration, recruitment of optoPRG lead to both activation of Cdc42 and inhibition of already present activated RhoA, which pushes the balance towards Cdc42."

      These descriptions are not precise. What is the nature of the competition between RhoA and Cdc42? Is this competition for activation by the GEFs? Is it a competition between the phenotypic output resulting from the effectors of the GEFs? Is it competition from the optogenetic probe and Rho effectors and the Rho biosensors? In all likelihood, all of these effects are involved, but the authors should more precisely explain the underlying nature of this phenotypic switch. Some of these points are clarified in the supplement, but should also be explicit in the main text. 

      Updated answer: We consider the competition between RhoA and Cdc42 as a competition between retraction due to the protein network triggered by RhoA (through ROCK-Myosin and mDia-bundled actin) and the protrusion triggered by Cdc42 (through PAK-Rac-ARP2/3-branched Actin). We made this point explicit in the main text.  

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):  

      Major 

      - why this is only possible for such few cells. Can the authors comment on this in the discussion? Does the model provide any hints? 

      As said in our answer to the public comment or reviewer #1, we think that the low number of cells being able to switch can be explained by two different reasons: 

      (1) First, we were looking for clear inversions of the phenotype, where we could see clear ruffles in the case of the protrusion, and clear retractions in the other case. Thus, we discarded cells that would show in-between phenotypes, because we had no quantitative parameter to compare how protrusive or retractile they were. This reduced the number of switching cells 

      (2) Second, we had a limitation due to the dynamic of the optogenetic dimer used here. Indeed, the control of the frequency was limited by the dynamic of unbinding of the optogenetic dimer. This dynamic of recruitment (~20s) is comparable to the dynamics of the deactivation of RhoA and Cdc42. Thus, the differences in frequency are smoothed and we could not vary enough the frequency to increase the number of switches. Thanks to the model, we can predict that increasing the unbinding rate of the optogenetic tool (shorter dimer lifetime) should allow us to increase the number of switching cells. 

      We have added a sentence in the discussion to make this second point explicit.

      - I would encourage the authors to discuss this molecular signaling switch in the context of general design principles of switches. How generalizable is this network/mechanism? Is it exclusive to activating signaling proteins or would it work with inhibiting mechanisms? Is the competition for the same binding site between activators and effectors a common mechanism in other switches? 

      The most common design principle for molecular switches is the bistable switch that relies on a nonlinear activation (for example through cooperativity) with a linear deactivation. Such a design allows the switch between low and high levels. In our case, there is no need for a non-linearity since the core mechanism is a competition for the same binding site on active RhoA of the activator and the effectors. Thus, the design principle would be closer to the notion of a minimal “paradoxical component” (PMID: 23352242) that both activate and limit signal propagation, which in our case can be thought as a self-limiting mechanism to prevent uncontrolled RhoA activation by the positive feedback. Yet, as we show in our work, this core mechanism is not enough for the phenotypic switch to happen since the dual activation of RhoA and Cdc42 is ultimately required for the protrusion phenotype to take over the retracting one. Given the particularity of the switch we observed here, we do not feel comfortable to speculate on any general design principles in the main text, but we thank reviewer #1 for his/her suggestion.

      - Supplementary figures - there is a discrepancy between the figures called in the text and the supplementary files, which only include SF1-4. 

      We apologize for this error and we made the correction. 

      - In the text, the authors use Supp Figure 7 to show that the phenotype could not be switched by varying the fold increase of recruitment through changing the intensity/duration of the light pulse. Aside from providing the figure, could you give an explanation or speculation of why? Does the model give any prediction as to why this could be difficult to achieve experimentally (is the range of experimentally feasible fold change of 1.1-3 too small? Also, could you clarify why the range is different than the 3 to 10-fold mentioned at the beginning of the results section? 

      We thank the reviewer for this question, and this difference between frequency and intensity can be indeed understood in a simple manner through the model. 

      All the reactions in our model were modeled as linear reactions. Thus, at any timepoint, changing the intensity of the pulse will only change proportionally the amount of the different components (amount of active RhoA, amount of sequestered RhoA, and amount of active Cdc42). This explains why we cannot change the balance between RhoA activity and Cdc42 activity only through the pulse strength. We observed the same experimentally: when we changed the intensity of the pulses, the phenotype would be smaller/stronger, but would never switch, supporting our hypothesis on the linearity of all biochemical reactions. 

      On the contrary, changing the frequency has an effect, for a simple reason: the dynamics of RhoA and Cdc42 activation are not the same as the dynamics of inhibition of RhoA by the PH domain (see

      Figure 4). The inhibition of RhoA by the PH is almost instantaneous while the activation of RhoGTPases has a delay (sets by the deactivation parameter k_2). Intuitively, increasing the frequency will lead to sustained inhibition of RhoA, promoting the protrusion phenotype. Decreasing the frequency – with a stronger pulse to keep the same amount of recruited PRG – restricts this inhibition of RhoA to the first seconds following the activation. The delayed activation of RhoA will then take over. 

      We added two sentences in the manuscript to explain in greater details the difference between intensity and frequency.  

      Regarding the difference between the 1.3-3 fold and the 3 to 10 fold, the explanation is the following: the 3 to 10 fold referred to the cumulative amount of proteins being recruited after multiple activations (steady state amount reached after 5 minutes with one activation every 30s); while the 1.3-3 fold is what can be obtained after only one single pulse of activation.  

      - The transient expression achieves a large range of concentration levels which is a strength in this case. To solve the experimental difficulties associated with this, i.e. finding transfected cells at low cell density, the authors developed a software solution (Cell finder). Since this approach will be of interest for a wide range of applications, I think it would deserve a mention in the discussion part. 

      We thank the reviewer for his/her interest in this small software solution.

      We developed the description of the tool in the Method section. The Cell finder is also available with comments on github (https://github.com/jdeseze/cellfinder) and usable for anyone using Metamorph or Micromanager imaging software. 

      Minor 

      - Can the authors describe what they mean with "cell state"? It is used multiple times in the manuscript and can be interpreted as various things. 

      We now explain what we mean by ‘cell state’ in the main text :

      “protein basal activities and/or concentrations - which we called the cell state”

      - “(from 0% to 45%, Figure 2D)", maybe add here: "compare also with Fig. 2A". 

      We completed the sentence as suggested, which clarifies the data for the readers.

      - The sentence "Given that the phenotype switch appeared to be controlled by the amount of overexpressed optoPRG, we hypothesized that the corresponding leakiness of activity could influence the cell state prior to any activation." might be hard to understand for readers unfamiliar with optogenetic systems. I suggest adding a short sentence explaining dark-state activity/leakiness before putting the hypothesis forward. 

      We changed this whole beginning of the paragraph to clarify.

      - Figure 2E and SF2A. I would suggest swapping these two panels as the quantification of the membrane displacement before activation seems more relevant in this context. 

      We thank reviewer #1 for this suggestion and we agree with it (we swapped the two panels)

      - Fig. 2B is missing the white frames in the mixed panels. 

      We are sorry for this mistake, we changed it in the new version.  

      - In the text describing the experiment of Fig. 4G, it would again be helpful to define what the authors mean by cell state, or to state the expected outcome for both hypotheses before revealing the result.

      We added precisions above on what we meant by cell state, which is the basal protein activities and/or concentrations prior to optogenetic activation. We added the expectation as follow: 

      To discriminate between these two hypotheses, we overexpressed the DH-PH domain alone in another fluorescent channel (iRFP) and recruited the mutated PH at the membrane. “If the binding to RhoA-GTP was only required to change the cell state, we would expect the same statistics than in Figure 2D, with a majority of protruding cells due to DH-PH overexpression. On the contrary, we observed a large majority of retracting phenotype even in highly expressing cells (Figure 4G), showing that the PH binding to RhoA-GTP during recruitment is a key component of the protruding phenotype.”

      - Figure 4H,I: "of cells that overexpress PRG, where we only recruit the PH domain" doesn't match with the figure caption. Are these two constructs in the same cell? If not please clarify the main text. 

      We agree that it was not clear. Both constructs are in the same cell, and we changed the figure caption accordingly.  

      - "since RhoA dominates Cdc42" is this concluded from experiments (if yes, please refer to the figure) or is this known from the literature (if yes, please cite). 

      The assumption that RhoA dominates Cdc42 comes from the fact that we see retraction at low PRG concentration. We assumed that RhoA is responsible for the retraction phenotype. Our assumption is based on the literature (Burridge 2004 as an example of a review, confirmed by many experiments, such as the direct recruitment of RhoA to the membrane, see Berlew 2021) and is supported by our observations of immediate increase of RhoA activity at low PRG. We modified the text to clarify it is an assumption.

      - Fig. 6G  o left: is not intuitive, why are the number of molecules different to start with? 

      The number of molecules is different because they represent the active molecules: increasing the amount of PRG increases the amount of active RhoA and active Cdc42. We updated the figure to clarify this point.

      o right: the y-axis label says "phenotype", maybe change it to "activity" or add a second y-axis on the right with "phenotype"? 

      We updated the figure following reviewer #1 suggestion.

      - Discussion: "or a retraction in the same region" sounds like in the same cell. Perhaps rephrase to state retraction in a similar region? 

      Sorry for the confusion, we change it to be really clear: “a protrusion in the activation region when highly expressed, or a retraction in the activation region when expressed at low concentrations.”

      Typos: 

      - "between 3 and 10 fold" without s. 

      - Fig. 1H, y-axis label. 

      - "whose spectrum overlaps" with s. 

      - "it first decays, and then rises" with s. 

      - Fig 4B and Fig 6B. Is the time in sec or min? (Maybe double-check all figures). 

      - "This result suggests that one could switch the phenotype in a single cell by selecting it for an intermediate expression level of the optoPRG.". 

      - "GEF-H1 PH domain has almost the same inhibition ability as PRG PH domain". 

      We corrected all these mistakes and thank the reviewer for his careful reading of the manuscript.

      Reviewer #2 (Recommendations For The Authors): 

      Likewise, the model assumes that at high PRG GEF expression, the "reaction is happening far from saturation ..." and that "GTPases activated with strong stimuli -giving rise to strong phenotypic changes- lead to only 5% of the proteins in a GTP-state, both for RhoA and Cdc42". Given the high levels of expression (the absolute value of which is not known) this assumption is not necessarily safe to assume. The shift to Cdc42 could indeed result from the quantitative conversion of RhoA into its active state. 

      We agree with the reviewer that the hypothesis that RhoA is fully converted into its active state cannot be completely ruled out. However, we think that the two following points can justify our choice.

      - First, we see that even in the protruding phenotype, RhoA activity is increasing upon optoPRG recruitment (Figure 3). This means that RhoA is not completely turned into its active GTP-loaded state. The biosensor intensity is rising by a factor 1.5 after 5 minutes (and continue to increase, even if not shown here). For sure, it could be explained by the relocation of RhoA to the place of activation, but it still shows that cells with high PRG expression are not completely saturated in RhoA-GTP. 

      - We agree that linearity (no saturation) is still an hypothesis and very difficult to rule out, because it is not only a question of absolute concentrations of GEFs and RhoA, but also a question of their reaction kinetics, which are unknow parameters in vivo. Yet, adding a saturation parameter would mean adding 3 unknown parameters (absolute concentrations of RhoA, as well as two reaction constants). The fact that there are not needed to fit the complex curves of RhoA as we do with only one parameter tends to show that the minimal ingredients representing the interaction are captured here.  

      The observed "inhibition of RhoA by the PH domain of the GEF at high concentrations" could result from the ability of the probe to, upon membrane recruitment, bind to active RhoA (via its PH domain) thereby outcompeting the RhoA biosensor (Figure 4A-C). This reaction is explicitly stated in the supplemental materials ("PH domain binding to RhoA-GTP is required for protruding phenotype but not sufficient, and it is acting as an inhibitor of RhoA activity."), but should be more explicit in the main text. Indeed, even when PRG DHPH is expressed at high concentrations, it does activate RhoA upon recruitment (figure 3GH). Not only might overexpression of this active RhoA-binding probe inhibit the cortical recruitment of the RhoA biosensor, but it may also inhibit the ability of active RhoA to activate its downstream effectors, such as ROCK, which could explain the decrease in myosin accumulation (figure 3D-F). It is not clear that there is a way to clearly rule this out, but it may impact the interpretation. 

      This hypothesis is actually what we claim in the manuscript. We think that the inhibition of RhoA by the PH domain is explained by its direct binding. We may have missed what Reviewer #2 wanted to say, but we think that we state it explicitly in the main text :

      “Knowing that the PH domain of PRG triggers a positive feedback loop thanks to its binding to active RhoA 18, we hypothesized that this binding could sequester active RhoA at high optoPRG levels, thus being responsible for its inhibition.”

      And also in the Discussion:

      “However, this feedback loop can turn into a negative one for high levels of GEF: the direct interaction between the PH domain and RhoA-GTP prevents RhoA-GTP binding to effectors through a competition for the same binding site.”

      We may have not been clear, but we think that this is what is happening: the PH domain prevents the binding to effectors and decreases RhoA activity (as was shown in Chen et al. 2010).  

      The X-axis in Figure 4C time is in seconds not minutes. The Y-axis in Figure 4H is unlabeled. 

      We are sorry for the mistake of Figure 4C. We changed the Y-axis in the Figure 4h.  

      Although this publication cites some of the relevant prior literature, it fails to cite some particularly relevant works. For example, the authors state, "The LARG DH domain was already used with the iLid system" and refers to a 2018 paper (ref 19), whereas that domain was first used in 2016 (PMID 27298323). Indeed, the authors used the plasmid from this 2016 paper to build their construct. 

      We thank the reviewer for pointing out this error, we have corrected the citation and put the seminal one in the revised version.

      An analogous situation pertains to previous work that showed that an optogenetic probe containing the DH and PH domains in RhoGEF2 is somewhat toxic in vivo (table 6; PMID 33200987). Furthermore, it has previously been shown that mutation of the equivalent of F1044A and I1046E eliminates this toxicity (table 6; PMID 33200987) in vivo. This is particularly important because the Rho probe expressing RhoGEF2-DHPH is in widespread usage (76 citations in PubMed). The ability of this probe to activate Cdc42 may explain some of the phenotypic differences described resulting from the recruitment of RhoGEF2-DHPH and LARG-DH in a developmental context (PMID 29915285, 33200987). 

      We thank reviewer #2 for these comments, and added a small section in the discussion, for optogenetic users: 

      This underlines the attention that needs to be paid to the choice of specific GEF domains when using optogenetic tools. Tools using DH-PH domains of PRG have been widely used, both in mammalian cells and in Drosophila (with the orthologous gene RhoGEF2), and have been shown to be toxic in some contexts in vivo 28. Our study confirms the complex behavior of this domain which cannot be reduced to a simple RhoA activator.   

      Concerning the experiment shown in 4D, it would be informative to repeat this experiment in which a non-recruitable DH-PH domain of PRG is overexpressed at high levels and the DH domain of LARG is recruited. This would enable the authors to distinguish whether the protrusion response is entirely dependent on the cell state prior to activation or the combination of the cell state prior to activation and the ability of PRG DHPH to also activate Cdc42. 

      We thank the reviewer for his suggestion. Yet, we think that we have enough direct evidence that the protruding phenotype is due to both the cell state prior to activation and the ability of PRG DHPH to also activate Cdc42. First, we see a direct increase in Cdc42 activity following optoPRG recruitment (see Figure 6). This increase is sustained in the protruding phenotype and precedes Rac1 and RhoA activity, which shows that it is the first of these three GTPases to be activated. Moreover, we showed that inhibition of PAK by the very specific drug IPA3 is completely abolishing only the protruding phenotype, which shows that PAK, a direct effector of Cdc42 and Rac1, is required for the protruding phenotype to happen. We know also that the cell state prior to activation is defining the phenotype, thanks to the data presented in Figure 2. 

      We further showed in Figure 1 that LARG DH-PH domain was not able to promote protrusion. The proposed experiment would be interesting to confirm that LARG does not have the ability to activate another GTPase, even in a different cell state with overexpressed PRG. However, we are not sure it would bring any substantial findings to understand the mechanism we describe here, given the facts provided above.  

      Similarly, as PRG activates both Cdc42 and Rho at high levels, it would be important to determine the extent to which the acute Rho activation contributes to the observed phenotype (e.g. with Rho kinase inhibitor). 

      We agree with the reviewer that it would be interesting to know whether RhoA activation contributes to the observed phenotype, and we have tried such experiments. 

      For Rho kinase inhibitor, we tried with Y-27632 and we could never prevent the protruding phenotype to happen. However, we could not completely abolish the retracting phenotype either (even when the effect on the cells was quite strong and visible), which could be due to other effectors compensating for this inhibition. As RhoA has many other effectors, it does not tell us that RhoA is not required for protrusion. 

      We also tried with C3, which is a direct inhibitor of RhoA. However, it had too much impact on the basal state of the cells, making it impossible to recruit (cells were becoming round and clearly dying. As both the basal state and optogenetic activation require the activation of RhoA, it is hard to conclude out of experiments where no cell is responding. 

      The ability of PRG to activate Cdc42 in vivo is striking given the strong preference for RhoA over Cdc42 in vitro (2400X) (PMID 23255595). Is it possible that at these high expression levels, much of the RhoA in the cell is already activated, so that the sole effect that recruited PRG can induce is activation of Cdc42? This is related to the previous point pertaining to absolute expression levels.  

      As discussed before, we think that it is not only a question of absolute expression levels, but also of the affinities between the different partners. But Reviewer #2 is right, there is a competition between the activation of RhoA and Cdc42 by optoPRG, and activation of Cdc42 probably happens at higher concentration because of smaller effective affinity.

      Still, we know that activation of the Cdc42 by PRG DH-PH domain is possible in vivo, as it was very clearly shown in Castillo-Kauil et al., 2020 (PMID 33023908). They show that this activation requires the linker between DH and PH domain of PRG, as well as Gαs activation, which requires a change in PRG DH-PH conformation. This conformational switch does not happen in vitro, which might explain why the affinity against Cdc42 was found to be very low. 

      Minor points 

      In both the abstract and the introduction the authors state, "we show that a single protein can trigger either protrusion or retraction when recruited to the plasma membrane, polarizing the cell in two opposite directions." However, the cells do not polarize in opposite directions, ie the cells that retract do not protrude in the direction opposite the retraction (or at least that is not shown). Rather a single protein can trigger either protrusion or retraction when recruited to the plasma membrane, depending upon expression levels. 

      We thank the reviewer for this remark, and we agree that we had not shown any data supporting a change in polarization. We solved this issue, by showing now in Supplementary Figure 1 the change in areas in both the activated and in the not activated region. The data clearly show that when a protrusion is happening, the cell retracts in the non-activated region. On the other hand, when the cell retracts, a protrusion happens in the other part of the cell, while the total area is staying approximately constant. 

      We added the following sentence to describe our new figure:

      Quantification of the changes in membrane area in both the activated and non-activated part of the cell (Supp Figure 1B-C) reveals that the whole cell is moving, polarizing in one direction or the other upon optogenetic activation.

      While the authors provide extensive quantitative data in this manuscript and quantify the relative differences in expression levels that result in the different phenotypes, it would be helpful to quantify the absolute levels of expression of these GEFs relative to e.g. an endogenously expressed GEF. 

      We agree with the reviewer comment, and we also wanted to have an idea of the absolute level of expression of GEFs present in these cells to be able to relate fluorescent intensities with absolute concentrations. We tried different methods, especially with the purified fluorescent protein, but having exact numbers is a hard task.

      We ended up quantifying the amount of fluorescent protein within a stable cell line thanks to ELISA and comparing it with the mean fluorescence seen under the microscope. 

      We estimated that the switch concentration was around 200nM, which is 8 times more than the mean endogenous concentration according to https://opencell.czbiohub.org/, but should be reachable locally in wild type cell, or globally in mutated cancer cells. 

      Given the numerical data (mostly) in hand, it would be interesting to determine whether RhoGEF2 levels, cell area, the pattern of actin assembly, or some other property is most predictive of the response to PRG DHPH recruitment. 

      We think that the manuscript made it clear that the concentration of PRG DHPH is almost 100% predictive of the response to PRG DHPH. We believe that other phenotypes such as the cell area or the pattern of actin assembly would only be consequences of this. Interestingly, as experimentators we were absolutely not able to predict the behavior by only seeing the shape of the cell, event after hundreds of activation experiments, and we tried to find characteristics that would distinguish both populations with the data in our hands and could not find any.

      There is some room for general improvement/editing of the text. 

      We tried our best to improve the text, following reviewers suggestions.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewing Editor's comments:

      There appears to be several mistakes/missing details in the additional statistical analyses reported in their response to Reviewer #'1 comments:

      (1) Detecting differentially expressed genes (DEGs):

      Reviewer #1 suggested adding an interaction term between sex and environment (ethnicity) in identifying DEGs. The authors performed ANCOVA analysis with sex and ethnicity as covariates (but not the interaction) and found sex explained more variance. This is not what the reviewer asked for, and the results do not help identify DEGs.

      We understand the reviewer’s suggestion about identification of DEGs using sex × ethnicity interaction. However, we could not find an appropriate tool to make such analysis, though we have carefully searched it in the literature. It should be noted that the interaction analysis between sex and environment was only designed to study genotype data rather than gene expression data. Besides, considering that we have added multiple covariates in our DEG detection, adding an interaction term between sex and environment (ethnicity) in identifying DEGs make the formulation too complex to resolve using current tools. Alternatively, we have made a linear regression model to test the explanation of sex for DEG detection in the revision (see details below). We would appreciate if the reviewer could provide any available tools, or previous studies conducting interaction analysis for DEG identification.

      (2) Overlap between DEGs and genes under positive selection in Tibetans (TSNGs)

      The authors claimed that the overlaps are significantly enriched in "sex-combined" set (p=0.048) and "male-only" set (p=9e-4), but it seems that the authors calculated the p-values incorrectly. Based on the histogram shown in Fig 3R (left penal), at least 750 out of 10,000 permutations led to 4 genes in overlap and there are additional permutations with 5 or more genes in overlap, so the p-value for the sex-combined set cannot be 0.048. In addition, the permutation procedure is somewhat questionable: it is unclear whether randomly sampling 192 genes from the human genome is reasonable choice, without matching for relevant gene features.

      As we explained in the response to Reviewer-1, we agree with the reviewer’s point that random sampling of genes in permutation should be extracted from genes expressed in each tissue rather than the entire genome. Based on this updated random sampling procedure, we redid the analysis, and our previous conclusions remain unchanged.

      (3) Polygenic adaptation signal based on eQTL information:

      The PolyGraph method is designed for highly polygenic traits with causal variants spread across the genome. However, the genetic architecture of the expression of a gene is much less polygenic with at most few cis- eQTLs per gene, so the PolyGraph model does not apply for expression of individual genes. On the other hand, eQTLs for different genes are associated with different "traits", so they cannot be simply aggregated together for PolyGraph analysis. Based on the Methods description, it is unclear how the authors ran the PolyGraph analysis on eQTLs practically and whether this practice is appropriate for detecting polygenic adaptation signal on gene expression.

      We understand the reviewer’s concern on polygenic adaptation analysis. In this study, we tested whether the estimated polygenic scores from eQTLs (estimated using sums of allele frequencies at independent eQTLs weighted by their effect sizes) were significantly enriched in Tibetans compared to other populations. The detailed descriptions of polygenic test are provided in the response to Reviewer-1.

      Reviewer #1 (Public Review):

      The revised manuscript new presented 1) a permutation-based test for the significance of the overlap between DEGs and genes with positive selection signals in Tibetans, and 2) polygenic adaptation test for the eQTLs. I make my suggestions in detail as below:

      Major Comments

      (1) My previous concern regarding the DEG analysis remains unresolved. Although the authors agreed in their response that the difference between the male- and female-specific DEGs are insufficient to the difference between sex-combined and sex-specific DEGs (Figure S6). However, the results section still states the opposite pattern between males and females as a decisive reason for the difference (p. 9, lines 236-239). Again, I would like to recommend the authors to test alternative ways of analysis to boost statistical power for DEG detection other than simply splitting data into males and females and performing analysis in each subset. For example, the authors may consider utilizing gene by environment interaction analysis schemes here biological sex as an environmental factor.

      To evaluate the effect of gene expression of each layer by sex, we adopted two strategies: 1) to calculate the variance explained by sex from the expression data; 2) to evaluate the statistical significance of association between sex from the expression data.

      Firstly, we observed a significantly higher variance explained by sex than by ethnicity in six layers of the placenta (see details in our previous response to reviewers).

      Then, we performed a linear regression model to test whether gender affects the gene expression. For each gene, a linear regression model was made by using R glm function with sex as covariates: glm (gene expression ~ sex). We discovered 5,865 genes significantly associated with sex, and most of them were located on the sex chromosomes. We observed 62.63% genes overlapped with those genes with opposite differential directions between the sex-combined and the sex-specific analyses.

      Considering the opposite direction of DEGs is likely only one of the explanations for the discrepancy between the sex-combined and the sex-specific DEGs, and there might be alternative mechanism for this phenomenon, we have tune down the description of this point in the revised manuscript:

      “Considering 62.63% of DEGs (248/396) with an opposite direction of between-population expression divergence in males and females, respectively (Figure S6), we reckon that there might be other factors such as sample size or cell composition affecting the identification of DEGs, which could cancel out the differences in the sex-combined analysis.” (Page 9)

      (2) Multiple testing schemes are still sub-optimal in some cases. Most of all, the p-values in the WGCNA analysis (p. 11), the authors corrected for the number of traits (n=12) after adjusting for the correlation between them. However, they did not mention whether they counted for the number of modules they tested at all (n=136 and 161 for males and females, respectively). Whether they account for the number of modules will make a substantial difference in the significance threshold, please incorporate and describe a proper multiple testing scheme for this analysis.

      We understand the reviewer’s point. Indeed, for multiple testing schemes, we considered both the number of traits and the number of modules. For the number of modules, multiple testing correction is already imbedded in WGCNA, as described in the published studies (Li et al. 2018; Zeng et al. 2023).

      (3) Evidence for natural selection on the observed DEG pattern is still weak and not properly described.

      (1) For the overlap between DEGs and TSNGs, the authors introduced a permutation-based test, but used a total set of genes in the human genome as a comparison set (p. 25, lines 699-700). I believe that the authors should sample random sets of genes from those already expressed in each tissue to make a fair comparison.

      We agree with the reviewer’s point that random sampling of genes in permutation should be extracted from genes expressed in each tissue, which is a fair comparison between the observed and the simulated counts of the overlapped genes.

      Therefore, for each permutation, we randomly extracted 192 genes from all the placenta expressed genes identified from the seven layers (17,284 genes in total), and we overlapped them with DEGs of the three sets (female + male, female only, and male only) and counted the gene numbers. After 10,000 permutations, we constructed a null distribution for each set, and found that the overlaps between DEGs and TSNGs were significantly enriched in the “sex-combined” set (p-value = 0.0123) and the “male-only” set (p-value < 1e-4), but not in the “female-only” set (p-value = 0.0572) (Figure R1). This result suggests that the observed DEGs are significantly enriched in TSNGs when compared to the set of random sampling, especially for the DEGs from the “male-only” set.

      Author response image 1.

      The distribution of 10,000 permutation tests of counts of the overlapped genes between 192 TSNGs and the DEGs randomly selected from the expressed genes in the placenta. The red-dashed lines indicate the observed values based on the randomly selected DEGs.

      (2) The entire polygraph analysis for polygenic adaptation is poorly described. The current version of the Methods does not clarify i) for which genes the eQTLs are discovered, 2) how the authors performed the eQTL analysis, iii) how the authors polarized the effect, and iv) how they set up a comparison between the eQTLs and the others.

      Considering the RNA-seq data of placenta mostly represent the transcriptomes of the newborns according to our analysis on maternal-fetal compositions of each dissected layer, we conducted eQTL analysis using the fetal genotypes and the placental tissue gene expression data (TPM) using R package MatrixEQTL (https://github.com/andreyshabalin/MatrixEQTL), and the altitude and maternal age were taken as covariates. We take a window 1 Mb upstream and 1 Mb downstream around each SNP to select genes or expression probes to test. Associations between these SNP–gene combinations are calculated using linear model. This tool can distinguish local (cis-) and distant (trans) eQTLs. We performed separate corrections for multiple testing.

      Finally, we detected 5,251 eQTLs (involving 319 eGenes), covering the SNPs significantly associated with gene expression (p-value < 5e-8). To identify the signatures of polygenic selection in Tibetans using eQTL information, we removed those SNPs in linkage disequilibrium (r2 > 0.2 in 1000 Genome Project) and obtained 176 independent eQTLs as input into PolyGraph (Racimo et al. 2018). QB (Racimo et al. 2018) and QX (Berg and Coop 2014) framework are used in Polygraph to determine whether the estimated polygenic scores exhibit more variance among populations than null expectation under genetic drift, by retrieving the summary statistics from the eQTL set.

      In this study, we focused on testing whether the estimated polygenic scores from eQTLs (estimated using sums of allele frequencies at independent eQTLs weighted by their effect sizes) were significantly enriched in Tibetans compared to other populations. The significance was evaluated by comparing to 10,000 sets of the control SNPs. Each set of control SNPs was randomly drawn from the genomic SNPs, and contained an equal number of SNPs as the eQTLs matched one-to-one by minor allele frequency.

      The PolyGraph result showed that Tibetans have a clear signature of polygenic selection on gene expression (Bonferroni-corrected p-value = 0.003, Figure S12). In other words, the frequency of alleles associated with gene expression (up-regulation or down-regulation) were specifically enriched in Tibetans, a signal of positive selection.

      Minor comments (1) In Figure S1, the amount of variance explained by PC1 and PC2 need to be corrected. PC1 explains less variance than PC2 (0.11 vs 0.68%).

      It was a typing error that mixed up the variances between PC1 and PC2. We have corrected it in the revised version.

      (2) In the section "Sex-biased expression divergence ..." (p. 8), the authors are using the term "gender" instead of sex. Considering that they are talking about the biological sex of each infant, I believe that sex is a more appropriate term to be used than gender.

      Following the reviewer’s suggestion, we rephrased “gender” as “sex” in the revised manuscript to describe the biological differences between females and males.

      Reviewer #3 (Public Review):

      More than 80 million people live at high altitude. This impacts health outcomes, including those related to pregnancy. Longer-lived populations at high altitudes, such as the Tibetan and Andean populations show partial protection against the negative health effects of high altitude. The paper by Yue sought to determine the mechanisms by which the placenta of Tibetans may have adapted to minimise the negative effect of high altitude on fetal growth outcomes. It compared placentas from pregnancies from Tibetans to those from the Han Chinese. It employed RNAseq profiling of different regions of the placenta and fetal membranes, with some follow-up of histological changes in umbilical cord structure and placental structure. The study also explored the contribution of fetal sex in these phenotypic outcomes.

      A key strength of the study is the large sample sizes for the RNAseq analysis, the analysis of different parts of the placenta and fetal membranes, and the assessment of fetal sex differences.

      A main weakness is that this study, and its conclusions, largely rely on transcriptomic changes informed by RNAseq. Changes in genes and pathways identified through bioinformatic analysis were not verified by alternate methods, such as by western blotting, which would add weight to the strength of the data and its interpretations. There is also a lack of description of patient characteristics, so the reader is unable to make their own judgments on how placental changes may link to pregnancy outcomes. Another weakness is that the histological analyses were performed on n=5 per group and were rudimentary in nature.

      For the three weaknesses raised by the reviewer, here are our responses:

      (1) Considering that our conclusions largely rely on the transcriptomic data, we agree with reviewer that more experiments are needed to validate the results from our transcriptomic data. However, this study was mainly aimed to provide a transcriptomic landscape of high-altitude placenta, and to characterize the gene-expression difference between native Tibetans and Han migrants. The molecular mechanism exploration is not the main task of this study, and more validation experiments are warranted in the future.

      (2) For the lack of description of patient characteristics, actually, we provided three-level results on the placental changes of Tibetans: macroscopic phenotypes (higher placental weight and volume), histological phenotypes (larger umbilical vein walls and umbilical artery intima and media; lower syncytial knots/villi ratios) and transcriptomic phenotypes (DEG and differential modules). Combined with the previous studies, these placenta changes suggest a better reproductive outcome. For example, the placenta volume shows a significantly positive correlation with birth weight (R = 0.31, p-value = 2.5e-16), therefore, the larger placenta volume of Tibetans is beneficial to fetal development at high altitude. In addition, the larger umbilical vein wall and umbilical artery intima and media of Tibetans can explain their adaptation in preventing preeclampsia.

      (3) For the sample size of histological analyses, we understand the reviewer’s concern that 5 vs. 5 samples are not very large in histological analyses. This is because it was difficult to collect high-altitude Han placenta samples, and we only got 13 Han samples, from which we selected 5 infant sex matched samples.

      Minor point:

      I feel the authors have responded well to the other reviewer comments. However, I am disappointed that the authors did not address my comment related to the validation of their RNAseq data. In particular, they failed to add new data that verifies and supports their RNAseq findings on pathways affected. This is imperative as their conclusions are based solely on the RNAseq analysis. The only other comment I have is that they should add a description of all abbreviations, including those in the supplementary information (like Table S12).

      For experimental validation of transcriptome, we understand the concern of reviewer. However, as we mentioned before, this study was mainly aimed to provide a transcriptomic landscape of high-altitude placenta, the molecular mechanism exploration is not the main task of this study, and more validation experiments are warranted in the future. Actually, we have tune down the description of power from transcriptomic data for explanation of biological difference, and called for the further functional validations in the future:

      “the transcriptome data is insufficient to explain the underlying molecular mechanisms of genetic adaptation in Tibetans. Future single-cell transcriptome analysis and functional validations of the candidate genes are warranted to reveal the responsible cell types and the molecular pathways.” (highlighted in Page 20)

      For abbreviations of the manuscript, according to the reviewer’s suggestion, we added descriptions of all abbreviations of this study in corresponding position (Table S1 and S12).

      References

      Berg JJ, and Coop G (2014). A population genetic signal of polygenic adaptation. PLoS Genet 10(8): e1004412.

      Li J, et al. (2018). Application of Weighted Gene Co-expression Network Analysis for Data from Paired Design. Sci Rep 8(1): 622.

      Racimo F, Berg JJ, and Pickrell JK (2018). Detecting Polygenic Adaptation in Admixture Graphs. Genetics 208(4): 1565-1584.

      Zeng JF, et al. (2023). Functional investigation and two-sample Mendelian randomization study of neuropathic pain hub genes obtained by WGCNA analysis. Frontiers in Neuroscience 17.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #2 (Public review):

      Dipasree Hajra et al demonstrated that Salmonella was able to modulate the expression of Sirtuins (Sirt1 and Sirt3) and regulate the metabolic switch in both host and Salmonella, promoting its pathogenesis. The authors found Salmonella infection induced high levels of Sirt1 and Sirt3 in macrophages, which were skewed toward the M2 phenotype allowing Salmonella to hyper-proliferate. Mechanistically, Sirt1 and Sirt3 regulated the acetylation of HIF-1alpha and PDHA1, therefore mediating Salmonella-induced host metabolic shift in the infected macrophages. Interestingly, Sirt1 and Sirt3-driven host metabolic switch also had an effect on the metabolic profile of Salmonella. Counterintuitively, inhibition of Sirt1/3 led to increased pathogen burdens in an in vivo mouse model. Overall, this is a well-designed study.<br /> The revised manuscript has addressed all of the previous comments. The re-analysis of flow cytometry and WB data by authors makes the results and conclusion more complete and convincing.

      We are immensely grateful to the reviewer for improving the strength of the manuscript by providing insightful comments and for appreciating the work.

      Reviewer #3 (Public review):

      Summary:

      In this paper Hajra et al have attempted to identify the role of Sirt1 and Sirt3 in regulating metabolic reprogramming and macrophage host defense. They have performed gene knock down experiments in RAW macrophage cell line to show that depletion of Sirt1 or Sirt3 enhances the ability of macrophages to eliminate Salmonella Typhimurium. However, in mice inhibition of Sirt1 resulted in dissemination of the bacteria but the bacterial burden was still reduced in macrophages. They suggest that the effect they have observed is due to increased inflammation and ROS production by macrophages. They also try to establish a weak link with metabolism. They present data to show that the switch in metabolism from glycolysis to fatty acid oxidation is regulated by acetylation of Hif1a, and PDHA1.

      Strengths:

      The strength of the manuscript is that the role of Sirtuins in host-pathogen interactions have not been previously explored in-depth making the study interesting. It is also interesting to see that depletion of either Sirt1 or Sirt3 result in a similar outcome.

      Weaknesses:

      The major weakness of the paper is the low quality of data, making it harder to substantiate the claims. Also, there are too many pathways and mechanisms being investigated. It would have been better if the authors had focussed on either Sirt1 or Sirt3 and elucidated how it reprograms metabolism to eventually modulate host response against Salmonella Typhimurium. Experimental evidences are also lacking to prove the proposed mechanisms. For instance they show correlative data that knockdown of Sirt1 mediated shift in metabolism is due to HIF1a acetylation but this needs to be proven with further experiments.

      As the public review of the reviewer remains unaltered as the previous version without further recommendations for authors, we are sticking to our former author’s response. We respect the reviewer’s opinion and thank the reviewer for the critical analysis of our work.

      ---------

      The following is the authors’ response to the previous reviews.

      Reviewer #2 (Public Review):

      Dipasree Hajra et al demonstrated that Salmonella was able to modulate the expression of Sirtuins (Sirt1 and Sirt3) and regulate the metabolic switch in both host and Salmonella, promoting its pathogenesis. The authors found Salmonella infection induced high levels of Sirt1 and Sirt3 in macrophages, which were skewed toward the M2 phenotype allowing Salmonella to hyper-proliferate. Mechanistically, Sirt1 and Sirt3 regulated the acetylation of HIF-1alpha and PDHA1, therefore mediating Salmonella-induced host metabolic shift in the infected macrophages. Interestingly, Sirt1 and Sirt3-driven host metabolic switch also had an effect on the metabolic profile of Salmonella. Counterintuitively, inhibition of Sirt1/3 led to increased pathogen burdens in an in vivo mouse model. Overall, this is a well-designed study.

      Comments on revised version:

      The authors have performed additional experiments to address the discrepancy between in vitro and in vivo data. While this offers some potential insights into the in vivo role of Sirt1/3 in different cell types and how this affects bacterial growth/dissemination, I still believe that Sirt1/3 inhibitors could have some effect on the gut microbiota contributing to increased pathogen counts. This possibility can be discussed briefly to give a better scenario of how Sirt1/3 inhibitors work in vivo. Additionally, the manuscript would improve significantly if some of the flow cytometry analysis and WB data could be better analyzed.

      We are highly grateful for your valuable and insightful comments. Thank you for appreciating the merit of our manuscript. As rightly pointed out by the eminent reviewer, we acknowledge the probable link of Sirtuin on gut microbiota and its effect on increased bacterial loads as indicated by previous literature studies (PMID: 22115311, PMID: 19228061). These reports suggested that a low dose of Sirt1 activator, resveratrol treatment in rats for 25 days treatment under 5% DSS induced colitis condition led to alterations in gut microbiota profile with increased lactobacilli and bifidobacteria alongside reduced abundance of enterobacteria. This study correlates with our study wherein we have detected enhanced Salmonella (belonging to Enterobacteriaceae family) loads under both Sirt1/3 in vivo knockdown condition or inhibitor-treated condition in C57BL/6 mice and reduced burden under Sirt-1 activator treatment SRT1720.

      As per your valid suggestion, we have discussed this possibility in our discussion section. (Line- 541-548).

      We have incorporated the suggestions for the improvement in the analysis of WB data and flow cytometry.

      Reviewer #3 (Public Review):

      Summary:

      In this paper Hajra et al have attempted to identify the role of Sirt1 and Sirt3 in regulating metabolic reprogramming and macrophage host defense. They have performed gene knock down experiments in RAW macrophage cell line to show that depletion of Sirt1 or Sirt3 enhances the ability of macrophages to eliminate Salmonella Typhimurium. However, in mice inhibition of Sirt1 resulted in dissemination of the bacteria but the bacterial burden was still reduced in macrophages. They suggest that the effect they have observed is due to increased inflammation and ROS production by macrophages. They also try to establish a weak link with metabolism. They present data to show that the switch in metabolism from glycolysis to fatty acid oxidation is regulated by acetylation of Hif1a, and PDHA1.

      Strengths:

      The strength of the manuscript is that the role of Sirtuins in host-pathogen interactions has not been previously explored in-depth making the study interesting. It is also interesting to see that depletion of either Sirt1 or Sirt3 results in a similar outcome.

      Weaknesses:

      The major weakness of the paper is the low quality of data, making it harder to substantiate the claims. Also, there are too many pathways and mechanisms being investigated. It would have been better if the authors had focussed on either Sirt1 or Sirt3 and elucidated how it reprograms metabolism to eventually modulate host response against Salmonella Typhimurium. Experimental evidence is also lacking to prove the proposed mechanisms. For instance they show correlative data that knock down of Sirt1 mediated shift in metabolism is due to HIF1a acetylation but this needs to be proven with further experiments.

      We appreciate the reviewer’s critical analysis of our work. In the revised manuscript, we aimed to eliminate the low-quality data sets and have tried to substantiate them with better and conclusive ones, as directed in the recommendations for the author section. We agree with the reviewer that the inclusion of both Sirtuins 1 and 3 has resulted in too many pathways and mechanisms and focusing on one SIRT and its mechanism of metabolic reprogramming and immune modulation would have been a less complicated alternative approach. However, as rightly pointed out, our work demonstrated the shared and few overlapping roles of the two sirtuins, SIRT1 and SIRT3, together mediating the immune-metabolic switch upon Salmonella infection. As per the reviewer’s suggestion, we have performed additional experiments with HIF-1α inhibitor treatment in our revised manuscript to substantiate our correlative findings on SIRT1-mediated regulation of host glycolysis (Fig.7G). We wanted to clarify our claim in this regard. Our results suggested that loss of SIRT1 function triggered increased host glycolysis alongside hyperacetylation of HIF-1α. HIF-1α is reported to be one of the important players in glycolysis regulation (Kierans SJ, Taylor CT. Regulation of glycolysis by the hypoxia-inducible factor (HIF): implications for cellular physiology. J Physiol. 2021;599(1):23-37. doi:10.1113/JP280572.) and additionally, SIRT1 has been shown to regulate HIF-1α acetylation status (Lim JH, Lee YM, Chun YS, Chen J, Kim JE, Park JW. Sirtuin 1 modulates cellular responses to hypoxia by deacetylating hypoxia-inducible factor 1 alpha. Mol Cell. 2010;38(6):864-878. doi:10.1016/j.molcel.2010.05.023.) Further, ectopic expression of SIRT1 has been demonstrated to reduce glycolysis by negatively regulating HIF-1α. (Wang Y, Bi Y, Chen X, et al. Histone Deacetylase SIRT1 Negatively Regulates the Differentiation of Interleukin-9-Producing CD4(+) T Cells. Immunity. 2016;44(6):1337-1349. doi:10.1016/j.immuni.2016.05.009). We have subsequently shown in Fig. 7G, that the increase in host glycolysis upon SIRT knockdown in the infected macrophages gets lowered upon HIF-1α inhibitor treatment, suggesting that one of the mechanisms of SIRT-mediated regulation of host glycolysis is via regulation of HIF-1α. However, this warrants further future mechanistic research.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      (1) Figures 8I-S: are only viable cells used for analysis? Please provide gating strategy used for these analyses.

      (2) Many changes seen in WB seem to be marginal. Since the authors used densitometric plot to quantify the band intensities, I expect these experiments were repeated at least three times. Please indicate the number of repeats. For instance, Figures 7C, 7I (UI SCR vs UI shSIRT3), 7J, show marginal changes or no changes. What do other WB images look like? Are they more convincing than the ones currently shown? Please provide them in the response letter.

      (3) Figure 7C: label is a bit misleading. Please relabel the figure title to Acetylated HIF vs total levels

      (4) Figure 7J: which band is AcPDHA1?

      (1) We are highly apologetic for not clarifying our gating strategy for the analysis.

      We initially gated the viable splenocyte population based on Forward scatter (FSC) and Side Scatter (SSC). This gated population was further subjected to gating based on cell FSC-H (height) versus FSC-A (area). Subsequently, the population was gated as per SSC-A and GFP (expressed by intracellular bacteria) based on the autofluorescence exhibited by the uninfected control (Fig. 8I-J).

      Author response image 1.

      UNINFECTED

      Author response image 2.

      VEHICLE CONTROL INFECTED

      Author response image 3.

      EX-527 INFECTED

      Author response image 4.

      3TYP INFECTED

      Author response image 5.

      SRT 1720 INFECTED

      For gating different cell types such as F4/80 (PE) positive population in Fig. 8K-L, the viable cell population was gated based on SSC-A versus PE-A to gate the macrophage population. These macrophage populations were gated further based on GFP (Salmonella) + population to obtain the percentage of macrophage population harboring GFP+ bacteria. Similar strategies were followed for other cell types as depicted in Fig. 8M-S, Fig. S8.

      (2) We agree with the reviewer’s concern with the marginal changes in the western blots (Figures 7C, 7I (UI SCR vs UI shSIRT3), 7J). As per the suggestions, we have provided the alternate blot images and have indicated the number of repeats in the manuscript. The alternate blot images are provided herewith:

      Author response image 6.

      Alternate blot images for Fig. 7B-C

      Author response image 7.

      Alternate blot images for Fig. 7I, J

      (1) We are highly thankful to the reviewer for recommending this suggestion. We have made the necessary modifications of relabelling Fig. C to Acetylated HIF-1α over total HIF-1α as per the suggestion.

      (2) 7J Acetylated PDHA1 has been duly pointed as per the suggestion. We are extremely apologetic for the inconvenience caused.

      Author response image 8.

      Reviewer #3 (Recommendations For The Authors):

      The authors have done some work to improve the manuscript. However, the data presented lacks clarity.

      Fig 4B: I still do not see a change in Ac p65 in the less saturated blot. It looks reduced as the band is distorted. I am not sure how this could be quantified.

      Fig S2 b-actin bands are hyper saturated, and it is not possible to decipher the knockdown efficiency. It is probably better to provide a ponceau staining similar to S2C. The band intensity values are out of place.

      Fig 5F HADHA blot: Lane 1 expression appears to be significantly higher than lane 3, but the values mentioned do not match the intensity of the bands.

      It is hard to interpret the authors' claim that the shift in metabolism is HIF1a-dependent.

      Fig 7B: I would expect HIF1a acetylation to be increased in UI ShSIRT1 compared to UI SCR. The blot shows reduced HIF1a acetylation.

      Fig 7D: SIRT1 immunoprecipitates with HIF1a equally under all conditions. Is this what the authors expect? Labelling of the blots are not clear. It looks like the bottom SIRT1 blot is from Beads IgG control.

      Fig 7H: How does PDHA1 interact with SIRT3 so strongly in shSIRT3 cells (lane 2)?

      Authors have mentioned in their response that a knockdown of 40% has been achieved in the uninfected but the blot does not reflect that. SIRT3 expression seems to be more in the knockdown.

      Blots are also not labelled properly especially Input. The lanes are not marked.

      We thank the reviewer for acknowledging the improvements in the revised version and for suggesting further clarifications and improvements.

      We have tried to incorporate the specified modifications to the best of our abilities in the revised manuscript.

      We are highly apologetic for the inconclusive blot image in the figure 4B. We have provided an alternative blot image with better clarity for Fig.4B used for quantification analysis.

      Author response image 9.

       

      As per the reviewer’s valuable suggestions, we have provided the ponceau image in the Fig. S2B.

      We thank the reviewers for rightly pointing out the discrepancy in the band intensity quantification in the Fig. 5F. We have re-evaluated the intensities on imageJ and have provided with the correct band intensities. We are highly apologetic for the inaccuracies.

      As per the reviewer’s previous suggestion, we have performed additional experiments with HIF-1α inhibitor treatment in our revised manuscript to substantiate our correlative findings on SIRT1-mediated regulation of host glycolysis (Fig.7G). We wanted to clarify our claim in this regard. Our results suggested that loss of SIRT1 function triggered increased host glycolysis alongside hyperacetylation of HIF-1α. HIF-1α is reported to be one of the important players of glycolysis regulation (Kierans SJ, Taylor CT. Regulation of glycolysis by the hypoxia-inducible factor (HIF): implications for cellular physiology. J Physiol. 2021;599(1):23-37. doi:10.1113/JP280572.) and additionally, SIRT1 has been shown to regulate HIF-1α acetylation status (Lim JH, Lee YM, Chun YS, Chen J, Kim JE, Park JW. Sirtuin 1 modulates cellular responses to hypoxia by deacetylating hypoxia-inducible factor 1alpha. Mol Cell. 2010;38(6):864-878. doi:10.1016/j.molcel.2010.05.023.) Further, ectopic expression of SIRT1 has been demonstrated to reduce glycolysis by negatively regulating HIF-1α. (Wang Y, Bi Y, Chen X, et al. Histone Deacetylase SIRT1 Negatively Regulates the Differentiation of Interleukin-9-Producing CD4(+) T Cells. Immunity. 2016;44(6):1337-1349. doi:10.1016/j.immuni.2016.05.009). We have subsequently shown in Fig. 7G, that the increase in host glycolysis upon SIRT knockdown in the infected macrophages gets lowered upon HIF-1α inhibitor treatment, suggesting that one of the mechanisms of SIRT-mediated regulation of host glycolysis is via regulation of HIF-1α. However, this warrants further future mechanistic research.

      We agree with the reviewer’s claim of increased HIF-1α acetylation in the UI sh1 versus UI SCR. The apparent reduced acetylation depicted in UI sh1 in Fig. 7B could be attributed to lower HIF-1α levels in the UI sh1 compared to UI SCR. Therefore, we have provided an alternate blot image that been used for quantification in Fig. 7C (Author response image 6).

      To answer the reviewer’s question in Fig. 7D, we have noticed more or less equal degree of immunoprecipitation of HIF-1α under pull down of HIF-1α in all the sample cohorts under conditions of SIRT1 inhibitor treatment. However, we have observed reduced interaction of HIF-1α with SIRT1 in the infected sample upon SIRT1 inhibitor treatment.

      We thank the reviewers for suggesting improvements in the blot labelling and for raising this concern. We have corrected the blot labelling to avoid the previous confusion.

      We appreciate the reviewer’s concern and therefore we have provided an alternate blot image for Fig. 7H which might address the previous stated concern wherein we have achieved an enhanced SIRT3 knockdown percentage.

      We are extremely apologetic for the improper labelling of the Input blot with unmarked lanes. We have addressed this issue by labelling the lanes in the input section of the blots.

    1. Author Response

      The following is the authors’ response to the previous reviews

      We appreciate the positive comments from the editors and reviewers. The followings are the point to point responses to the questions and comments of the Reviewers:

      Reviewer #1 (Public Review):

      In this study, Jiamin Lin et al. investigated the potential positive feedback loop between ZEB2 and ACSL4, which regulates lipid metabolism and breast cancer metastasis. They reported a correlation between high expression of ZEB2 and ACSL4 and poor survival of breast cancer patients, and showed that depletion of ZEB2 or ACSL4 significantly reduced lipid droplets abundance and cell migration in vitro. The authors also claimed that ZEB2 activated ACSL4 expression by directly binding to its promoter, while ACSL4 in turn stabilized ZEB2 by blocking its ubiquitination. While the topic is interesting, there are several concerns with the study:

      1. My concern regarding the absence of appropriate thresholds or False Discovery Rate (FDR) adjustments for the RNA-seq analysis has not been addressed, leading to incorrect thresholds and erroneous identification of significant signals.

      Response: We thank the reviewer for the concern about the RNA-seq analysis. RNA-seq data was analyzed by the Benjamini and Hochberg’s approach for controlling the false discovery rate. The procedure of RNA-seq bioinformatic analysis is as follows: For data analysis, raw data of fastq format were firstly processed through in-house perl scripts. In this step, clean data were obtained by removing reads containing adapter, reads containing N base and low quality reads from raw data. All the downstream analyses were based on the clean data with high quality. Index of the reference genome was built using Hisat2 v2.0.5 and paired-end clean reads were aligned to the reference genome using Hisat2 v2.0.5. FeatureCounts v1.5.0-p3 was used to count the reads numbers mapped to each gene, and then FPKM of each gene was calculated based on the length of the gene and reads count mapped to this gene. Differential expression analysis of two conditions/groups was performed using the DESeq2 R package (1.20.0). The resulting P-values were adjusted using the Benjamini and Hochberg’s approach for controlling the false discovery rate. Genes with an adjusted P-value (<0.05) found by DESeq2 were assigned as differentially expressed.

      1. In Figure 3B and C, it appears that the knockdown efficiency of ACSL4 is inadequate in these cells, which contradicts the Western blot results presented in Figure 2F.

      Response: We thank the reviewer for the concern. In figure 3B and 3C, we use the shRNA for the knockdown experiment and in Figure 2F we use siRNA for the knockdown experiment, so the efficiency of them were different.

      1. Regarding Figure 6, the discovery of consensus binding sequences (CACCT) for ZEB2 alone is insufficient evidence to support the direct binding of ZEB2 to the ACSL4 promoter.

      Response: We thank the reviewer for the concern. We performed chromatin immunoprecipitation (ChIP), which examines the direct interaction between DNA and protein, to test if ZEB2 directly binds to the ACSL4 promoter. The results showed that the primer set 1, which covered -184 to -295 of ACSL4 promoter region exhibited apparent ZEB2 binding (Fig. 6F). Moreover, the mutant sequence (AAAA) of ACSL4 promoter showed significant decreased luciferase activity (Fig. 7H). All these evidences suggest that ZEB2 directly bond to the consensus sequence of ACSL4 promoter.

      1. For Figure 7E, there are multiple bands present, and it appears that ZEB2-HA has been cropped, which should ideally be presented with unaltered raw data. Please provide the uncropped raw data.

      Response: We thank the reviewer for the concern. The raw data of the figure 7E ZEB2-HA is shown in Author response image 1:

      Author response image 1.

      1. In Figure 7C, the author claimed to have used 293T cells for the ubiquitin assay, which are not breast cancer cells. Moreover, the efficiency of over-expression differs between ZEB2 and ACSL4 in 293T cell lines. Performing the experiment in an unrelated cell line to justify an important interaction is not acceptable.

      Response: We thank the reviewer for the concern. We also performed the ubiquitination assay in MDA-MB-231 cells in Fig 7D (Author response image 2), The results confirm that knockdown of ACSL4 obviously enhanced the ubiqutination of ZEB2. We also have performed the IP experiment in MDA-MB-231 cells in Author response image 3 (Fig 7F). The results confirmed the interaction between ZEB2 and ACSL4:

      Author response image 2.

      Author response image 3.

      Reviewer #2 (Public Review):

      In this study, the authors validated a positive feedback loop between ZEB2 and ACSL4 in breast cancer, which regulates lipid metabolism to promote metastasis.

      Overall, the study is original, well structured, and easy to read.

      We appreciate the positive comments from the reviewer.

      Reviewer #3 (Public Review):

      The manuscript by Lin et al. reveals a novel positive regulatory loop between ZEB2 and ACSL4, which promotes lipid droplets storage to meet the energy needs of breast cancer metastasis.

      We appreciate the positive comments from the reviewer.

      Reviewer #2 (Recommendations For The Authors):

      I still have some points that should be addressed by the Authors:

      The interaction between ACSL4 and ZEB2 is still not convincing, due to the cellular localization of ACSL4 and ZEB2 is different. The authors should consider utilizing the Duolink experiment to more accurately determine the interaction location of these two proteins in cells.

      Response: We appreciate the reviewer’s suggestion. We performed GST pull-down assay to examine whether ZEB2 and ACSL4 form a complex. GST pull-down assay confirmed the interaction of ZEB2 and ACSL4 (Supplementary Fig. S10). We also performed immunofluorescence assay and found that ZEB2 was co-localized with ACSL4 in some certain regions of the cytoplasm in Author response image 5 (Supplementary Fig. S11):

      Author response image 4.

      Author response image 5.

      In Figure S4, the authors showed both "shACSL4" and "siACSL4", which is a description error.

      Response: We appreciate the reviewer to point out the mistake. We have corrected the "siACSL4" into "shACSL4".

      Author response image 6.

      Reviewer #3 (Recommendations For The Authors):

      The manuscript is improved.

      We appreciate the positive comments from the reviewer.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Responses to Reviewer #1:

      Reviewer #1: The study shows a new mechanism of NFkB-p65 regulation mediated by Vangl2-dependent autophagic targeting. Autophagic regulation of p65 has been reported earlier; this study brings an additional set of molecular players involved in this important regulatory event, which may have implications for chronic and acute inflammatory conditions.

      Comments on the revised version:

      The authors have addressed the earlier concerns and I am satisfied with the revised version. I have no additional comments to make.

      We appreciate the reviewer’s comments on our revised manuscript.

      Responses to Reviewer #2:

      Reviewer #2: Vangl2, a core planar cell polarity protein involved in Wnt/PCP signaling, cell proliferation, differentiation, homeostasis, and cell migration. Vangl2 malfunctioning has been linked to various human ailments, including autoimmune and neoplastic disorders. Interestingly, it was shown that Vangl2 interacts with the autophagy regulator p62, and autophagic degradation limits the activity of inflammatory mediators, such as p65/NF-κB. However, the possible role of Vangl2 in inflammation has not been investigated. In this manuscript, Lu et al. describe that Vangl2 expression is upregulated in human sepsis-associated PBMCs and that Vangl2 mitigates experimental sepsis in mice by negatively regulating p65/NF-κB signaling in myeloid cells. Their mechanistic studies further revealed that Vangl2 recruits the E3 ubiquitin ligase PDLIM2 to promote K63-linked poly-ubiquitination of p65. Vangl2 also facilitated the recognition of ubiquitinated p65 by the cargo receptor NDP52. These molecular processes caused selective autophagic degradation of p65. Indeed, abrogation of PDLIM2 or NDP52 functions rescued p65 from autophagic degradation, leading to extended p65/NF-κB activity in myeloid cells. Overall, the manuscript presents convincing evidence for novel Vangl2-mediated control of inflammatory p65/NF-kB activity. The proposed pathway may expand interventional opportunities restraining aberrant p65/NF-kB activity in human ailments.

      IKK is known to mediate p65 phosphorylation, which instructs NF-kB transcriptional activity. In this manuscript, Vangl2 deficiency led to an increased accumulation of phosphorylated p65 and IKK also at 30 minutes post-LPS stimulation; however, autophagic degradation of p-p65 may not have been initiated at this early time point. Therefore, this set of data put forward the exciting possibility that Vangl2 could also be regulating the immediate early phase of inflammatory response involving the IKK-p65 axis - a proposition that may be tested in future studies.

      We appreciate the reviewer’s comments on our manuscript, and we have added the discussion about IKK-p65 axis in revised version. (Page 15, lines 467-474)

      Responses to Reviewer #3:

      Reviewer #3: Lu et al. describe Vangl2 as a negative regulator of inflammation in myeloid cells. The primary mechanism appears to be through binding p65 and promoting its degradation, albeit in an unusual autolysosome/autophagy dependent manner. Overall, these findings are novel, valuable and the crosstalk of PCP pathway protein Vangl2 with NF-kappaB is of interest. While generally solid, some concerns still remain about the rigor and conclusions drawn.

      Comments on the revised version:

      (1) Lu et al. address my comments through responses and new experimental data. However, some of the explanations provided are inadequate.

      However, in response to my enquiry regarding directly exploring PCP effects, the authors simply assert "Our study revealed that Vangl2 recruits the E3 ubiquitin ligase PDLIM2 to facilitate K63-linked ubiquitination of p65, which is subsequently recognized by autophagy receptor NDP52 and then promotes the autophagic degradation of p65. Our findings by using autophagy inhibitors and autophagic-deficient cells indicate that Vangl2 regulates NFkB signaling through a selective autophagic pathway, rather than affecting the PCP pathway, WNT, HH/GLI, Fat-Dachsous or even mechanical tension."

      I do not agree that the use of autophagy inhibitors and autophagy-deficient cells can rule out the contributions of PCP or any other pathways. Only experimentally inhibiting the pathway(s) with adequate demonstration of target inhibition/abolition of well-known effector function and documenting unaltered p65 regulation under these conditions can be considered proof. Autophagy inhibitors and autophagy-deficient cells only prove that this particular pathway is necessary. Nonetheless, I do not want to dwell on proving a negative and agree that Vangl2 is a novel regulator of p65 through its role in promoting p65 degradation. The inclusion of a statement discussing the limitations of their approach would have sufficed. The response from the authors could have been better.

      We thank the reviewer for helping us improve the quality of the manuscript. We provided new data and revised the Discussion as suggested.

      To ascertain whether Vangl2 degrades p65 through a selective autophagic pathway or the PCP pathway, 293T cells were transfected with p65, together with or without the Vangl2 plasmids, and treated with different pharmacological inhibitors. We found the degradation of p65 induced by Vangl2 was blocked by autolysosome inhibitor (CQ), but not by the JNK inhibitor (SP600125) or Wnt/β-catenin inhibitor (FH535) (New Figure. 1). These data suggest that Vangl2 primarily degrades p65 through a selective autophagic pathway, rather than through the JNK or Wnt signaling pathway. Nevertheless, additional pathway inhibitions, such as those of the HH/GLI and Fat-Dachsous pathways, should also be employed to further elucidate the function of Vangl2 in p65 degradation. As suggested, we have added a statement about the limitation of the approach in the discussion (Page 12, lines 378-385).

      Author response image 1.

      Vangl2 degrades p65 through a selective autophagic pathway, but not by the PCP pathway. HEK293T cells were transfected with Flag-p65 and HA-Vangl2 plasmids, and treated with DMSO, CQ (50 mM) for 6 h, SP600125 (20 mM) for 1 h or FH535 (30 mM) for 6 h. The cell lysates were analyzed by immunoblot.

      (2) I am also not satisfied with the explanation that "immune cells represent a minor fraction of the lungs and liver". There are lots of resident immune cells in the lungs and liver (alveolar macrophages in the lung and Kuppfer cells in the liver). For example, it may be so that Vangl2 is important in monocytes and not in the resident population. This might be a potential explanation. But this is not explored. The restricted tissue-specificity of the interaction between two ubiquitously present proteins is still a challenge to understand. The response from the authors is not satisfactory. There is plenty of Vangl2 in the liver in their western blot.

      We thank the reviewer for this question. We added this explanation in the Discussion. (Page 13, lines 398-404)

      (3) I had also simply pointed out PMID: 34214490 with reference to the findings described in the manuscript. There were no suggestions of contradiction. In fact, I would refer to the publication in discussion to support the findings and stress the novelty. The response from the authors could have been better.

      Thank you for the reviewer's insightful comments. We have modified this discussion as suggested. (Page 13, lines 410-415; Page 14, lines 419-421)

      (4) The response to my enquiry regarding homo- or heterozygosity is unsupported by any reference or data.

      As suggested, we provided the data that only Vangl2 deficient homozygous showed inhibition of the activation of NF-kB in New Figure. 2.

      Author response image 2.

      Vangl2 deficiency promotes NF-kB activation. (A) The survival rates of WT, Vangl2ΔM/ΔM and Vangl2ΔM/WT mice treated with high-dosage of LPS (30 mg/kg, i.p.) (n≥4). (B) IL-6 and TNF-a secretion by WT and Vangl2-deficient BMDMs treated with LPS for 6 h was measured by ELISA. IL-1β secretion by WT, Vangl2ΔM/ΔM and Vangl2ΔM/WT BMDMs treated with LPS for 6 h and ATP for 30 min was measured by ELISA.

      (5) The listing of 8 patients and healthy controls are also appreciated. The body temperature of #6 doesn't fall in the <36 or >38 degree C SIRS criteria. The inclusion of CRP, PCT, heart rate and respiratory rate, and other lab values would have further improved the inclusion criteria. Moreover, it is difficult to understand why there are 16 value points for healthy and sepsis cohorts in Fig 1 when there are 8 patients.

      We thank the reviewer for this valuable suggestion. We are sorry for our mistake that we entered data from two repeated experiments in Figure. 1 A and we have revised this data in the updated version (Figure. 1 A, Pages 12 Lines 146). As suggested, we have added CRP, WBC and heart rate in sepsis patients’ information. (Supplementary Materials and Methods)

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      The proposition that Vangl2 may target additional mediators of inflammation could be indicated in the text.

      We thank the reviewer for this valuable suggestion. We had added discussion in modified version. (Page 15, lines 467-474)

      Reviewer #3 (Recommendations For The Authors):

      It is advised that some of the deficiencies pointed out by Reviewer #3 are textually addressed. Additionally, there could be some inconsistency in the number of healthy controls and patients (see Fig S1A and FIg 1A and Supplementary table, also see comments from Reviewer #3) - this should be carefully scrutinised and revised, if necessary.

      We thank the reviewer for this valuable suggestion. We are sorry for our mistake that we entered data from two repeated experiments in Figure. 1 A and we have revised this data in the updated version (Figure. 1 A, Pages 12 Lines 146).

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      Audio et al. measured cerebral blood volume (CBV) across cortical areas and layers using high-resolution MRI with contrast agents in non-human primates. While the non-invasive CBV MRI methodology is often used to enhance fMRI sensitivity in NHPs, its application for baseline CBV measurement is rare due to the complexities of susceptibility contrast mechanisms. The authors determined the number of large vessels and the areal and laminar variations of CBV in NHP, and compared those with various other metrics.

      Strengths:

      Noninvasive mapping of relative cerebral blood volume is novel for non-human primates. A key finding was the observation of variations in CBV across regions; primary sensory cortices had high CBV, whereas other higher areas had low CBV. The measured CBV values correlated with previously reported neuronal and receptor densities.

      We appreciate your recognition of the novelty of our non-invasive relative cerebral blood volume (CBV) mapping in non-human primates, as well as the observed areal variations and their correlations with neuronal and receptor densities. However, we are concerned that key contributions of our work—such as cortical layer-specific vasculature mapping and benchmarking surface vessel density estimations against anatomical ground truth—are being framed as limitations rather than significant advances in the field pushing the boundaries of current neuroimaging capabilities and providing a valuable foundation for future research. Additionally, we would like to clarify that dynamic susceptibility contrast (DSC) MRI using gadolinium is the gold standard for CBV measurement in clinical settings and the argument that “baseline CBV measurements are rare due to the complexities of susceptibility contrast” is simply not true. The limited use of ferumoxytol for CBV imaging is primarily due to previous FDA regulatory restrictions, rather than inherent methodological shortcomings.

      Changes in text:

      Compared to clinically used gadolinium-based agents, ferumoxytol's substantially longer half-life and stronger R<sub>2</sub>* effect allows for higher-resolution and more sensitive vascular volume measurements (Buch et al., 2022), albeit these methodologies are hampered by confounding factors such as vessel orientation relative to the magnetic field (B<sub>0</sub>) direction (Ogawa et al., 1993).

      Weaknesses:

      A weakness of this manuscript is that the quantification of CBV with postprocessing approaches to remove susceptibility effects from pial and penetrating vessels is not fully validated, especially on a laminar scale. Further specific comments follow.

      (1) Baseline CBV indices were determined using contrast agent-enhanced MRI (deltaR<sub>2</sub>*). Although this approach is suitable for areal comparisons, its application at a laminar scale poses challenges due to significant contributions from large vessels including pial vessels. The primary concern is whether large-vessel contributions can be removed from the measured deltaR<sub>2</sub>* through processing techniques.

      Eliminating the contribution of large vessels completely is unlikely, and we agree with the reviewer that ΔR<sub>2</sub>* results likely reflect a weighted combination of signals from both large vessels and capillaries. However, the distribution of ΔR<sub>2</sub>* more closely aligns with capillary density in areas V1–V5 than with large vessel distributions (Weber et al., 2008), suggesting that our ΔR<sub>2</sub>* results are more weighted toward capillaries. Moreover, we demonstrated that the pial vessel induced signal-intensity drop-outs are clearly limited to the superficial layers and exhibit smaller spatial extent than generally thought (Supp. Figs. 2 and 4).

      (2) High-resolution MRI with a critical sampling frequency estimated from previous studies (Weber 2008, Zheng 1991) was performed to separate penetrating vessels. However, this approach is still insufficient to accurately identify the number of vessels due to the blooming effects of susceptibility and insufficient spatial resolution. The reported number of penetrating vessels is only applicable to the experimental and processing conditions used in this study, which cannot be generalized.

      Our intention was not to suggest that our measurements provide a general estimate of vessel density across the macaque cerebral cortex. At 0.23 mm isotropic resolution, we successfully delineated approximately 30% of the penetrating vessels in V1. Our primary objective was to demonstrate a proof-of-concept quantifiable measurement rather than to establish a generalized vessel density metric for all brain regions. We have consistently emphasized this throughout the manuscript, but if there is a specific point of misunderstanding, we would be happy to consider revisions for clarity.

      (3) Baseline R<sub>2</sub>* is sensitive to baseline R<sub>2</sub>, vascular volume, iron content, and susceptibility gradients. Additionally, it is sensitive to imaging parameters; higher spatial resolution tends to result in lower R<sub>2</sub>* values (closer to the R<sub>2</sub> value). Thus, it is difficult to correlate baseline R<sub>2</sub>* with physiological parameters.

      The observed correlation between R<sub>2</sub>* and neuron density is likely indirect, as R<sub>2</sub>* is strongly influenced by iron, myelin, and deoxyhemoglobin densities. However, the robust correlation between R<sub>2</sub>* and neuron density, peaking in the superficial layers (R = 0.86, p < 10<sup>-10</sup>), is striking and difficult to ignore (revised Supp. Fig. 6D-E). Upon revision, we identified an error in Supp. Fig. 6D-E, where the previous version used single-subject R<sub>2</sub>* and ΔR<sub>2</sub>* maps instead of the group-averaged maps. The revised correlations are slightly stronger than in the earlier version.

      Given that the correlation between neuron density and R<sub>2</sub>* is strongest in the superficial layers, we suggest this relationship reflects an underlying association with tissue cytochrome oxidase (CO) activity and cumulative effect of deoxygenated venous blood drainage toward the pial network. The superficial cortical layers are also less influenced by myelin and iron densities, which are more concentrated in the deeper cortical layers. Additional factors may contribute to this relationship, including the iron dependence of mitochondrial CO activity, as iron is an essential component of CO’s heme groups. Moreover, myelin maintenance depends on iron, which is predominantly stored in oligodendrocytes. The presence of myelinated thin axons and a higher axonal surface density may, in turn, be a prerequisite for high neuron density.

      In this context, it is also valuable to note the absolute range of superficial R<sub>2</sub>* values (≈ 6 s<sup>-1</sup>; Supp. Fig. 6D). This variation in cortical surface R<sub>2</sub>* is about 12-30 times larger compared to the signal changes observed during task-based fMRI (6 vs. 0.2-0.5 s<sup>-1</sup>). This relation seems reasonable because regional increases in absolute blood flow associated with imaging signals, as measured by PET, typically do not exceed 5%–10% of the brain's resting blood flow (Raichle and Mintum 2016; Brain work and brain imaging). The venous oxygenation level is typically 60%, with task-induced activation increasing it by only a few percent. We suggest that this is ~40% oxygen extraction is reflected in the superficial R<sub>2</sub>*. Finally, the large intercept (≈ 14.5 1/s; Supp. Fig. 6D), which is not equivalent to the water R<sub>2</sub>* (≈ 1 1/s), suggests that R<sub>2</sub>* is influenced by substantial non-neuron density factors, such as receptor, myelin, iron, susceptibility gradients and spatial resolution.

      The R<sub>2</sub>* values are well known to be influenced by intra-voxel phase coherence and thus spatial resolution. However, our view is that the proposed methodology of acquiring cortical-layer thickness adjusted high-resolution (spin-echo) R<sub>2</sub> maps poses more methodological limitations and is less practical. Notwithstanding, to further corroborate the relationship between R<sub>2</sub>* and neuron density, we investigated whether a similar correlation exists in non-quantitative T2w SPACE-FLAIR images (0.32 mm isotropic) signal-intensity and neuron density. Using B<sub>1</sub> bias-field and B<sub>0</sub> orientation bias corrected T2w SPACE-FLAIR images (N=7), we parcellated the equivolumetric surface maps using Vanderbilt sections. Our findings showed that signal intensity—where regions with high signal intensity correspond to low R<sub>2</sub> values, and areas with low signal intensity correspond to high R<sub>2</sub> values—was positively correlated with neuron density, particularly in the superficial layers (R = 0.77, p = 10<sup>-11</sup>; Author response image 1).This analysis confirmed the correlation with neuron density and R<sub>2</sub> peaks at superficial layers. However, this correlation was slightly weaker compared to quantitative R<sub>2</sub>* (Supp. Fig. 6D), suggesting the variable flip-angle spin-echo train refocused signal-phase coherence loss from large draining vessels or that non-quantitative T2w-FLAIR images may be confounded by other factors such as B<sub>1</sub> transmission field biases (Glasser et al., 2022). Notwithstanding, this non-quantitative fast spin-echo with variable flip-angles approach, which is in principle less dependent on image resolution and closer to R<sub>2,intrinsic</sub> than R<sub>2</sub>*, yields similar findings in comparison to quantitative gradient-echo.

      Author response image 1.

      (A) T2w-FLAIR SPACE normalized signal-intensity plotted vs neuron density. Note that low signal-intensity corresponds to high R<sub>2</sub> and high neuron density, consistent with findings using ME-GRE. (B) Correlation between T2w-FLAIR SPACE and neuron density across equivolumetric layers. Notably, a similar relationship with neuron density was observed using a variable spin-echo pulse sequence as with quantitative gradient-echo-based imaging.

      Changes in text:

      Results:

      “Because the Julich cortical area atlas covers only a section of the cerebral cortex, and the neuron density estimates are interpolated maps, we extended our analysis using the original Collins sample borders encompassing the entire cerebral cortex (Supp. Fig. 6A-C). This analysis reaffirmed the positive correlation with ΔR<sub>2</sub>* (peak at EL2, R = 0.80, p < 10<sup>-11</sup>) and baseline R<sub>2</sub>* (peak at EL2a, R = 0.86, p < 10<sup>-13</sup>), yielding linear coefficients of ΔR<sub>2</sub>* = 102 × 10<sup>3</sup> neurons/s and R<sub>2</sub>* = 41 × 10<sup>3</sup> neurons/s (Supp. Fig. 6D-G). This suggests that the sensitivity of quantitative layer R<sub>2</sub>* MRI in detecting neuronal loss is relatively weak, and the introduction of the Ferumoxytol contrast agent has the potential to enhance this sensitivity by a factor of 2.5.”

      A new paragraph was added into discussion section 4.3 corroborating the relation between R<sub>2</sub>* and neuron density:

      “Another key finding of this study was the strong correlation between baseline R<sub>2</sub>* and neuron density (Supp. Fig. 6D, E). While R<sub>2</sub>* is well known to be influenced by iron, myelin, and deoxyhemoglobin densities, this correlation peaks in the superficial layers (Supp. Fig. 6E), suggesting a link to CO activity and the accumulation of deoxygenated venous blood draining from all cortical layers toward the pial network. Notably, the absolute range of superficial R<sub>2</sub>* values (max - min ≈ 6 s<sup>-1</sup>; Supp. Fig. 6D) is approximately 12-30 times larger than the ΔR<sub>2</sub>* observed during task-based BOLD fMRI at 3T (0.2-0.5 1/s) (Yablonskiy and Haacke 1994). Since venous oxygenation is around 60% and task-induced changes in blood flow account for only 5%–10% of the brain's resting blood flow (Raichle & Mintun, 2006), these results suggest that superficial R<sub>2</sub>* (Fig. 1D) may serve as a more accurate proxy for total deoxyhemoglobin content (and thus total oxygen consumption), which scales with the neuron density of the underlying cortical gray matter. Importantly, superficial layers may also provide a more specific measure of deoxyhemoglobin, as they are less influenced by myelin and iron, which are more concentrated in deeper cortical layers. Additionally, smaller but direct contributors, such as mitochondrial CO density—an iron-dependent factor—may also play a role in this relationship.”

      References:

      Raichle, M.E., Mintun, M.A., 2006. BRAIN WORK AND BRAIN IMAGING. Annu. Rev. Neurosci. 29, 449–476. https://doi.org/10.1146/annurev.neuro.29.051605.112819

      (4) CBV-weighted deltaR<sub>2</sub>* is correlated with various other metrics (cytoarchitectural parcellation, myelin/receptor density, cortical thickness, CO, cell-type specificity, etc.). While testing the correlation between deltaR<sub>2</sub>* and these other metrics may be acceptable as an exploratory analysis, it is challenging for readers to discern a causal relationship between them. A critical question is whether CBV-weighted deltaR<sub>2</sub>* can provide insights into other metrics in diseased or abnormal brain states.

      We acknowledge that having multivariate analysis using dense histological maps would be valuable to establish causality among these several metrics:

      “To comprehensively understand the factors contributing to the vascular organization of the brain, experimental disentanglement through multivariate analysis of laminar cell types and receptor densities is needed (Hayashi et al., 2021, Froudist-Walsh et al., 2023). Moreover, employing more advanced statistical modeling, including considerations for synapse-neuron interactions, may be important for refined evaluations.”

      We think the primary contributors to the brain's energy budget are neurons and receptors, as shown in several references and stated in the manuscript. To investigate relationship between neuron density and CBV, we estimated the energy budget allocated to neurons and extrapolated the remaining CBV to other contributing factors:

      Changes in text:

      “However, this is a simplified estimation, and a more comprehensive assessment would need to account for an aggregate of biophysical factors such as neuron types, neuron membrane surface area, firing rates, dendritic and synaptic densities (Fig. 6F-G), neurotransmitter recycling, and other cell types (Kageyama 1982; Elston and Rose 1997; Perge et al., 2009; Harris et al., 2012). Indeed, the majority of the mitochondria reside in the dendrites and synaptic transmission is widely acknowledged to drive the majority of the energy consumption and blood flow (Wong-Riley, 1989; Attwell et al., 2001).

      Extrapolating cortical ΔR<sub>2</sub>* to zero neuron density results in a large intercept (~35 1/s), corresponding to 60% of the maximum cortical CBV (57 1/s; Supp. Fig. 6F). This supports the view that the majority of energy consumption occurs in the neuropil—comprising dendrites, synapses, and axons—which accounts for ~80–90% of cortical gray matter volume, whereas neuronal somata constitute only ~10–20% (Wong-Riley, 1989). Although neuronal cell bodies exhibit higher CO activity per unit volume due to their dense mitochondrial content, these results suggest their overall contribution to the total CBV per mm<sup>3</sup> tissue remains lower than that of the neuropil, given the latter's substantially larger volume fraction in cortical tissue.

      Contrary to our initial expectations, we observed a relatively smaller CBV in regions and layers with high receptor density (Fig. 6B, D, F). This relationship extends to other factors, such as number of spines (putative excitatory inputs) and dendrite tree size across the entire cerebral cortex (Supp. Fig. 7) (Froudist-Walsh et al., 2023, Elston 2007). These results align with the work of Weber and colleagues, who reported a similar negative correlation between vascular length density and synaptic density, as well as a positive correlation with neuron density in macaque V1 across cortical layers (Weber et al., 2008).”

      Variations in neurons and receptors are reflected in cytoarchitecture, myelin (axon density likely scales with neuron density and myelin inhibits synaptic connections), and cell-type composition. For example, fast-spiking parvalbumin interneurons, which target the soma or axon hillock, are well-suited for regulating activity in regions with high neuron density, whereas bursting calretinin interneurons, which target distal dendrites, are more adapted to areas with high synaptic density. These factors in turn, gradually change along the cortical hierarchy level (higher levels have thinner cortical layer IV, more complex dendrite trees and more numerous inter-areal connectivity patterns). In our view, these factors are tightly interlinked and explain the strong correlations and metabolic demands observed across different metrics.

      We also agree that cortical layer imaging of vasculature in diseased or abnormal brain states is an intriguing direction for future research; however, it falls beyond the scope of the present study.

      Reviewer #2 (Public review):

      Summary:

      This manuscript presents a new approach for non-invasive, MRI-based, measurements of cerebral blood volume (CBV). Here, the authors use ferumoxytol, a high-contrast agent and apply specific sequences to infer CBV. The authors then move to statistically compare measured regional CBV with known distribution of different types of neurons, markers of metabolic load and others. While the presented methodology captures and estimated 30% of the vasculature, the authors corroborated previous findings regarding lack of vascular compartmentalization around functional neuronal units in the primary visual cortex.

      Strengths:

      Non invasive methodology geared to map vascular properties in vivo.

      Implementation of a highly sensitive approach for measuring blood volume.

      Ability to map vascular structural and functional vascular metrics to other types of published data.

      Weaknesses:

      The key issue here is the underlying assumption about the appropriate spatial sampling frequency needed to captures the architecture of the brain vasculature. Namely, ~7 penetrating vessels / mm2 as derived from Weber et al 2008 (Cer Cor). The cited work, begins by characterizing the spacing of penetrating arteries and ascending veins using vascular cast of 7 monkeys (Macaca mulatta, same as in the current paper). The ~7 penetrating vessels / mm2 is computed by dividing the total number of identified vessels by the area imaged. The problem here is that all measurements were made in a "non-volumetric" manner and only in V1. Extrapolating from here to the entire brain seems like an over-assumption, particularly given the region-dependent heterogeneity that the current paper reports.

      We appreciate the reviewer’s concerns regarding spatial sampling frequency and its implications for characterizing brain vasculature, which we investigated in this study. To clarify, our analysis of surface vessel density was explicitly restricted to V1 precisely due to the limitations of our experimental precision. While we reported the total number of vessels identified in the cortex, we intentionally chose not to present density values across regions in this manuscript. Although these calculations are feasible, we focused on the data directly analyzed and avoided extrapolating density values beyond the scope of our findings. Thus, we are uncertain about the suggestion that we extrapolated vessel density values across the entire brain, as we have taken care to limit our conclusions of our vessel density precision to V1.

      Regarding methodology, we conducted two independent analyses of vessel density specifically in V1. The first involved volumetric analysis using the Frangi filter, while the second used surface-based analysis of local signal-intensity gradients (as illustrated in Fig. 2E and Supp. Figs. 3 and 4), albeit the final surface density analysis is performed using the ultra-high resolution equivolumetric layers. Notably, these two approaches produced consistent and comparable vessel density estimates, supporting the reliability of our findings within the scope of V1 (we found 30% of the vessels relative to the ground-truth).

      Comments on revisions:

      I appreciate the effort made to improve the manuscript. That said, the direct validation of the underlying assumption about spatial resolution sampling remains unaddressed in the final version of this manuscript. With the only intention to further strengthen the methodology presented here, I would encourage again the authors to seek a direct validation of this assumption for other brain areas.

      In their reply, the authors stated "... line scanning or single-plane sequences, at least on first impression, seem inadequate for whole-brain coverage and cortical surface mapping. ". This seems to emanate for a misunderstanding as the method could be used to validate the mapping, not to map per-se.

      We apologize for any misunderstanding in our previous response and appreciate your clarification. We now understand that you were suggesting the use of line-scanning or single-plane sequences as a method to validate, rather than map, our spatial sampling assumptions.

      We agree that single-plane sequences at very high in-plane resolution (e.g., 50 × 50 × 1000 µm) have great potential to detect penetrating vessels and even vessel branching patterns. These techniques could indeed provide valuable insights into region-specific vessel density variations which could then be used to validate whole brain 3D acquisitions. However, as noted above, we have refrained from reporting vessel densities outside V1 precisely due to sampling limitations (we only found 30% of the penetrating vessels in V1, or only 2 mm<sup>2</sup>/30mm<sup>2</sup> ≈ 7% of branching vessel ground-truth, see discussion).

      We acknowledge the merit of incorporating such methods to validate regional vessel densities and agree that this would be an important avenue for future research. Thank you for suggesting this point, we have briefly mentioned the advantage of single-plane EPI at discussion.

      Changes in text:

      “4.1 Methodological considerations - vessel density informed MRI

      …anatomical studies accounting for branching patterns have reported much higher vessel densities up to 30 vessels/mm<sup>2</sup> (Keller et al., 2011; Adams et al., 2015). Further investigations are warranted, taking into account critical sampling frequencies associated with vessel branching patterns (Duverney 1981), and achieving higher SNR through ultra-high B<sub>0</sub> MRI (Bolan et al., 2006; Harel et al., 2010; Kim et al., 2013) and utilize high-resolution single-plane sequences and prospective motion correction schemes to accurately characterize regional vessel densities. Such advancements hold promise for improving vessel quantification, classifications for veins and arteries and constructing detailed cortical surface maps of the vascular networks which may have diagnostic and neurosurgical utilities (Fig. 2A, B) (Iadecola, 2013; Qi and Roper, 2021; Sweeney et al., 2018).”

      During the revision we found a typo and corrected it in Supp. Fig. 8: Dosal -> Dorsal.

    1. Author response:

      The following is the authors’ response to the original reviews.

      In light of some reviewer comments requesting more clarity on the relationship between our model and prior theoretical studies of systems consolidation, we propose a modification to the title of our manuscript: “Selective consolidation of learning and memory via recall-gated plasticity.” We believe this title better reflects the key distinguishing feature of our model, that it selectively consolidates only a subset of memories, and also highlights the model’s applicability to task learning as well as memory storage.

      Major comments:

      Reviewer #3’s primary concern with the paper is the following: “The main weakness of the paper is the equation of recall strength with the synaptic changes brought about by the presentation of a stimulus. In most models of learning, synaptic changes are driven by an error signal and hence cease once the task has been learned. The suggested consolidation mechanism would stop at that point, although recall is still fine. The authors should discuss other notions of recall strength that would allow memory consolidation to continue after the initial learning phase.”

      We thank the reviewer for drawing attention to this issue, which primarily results from a poor that memories should be interpreted as actual synaptic weight updates,∆𝑤and thus in the context choice of notation on our part. Our decision to denote memories as gives the impression of supervised learning would go to zero when the task is learned. However, in the formalism of our model, memories are in fact better interpreted as target values of synaptic weights, and the synaptic model/plasticity rule is responsible for converting these target values into synaptic weight updates. We were unclear on this point in our initial submission, because our paper primarily considers binary synaptic weights, where target synaptic weights have a one-to-one correspondence with candidate synaptic weight updates. We have updated the paper to use w* to refer to memories, which we hope resolves this confusion, and have updated our introduction to the term “memory” to reflect their interpretation as target synaptic weight values. We have also updated the paper’s language to more clearly disambiguate between the “learning rule,” which determines how the memory vector (target synaptic weight vectors) are derived from task variables, and the “plasticity rule,” which governs how these are translated into actual synaptic weight updates. We acknowledge that our manuscript still does not explicitly consider a plasticity rule that is sensitive to continuous error error signals, as our analysis is restricted to binary weights. However, we believe that the updated notation and exposition makes it more clear that our model could be applied in such a case.

      Reviewer #1 brought up that our framework cannot capture “single-shot learning, for example, under fear conditioning or if a presented stimulus is astonishing.” Reviewer #2 raised a related question of how our model “relates to the opposite more intuitive idea, that novel surprising experiences should be stored in memory, as the familiar ones are presumably already stored.”

      We agree that the built-in inability to consolidate memories after a single experience is a limitation of our model, and that extreme novelty is one factor (among others, such as salience or reward) that might incentivize one-shot consolidation. We have added a comment to the discussion to acknowledge these points (added text in bold): “ Moreover, in real neural circuits, additional factors besides recall, such as reward or salience, are likely to influence consolidation as well. For instance, a sufficiently salient event should be stored in long-term memory even if encountered only once. Furthermore, while in our model familiarity drives consolidation, certain forms of novelty may also incentivize consolidation, raising the prospect of a non-monotonic relationship between consolidation probability and familiarity.” We agree that future work should address the combined influence of recall (as in our model) and other factors on the propensity to consolidate a memory.

      Reviewer #1 requested, “a comparison/discussion of the wide range of models on synaptic tagging for consolidation by various types of signals. Notably, studies from Wulfram Gerstner's group (e.g., Brea, J., Clayton, N. S., & Gerstner, W. (2023). Computational models of episodic-like memory in food-caching birds. Nature Communications, 14(1); and studies on surprise).”

      We thank the reviewer for the reference, which we have added to the manuscript. The model of Brea et al.(2023) is similar to that of Roxin & Fusi (2013), in that consolidation consists of “copying” synaptic weights from one population to another. As a result, just like the model of Roxin & Fusi (2013), this model does not provide the benefit that our model offers in the context of consolidating repeatedly recurring memories. However, the model of Brea et al. does have other interesting properties – for instance, it affords the ability to decode the age of a memory, which our model does not. We have added a comment on this point in the subsection of the Discussion tilted “Other models of systems consolidation.”

      Reviewer #2 noted, “While the article extensively discusses the strengths and advantages of the recall-gated consolidation model, it provides a limited discussion of potential limitations or shortcomings of the model, such as the missing feature of generalization, which is part of previous consolidation models. The model is not compared to other consolidation models in terms of performance and how much it increases the signal-to-noise ratio.”

      We agree that our work does not consider the notion of generalization and associated changes to representational geometry that accompany consolidation, which is the focus of many other studies on consolidation. We have further highlighted this limitation in the discussion. Regarding the comparison to other models, this is a tricky point as the desiderata we emphasize in this study (the ability to recall memories that are intermittently reinforced) is not the focus of other studies. Indeed, our focus is primarily on the ability of systems consolidation to be selective in which memories are consolidated, which is somewhat orthogonal to the focus of many other theoretical studies of consolidation. We have updated some wording in the introduction to emphasize this focus.

      Additional comments made by reviewer #1

      Reviewer #1 pointed out issues in the clarity of Fig. 2A. We have added substantial clarifying text to the figure caption.

      Reviewer #1 pointed out lack of clarity in our introduction to the terms “reliability” and “reinforcement.” We have now made it more clear what we mean by these terms the first time they are used.

      We have updated our definition of “recall” to use the term “recall factor,” which is how we refer to it subsequently in the paper.

      We have made explicit in the main text our simplifying assumption that memories are mean-centered.

      We have made consistent our use of “forgetting curve” and “memory trace”.

      Additional comments made by reviewer #2

      We have added a comment in the discussion acknowledging alternative interpretations of the result of Terada et al. (2021)

      We have significantly expanded the discussion of findings about the mushroom body to make it accessible to readers who do not specialize in this area. We hope this clarifies the nature of the experimental finding, which uncovered a circuit that performs a strikingly clean implementation of our model.

      The reviewer expresses concern that the songbird study (Tachibana et al., 2022) does not provide direct evidence for consolidation being gated by familiarity of patterns of activity. Indeed, the experimental finding is one-step removed from the direct predictions of our model. That said, the finding – that the rate of consolidation increases with performance – is highly nontrivial, and is predicted by our model when applied to reinforcement learning tasks. We have added a comment to the discussion acknowledging that this experimental support for our model is behavioral and not mechanistic.

      We do not regard it as completely trivial that the parallel LTM model performs roughly the same as the STM model, since a slower learning rate can achieve a higher SNR (as in Fig. 2C). Nevertheless we have added wording to the main text around Fig. 4B to note that the result is not too surprising.

      We have added a sentence that clarifies the goal / question of our paper earlier on in the introduction.

      We have updated Figure 3 by labeling the key components of the schematics and adding more detail to the legend, as suggested by the reviewer. We also reordered the figure panels as suggested.

      Additional comments made by reviewer #3:

      We have clarified in the main text that Fig. 2C and all results from Fig. 4 onward are derived from an ideal observer model (which we also more clearly define).

      We have now emphasized in the main text that the derivations of the recall factors for specific learning rules are derived in the Supplementary Information.

      We have highlighted more clearly in the main text that the recall factors associated with specific learning rules may correspond to other notions that do not intuitively correspond to “recall,” and have added a pointer to Fig. 3A where these interpretations are spelled out.

      We have added references corresponding to the types of learning rules we consider.

      The cutoffs / piecewise-looking behavior of plots in Fig. 4 are primarily the result of finite N, which limits the maximum SNR of the system, rather than coarse sampling of parameter values.

      Thank you for pointing out the error in the legend in Fig. 5D (also affected Supp Fig. S7/S8), which is now fixed.

      The reference to the nonexistence panel Fig. 5G has been removed.

      As the reviewer points out, the use of a binary action output in our reinforcement learning task renders it quite similar to the supervised learning task, making the example less compelling. In the revised manuscript we have updated the RL simulation to use three actions. Note also that in our original submission the network outputs represented action probabilities directly (which is straightforward to do for binary actions, but not for more than two available actions). In order to parameterize a policy when more than two actions are available, we sample actions using a softmax policy, as is more standard in the field and as the reviewer suggested. The associated recall factor is still a product of reward and a “confidence factor,” and the confidence factor is still the value of the network output in the unit corresponding to the chosen action, but in the updated implementation this factor is equal to , similar (though with a sign difference) to the reviewer’s suggestion. We believe these updates make our RL implementation and simulation more compelling, as it allows them to be applied to tasks with arbitrary numbers of actions.

      Additional minor comments

      The reviewers made a number of other specific line-by-line wording suggestions, typo corrections,

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      The study provides potentially fundamental insight into the function and evolution of daily rhythms. The authors investigate the function of the putative core circadian clock gene Clock in the cnidarian Nematostella vectensis. While it parts still incomplete, the evidence suggests that, in contrast to mice and fruit flies, Clock in this species is important for daily rhythms under constant conditions, but not under a rhythmic light/dark cycle, suggesting that the major role of the circadian oscillator in this species could be a stabilizing function under non-rhythmic environmental conditions.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this nice study, the authors set out to investigate the role of the canonical circadian gene Clock in the rhythmic biology of the basal metazoan Nematostella vectensis, a sea anemone, which might illuminate the evolution of the Clock gene functionality. To achieve their aims the team generated a Clock knockout mutant line (Clock-/- ) by CRISPR/Cas9 gene deletion and subsequent crossing. They then compared wild-type (WT) with Clock-/- animals for locomotor activity and transcriptomic changes over time in constant darkness (DD) and under light/dark cycles to establish these phenotypes under circadian control and those driven by light cycles. In addition, they used Hybridization Chain Reaction-In situ Hybridization (HCR-ISH) to demonstrate the spatial expression of Clock and a putative circadian clocl-controlled gene Myh7 in whole-mounted juvenile anemones.

      The authors demonstrate that under LD both WT and Clock-/- animals were behaviourally rhythmic but under DD the mutants lost this rhythmicity, indicating that Clock is necessary for endogenous rhythms in activity. With altered LD regimes (LD6:6) they show also that Clock is light-dependent. RNAseq comparisons of rhythmic gene expression in WT and Clock-/- animals suggest that clock KO has a profound effect on the rhythmic genome, with very little overlap in rhythmic transcripts between the two phenotypes; of the rhythmic genes in both LD and DD in WT animals (220- termed clock-controlled genes, CCGS) 85% were not rhythmic in Clock-/- animals in either light condition. In silico gene ontology (GO) analysis of CCGS reflected process associated with circadian control. Correspondingly, those genes rhythmic in KO animals under DD (here termed neoCCGs) were not rhythmic in WT, lacked upstream E-box motifs associated with circadian regulation, and did not display any GO enrichment terms. 'Core' circadian genes (as identified in previous literature) in WT and Clock-/- animals were only rhythmic under entrainment (LD) conditions whilst Clock-/- displayed altered expression profiles under LD compared to WT. Comparing CCGs with previous studies of cycling genes in Nematostellar, the authors selected a gene from 16 rhythmic transcripts. One of these, Myh7 was detectable by both RNAseq and HCR-ISH and considered a marker of the circadian clock by the authors.

      The authors claim that the study reveals insights into the evolutionary origin of circadian timing; Clock is conserved across distant groups of organisms, having a function as a positive regulator of the transcriptional translational feedback loop at the heart of daily timing, but is not a central element of the core feedback loop circadian system in this basal species. Their behavioural and transcriptomic data largely support the claims that Clock is necessary for endogenous daily activity but that the putative molecular circadian system is not self-sustained under constant darkness (this was known already for WT animals)- rather it is responsive to light cycles with altered dynamics in Clock-/- specimens in some core genes under LD. In the main, I think the authors achieved their aims and the manuscript is a solid piece of important work. The Clock-/- animal is a useful resource for examining time-keeping in a basal metazoan.

      The work described builds on other transcriptomic-based works on cnidaria, including Nematostellar, and does probe into the molecular underpinnings with a loss-of-function in a gene known to be core in other circadian systems. The field of chronobiology will benefit from the evolutionary aspect of this work and the fact that it highlights the necessity to study a range of non-model species to get a fuller picture of timing systems to better appreciate the development and diversity of clocks.

      Strengths:

      The generation of a line of Clock mutant Nematostellar is a very useful tool for the chronobiological community and coupled with a growing suite of tools in this species will be an asset. The experiments seem mostly well conceived and executed (NB see 'weaknesses'). The problem tackled is an interesting one and should be an important contribution to the field.

      Weaknesses:

      I think the claims about shedding light on the evolutionary origin of circadian time maintenance are a little bold. I agree that the data do point to an alternative role for Clock in this animal in light responsiveness, but this doesn't illuminate the evolution of time-keeping more broadly in my view. In addition, these are transcriptomic data and so should be caveated- they only demonstrate the expression of genes and not physiology beyond that. The time-course analysis is weakened by its low resolution, particularly for the RAIN algorithm when 4-hour intervals constrain the analysis. I accept that only 24h rhythms were selected in the analysis from this but, it might be that detail was lost - I think a preferred option would be 2 or 3-hour resolution or 2 full 24h cycles of analysis.

      The authors discount the possibility of the observed 12h rhythmicity in Clock-/- animals by exposing them to LD6:6 cycles before free-running them in DD. I suggest that LD cycles are not a particularly robust way to entrain tidal animals as far as we know. Recent papers show inundation/mechanical agitation are more reliable cues (Kwiatkowski ER, et al. Curr Biol. 2023, 2;33(10):1867-1882.e5. doi: 10.1016/j.cub.2023.03.015; Zhang L., et al Curr Biol. 2013, 23;19, 1863-1873 doi.org/10.1016/j.cub.2013.08.038.) and might be more effective in revealing endogenous 12h rhythms in the absence of 24h cues.

      Response: We removed the suggestion that we used 6:6h LD to perform tidal entrainment. We generated this ultradian light condition to address the 24h rhythmicity observed in the NvClk1-/- in 12:12h LD.

      Reviewer #2 (Public Review):

      This manuscript addresses an important question: what is the role of the gene Clock in the control of circadian rhythms in a very primitive group of animals: Cnidaria. Clock has been found to be essential for circadian rhythms in several animals, but its function outside of Bilaterian animals is unknown. The authors successfully generated a severe loss-of-function mutant in Nematostella. This is an important achievement that should help in understanding the early evolution of circadian clocks. Unfortunately, this study currently suffers from several important weaknesses. In particular, the authors do not present their work in a clear fashion, neither for a general audience nor for more expert readers, and there is a lack of attention to detail. There are also important methodological issues that weaken the study, and I have questions about the robustness of the data and their analysis. I am hoping that the authors will be able to address my concerns, as this work should prove important for the chronobiology field and beyond. I have highlighted below the most important issues, but the manuscript needs editing throughout to be accessible to a broad audience, and referencing could be improved.

      Major issues:

      (1) Why do the authors make the claim in the abstract that CLOCK function is conserved with other animals when their data suggest that it is not essential for circadian rhythms? dCLK is strictly required in Drosophila for circadian rhythms. In mammals, there are two paralogs, CLOCK and NPAS2, but without them, there are no circadian rhythms either. Note also that the recent claim of BMAL1-independent rhythms in mammals by Ray et al., quoted in the discussion to support the idea that rhythms can be observed in the absence of the positive elements of the circadian core clock, had to be corrected substantially, and its main conclusions have been disputed by both Abruzzi et al. and Ness-Cohn et al. This should be mentioned.

      Response: According to our Behavioral and Transcriptomic data, CLOCK function is conserved in constant light condition. In LD context, the rhythmicity is maintained probably by the light-response pathway in Nematostella. We modified our rhythmic transcriptomic analysis and considered the context of the contested results by Ray et al., and discussed it in the revised manuscript.

      (2) The discussion of CIPC on line 222 is hard to follow as well. How does mRNA rhythm inform the function of CIPC, and why would it function as a "dampening factor"? Given that it is "the only core clock member included in the Clock-dependent CCGs," (220) more discussion seems warranted. Discussing work done on this protein in mammals and flies might provide more insight.

      Response: The initial sentence was unclear. Furthermore, since we restricted our rhythmic analysis to genes only found rhythmic with a p<0.01 with RAIN combined with JTK, NvCipc was no longer defined as rhythmic in free running.

      (3) The behavioral arrhythmicity seen with their Clock mutation is really interesting. However, what is shown is only an averaged behavior trace and a single periodogram for the entire population. This leaves open the possibility that individual animals are poorly synchronized with each other, rather than arrhythmic. I also note that in DD there seem to be some residual rhythms, though they do not reach significance. Thus, it is also possible that at least some individual animals retain weak rhythms. The authors should analyze behavioral rhythms in individual animals to determine whether behavioral rhythmicity is really lost. This is important for the solidity of their main conclusions.

      Response: Fig. 1 has been modified. We have separated the data for WT and NvClk1-/- animals to provide clarity on the average behavior pattern for each genotype. While the LSP analysis on the population average informs us about the synchronization of the population, it is true that it does not provide insight into individual rhythmicity. To address this, we analyzed individuals in all conditions using the Discorhythm website (Carlucci et al., 2019).

      In the revised figure, we have included a comparison plot of the acrophase of 24-hour rhythmic animals between genotypes using Cosinor analysis, which is most suitable for acrophase detection. This plot indicates the number of animals detected as significantly rhythmic, providing direct visual input to the reader regarding individual rhythmicity. Additionally, we have added Table 1, which contains the Cosinor period analysis (24 and 12 hours) of individuals for all genotypes and conditions, further enhancing the clarity of our findings.

      (4) There is no mention in the results section of the behavior of heterozygotes. Based on supplement figure 2A, there is a clear reduction in amplitude in the heterozygous animals. Perhaps this might be because there is only half a dose of Clock, but perhaps this could be because of a dominant-negative activity of the truncated protein. There is no direct functional evidence to support the claim that the mutant allele is nonfunctional, so it is important to discuss carefully studies in other species that would support this claim, and the heterozygous behavior since it raises the possibility that the mutant allele acts as a dominant negative.

      Response: Extended Data Fig.1 modified. We show NvClk1+/- normalized locomotion over time in DD of the population, comparison of individual normalized behavior amplitude, LSP of the average population and individual acrophase of only rhythmic 24h individuals. Indeed, we cannot discriminate Dominant-negative from non-functional allele.

      (5) I do not understand what the bar graphs in Figure 2E and 3B represent - what does the y-axis label refer to?

      Response: Not relevant to the revised manuscript.

      (6a) I note that RAIN was used, with a p<0.05 cut-off. I believe RAIN is quite generous in calling genes rhythmic, and the p-value cut-off is also quite high. What happens if the stringency is increased, for example with a p<0.01.

      Response: We acknowledge your concern regarding the stringency of our statistical analysis. To address this, we opted to combine both RAIN and JTK methods and applied a more stringent p-value cut-off of p<0.01.

      (6b) It would be worth choosing a few genes called rhythmic in different conditions (mutant or wild-type. LD or DD), and using qPCR to validate the RNAseq results. For example, in Figure 3D, Myh7 RNAseq data are shown, and they do not look convincing. I am surprised this would be called a circadian rhythm. In wild-type, the curve seems arrhythmic to me, with three peaks, and a rather large difference between the first and second ZT0 time point. In the Clock mutants, rhythms seem to have a 12hr period, so they should not be called rhythmic according to the material and methods, which says that only ca 24hr period mRNA rhythms were considered rhythmic. Also, the result section does not say anything about Myh7 rhythms. What do they tell us? Why were they presented at all?

      Response: Regarding the suggestion for independent verification of our RNAseq results, we agree that such validation would enhance the robustness of our findings. To address this, we chose to overlap our identified rhythmic genes under WT LD conditions with those from another transcriptomic study that shared similarities in experimental design. Notably, the majority of overlapping rhythmic genes between the studies are candidate pacemaker genes. We believe that this replication of biologically significant rhythmic genes strengthens the validity and reliability of our results (see Extended Data Fig. 2).

      Furthermore, we have decided to remove the NvMhc-st (mistakenly named Myh7, only rhythmic in WT DD in the new analysis) as it does not contribute substantively to the revised version of the manuscript.

      (7) The authors should explain better why only the genes that are both rhythmic in LD and DD are considered to be clock-controlled genes (CCGs). In theory, any gene rhythmic in DD could be a CCG. However, Leach and Reitzel actually found that most genes in DD1 do not cycle the next day (DD2)? This suggests that most "rhythmic" genes might show a transient change in expression due to prolonged obscurity and/or the stress induced by the absence of a light-dark cycle, rather than being clock controlled. Is this why the authors saw genes rhythmic under both LD and DD as actual CCGs? I would suggest verifying that in DD the phase of the oscillation for each CCG is similar to that in LD. If a gene is just responding to obscurity, it might show an elevated expression at the end of the dark period of LD, and then a high level in the first hours of DD. Such an expression pattern would be very unlikely to be controlled by the circadian clock.

      Response: As we modified our transcriptomic analysis, we do no longer analyze LD+DD rhythmic genes, but any genes rhythmic (RAIN and JTK p<0.01) in each condition. As such we end up with four list of genes corresponding to each experimental conditions.

      (8) Since there are still rhythms in LD in Clock mutants, I wonder whether there is a paralog that could be taking Clock's place, similar to NPAS2 in mammals.

      Response: see response to (1) > The only NPAS2 orthologous identified in Nematostella NPAS3 showed marginally significance (p=0.013) with RAIN in LD WT suggesting a regulation similar to the candidate pacemaker genes. As such we included within our candidate pacemaker genes list.

      (9) I do not follow the point the authors try to make in lines 268-272. The absence of anticipatory behavior in Drosophila Clk mutants results from disruption of the circadian molecular clock, due to the loss of Clk's circadian function. Which light-dependent function of Clock are the authors referring to, then? Also, following this, it should be kept in mind that clock mutant mice have a weakened oscillator. The effect on entrainment is secondary to the weakening of the oscillator, rather than a direct effect on the light input pathway (weaker oscillators have increased response to environmental inputs). The authors thus need to more clearly explain why they think there is a conservation of circadian and photic clock function.

      Response: Following the changes in our statistical analysis we reframed the discussion and address directly the circadian and the photic clock function (we call it light-response pathway in the manuscript)

      Recommendations for the authors:

      We suggest the following improvements:

      (1) Please undertake a serious effort to make this work more accessible to non-marine chronobiologists. This includes better explanations, and schemes of the animal when images of staining are shown (e.g. Fig.1b) which include the labeling of relevant morphological structures mentioned in the text (like "tentacle endodermis and mesenteries" (line 132)). Similar issues for mentioned life cycle stages like "late planula stage" (line 133), "bisected physa" (line 149).

      Response: Fig. 1b, we outlined the animal shaped and added 2 arrows to locate the tentacle endodermis and mesenteries. We replaced the term late planula stage, by larvae. And we rephrased bisected physa by tissue sampling.

      Please attend to details. This includes:

      • Wrong referrals to figures (currently line 151 refers to EDF2- but should be EDF 1 instead, there is a Fig.3f mentioned in the text, but there is no such Fig.).

      Response: Fixed

      • Mentioning of ZTs when the HCR stainings were performed.

      Response: Fixed

      • Fig.1 a shows a rather incomplete and thus potentially confusing phylogenetic tree. Vertebrates have at least two Clk orthologs (NPAS2 and CLK), please include both, use an outgroup, and rout the tree.

      Response: Identifying NPAS2 and CLK orthologous in all species added more confusion into the conclusion. However, we followed the suggestion of adding an outgroup using a CLK orthologous sequence identified in the sponge Amphimedon queenslandica and rout the tree. Thank for the suggestion.

      • What do the y-axis labels in Figure 2E and 3B refer to exactly? Y-axis label annotations in Fig.3a,d are entirely missing- what do the numbers refer to?

      Response: not relevant in the revised manuscript

      • Fig.2D- is the Go term enrichment referring to LD or DD?

      Response: to DD. We made it cleared on the figure 5.

      • Wording: "Clock regulates genetic pathways." What is meant by "genetic pathways"? There are no "non-genetic pathways". Could one simply say: "Clock regulates a variety of transcripts".

      Response: We modified our threshold to use only p.adj<0.01, which reduced the GO term numbers. We removed “genetic pathways” and now address the specific pathways: cell-cycle and neuronal.

      The use of the term "epistatic" is confusing (line 219), i.e. that light is epistatic to Clock. In genetics, epistasis is defined as the effect of gene interactions on phenotypes. To a geneticist, this implies that there is a second gene impacting on the phenotype of the Clock mutants. Please re-word.

      Response: “light is epistatic on Clock” has been re-phrased.

      The provided Supplementary tables are not well annotated. Several of them need guess-work about what is shown. For instance, for Supplementary Table 1, the Ns are unclear, which in total can go up to almost 200 per condition-genotype, but only about 30 animals for each were tested. Thus, where do the high totals in the LSP table come from? What do the numbers of each periodicity mean? Initially one might assume it was the number of animals that showed a periodogram peak at a given periodicity, but it seems that cannot be. Maybe it counted any period bin over statistical significance? Please clarify with better descriptions and labels.

      Response: Supplementary tables are now clearly annotated on their first Tabs. About Fig.1, we already addressed this point in the public review.

      Albeit not essential, it would be more reader-friendly to also add a summary table with average period and SD, power and SD, and percentage rhythmicity to the main figure.

      Response: Table 1 is added: it contains individual count of rhythmic animals (24h and 12h) with Cosinor. However, using Discorhythm we had to ask for a specific Period. Thus, we can only provide animal count significant for a given period value. And not an estimation of their own period.

      (2) Some of the terminology is quite confusing, in particular the double meaning of the word "clock" (i.e the pacemaker and the transcription factor). This is not a specific problem to this manuscript, but it would be helpful for the readability to try to improve this.

      Could the gene/transcript/protein be spelled: clk and Clk?

      Alternatively, for clarity- how about talking about "core pacemaker genes," "CLOCK-dependent rhythmic genes" and "CLOCK-independent rhythmic genes"?

      Response:

      Clock/CLOCK > NvClk / NvCLK and the mutant is NvClk1-/-

      Core clock genes > candidate pacemaker genes.

      CLOCK-dependent CCG > this notion no longer exists in the revised manuscript.

      CLOCK-independent CCG > this notion no longer exists in the revised manuscript.

      (3) The dismissal of the 12h rhythmicity in Clock-/- animals is not really convincing and should be reconsidered. LD6:6 cycles (before free-running animals in DD) is likely a not particularly robust way to entrain tidal animals. Recent papers show inundation/mechanical agitation are more reliable cues (Kwiatkowski ER, et al. Curr Biol. 2023, 2;33(10):1867-1882.e5. doi: 10.1016/j.cub.2023.03.015; Zhang L., et al Curr Biol. 2013, 23;19, 1863-1873 doi.org/10.1016/j.cub.2013.08.038.) and might be more effective in revealing endogenous 12h rhythms in the absence of 24h cues.

      Response: We removed the proposition of using 6:6hLD as Tidal entrainment. Instead, the LD 6:6 experiment reveals the direct light-dependency of the NvClk1-/- mutant.

      (4) There are significant questions raised on the validity of BMAL1-independent rhythms in mammals as suggested by the Ray et al study. See DOI: 10.1126/science.abe9230 and DOI: 10.1126/science.abf0922

      These technical comments should also be taken into account and the discussion adjusted accordingly to better reflect the ongoing discussions in the chronobiology field.

      Response: We modified our rhythmic analysis. As we cannot use BHQ or adjusted p-value which resulted in very genes, we defined 24h-rhythmic genes if p<0.01 with two different algorithms (RAIN and JTK). We propose this compromise to reduce the risk of false-positive. Furthermore, we discussed our methodology in the light of the significant questions raised by these papers you cited. We thank the reviewer for this important point.

      (5) The HCR stainings for clk are not very convincing. Normally, HCR should have more dots. In principle, the logic of HCR is such that it detects individual mRNA molecules in the cell. Thus, having only one strong dot/cell like in Fig.1b doesn't make much sense.

      Response: We were the first surprised by this single dot signal. We are experienced users of HCRv.3 across different species. We decided to remove the close-up (for further investigations) but to keep the full animal signal. According to our approach it is a convincing signal. However, the doty nature of the signal itself it is not easy to make it highly visible at full scale animal on the picture. We did our best to show the mRNA signal visible without altering the pattern.

      Furthermore, the controls for the HCR in situ hybridization are unclear. In the methods, there are two Clock probes described (B3 & B5) and two control probes (B1 & B3), however, in the negative control image, a combination of one Clock (B1) and one control (B3) probes is used and is unclear what "redundant detection" means in the legend of figure S2.

      Response: Considering the nature of the signal (single of few dots), we decided to use two probes with 2 different fluorophores. A noise is by nature random. Our hypothesis was: only overlapping fluorescent dots are true signal of NvClk mRNA.

      For Control probes we used two zebrafish probes labelling hypothalamic peptides.

      Based on the experience with non-Drosophila, non-mouse animal model systems the reviewers assume that non-sense mediated mRNA decay (NMD) is not strongly initiated upon Crispr-induced premature STOP-codons. If this assumption is correct it would be worth to mention it. Alternatively, it would be worth testing if Nematostella induces NMD, as this would be a great control for the HCR and the mutation itself. At which ZT was the HCR done?

      Response: We performed the HCR at ZT10 when NvClk is described to be at peak. It is now indicated in the Fig. 1b. The RNAseq detected a higher quantity of NvClk1 mRNA in the NvClk1-/- (see Fig. 4a). mRNA quantity regulation involves transcription, stabilization, and degradation. At this stage, we cannot identify which specific step is affected.

      For Fig.1c- please provide the binding site and sequence in the figure, simply include EDF 1 in the main figure.

      Response: We generated a clear indication in the new Fig.1c and EDF. 1b about the protein domains, the CRISPR binding site and the consequences on the DNA and AA sequences.

      (6) Please provide the individual trace data for the behavioral analyses either as supplementary files or as a link to an openly accessible database like DRYAD (see also comment 7 in the public review of reviewer 2). Maybe this is what is shown in Supplementary Table 1, but it is really not clear what is actually shown.

      Response: Fig.1 is updated. Table 1 is added. Supplementary Table 1 contains individual normalized locomotor data of each polyps for each genotypes and light conditions. Supplementary Table 2 contains the cosinor individual rhythmic behavior analysis based on the Supplementary Table 1.

      (7) It is not really clear if the mutation is a true loss-of-function or could also be dominant negative. While this is raised in the discussion, it should be more carefully considered. The reason why a dominant negative would be unlikely is unclear. More specifically also see comment 8) in the public review of reviewer 2.

      Response: Indeed, the results cannot tell us if it is a true loss of function, a dominant negative or non-functional allele. We addressed it in the first part of the discussion.

      (8) The pretty small overlap of rhythmic transcripts in LD and DD could reflect the true biology of a more core clock driven-process under constant conditions and a more light-driven process under LD. But still- wouldn't one expect that similar processes should be rhythmic? If not, why not?

      It would certainly add strength to the data if for one or two transcripts these results were independently verified by qPCR from an independent sampling. This could even be done for just two time points with the most extreme differences.

      Response: We appreciate the reviewer's comments and concerns regarding the overlap of rhythmic transcripts in different conditions. In response to the reviewer's query, we revised our interpretation of the transcriptomic data, acknowledging the limited overlap between light and genotype conditions in our study. This prompted us to reconsider the underlying biological processes driving rhythmic gene expression under constant conditions versus light-dark cycles.

      Regarding the suggestion for independent verification of our RNAseq results, we agree that such validation would enhance the robustness of our findings. To address this, we chose to overlap our identified rhythmic genes under WT LD conditions with those from another transcriptomic study that shared similarities in experimental design. Notably, the majority of overlapping rhythmic genes between the studies are candidate pacemaker genes. We believe that this replication of biologically significant rhythmic genes strengthens the validity and reliability of our results (see Extended Data Fig. 2).

      (9) Expression of myh7 : Checking for co-expression should be pretty straightforward by HCR. This is what this type of staining technique is really good for. Please do clk and myh7 co-staining if you want to claim co-expression. Otherwise don't make such a claim.

      Response: We agree that checking for co-expression should be straightforward by HCR. However, due to time constraints during the revision period, we are unable to conduct the double in-situ experiment. Additionally, upon careful consideration, we recognize that including myhc-st (mistakenly named myh7) staining and co-expression analysis would not significantly contribute to the main conclusions of our study. Therefore, we have decided to remove this analysis from the revised manuscript.

      (10) Missing methodological details:

      • The false discovery rate for each analysis should be included (see Hughes et al.,: "Guidelines for Genome-Scale Analysis of Biological Rhythms," 2017).

      Response: THE FDR is indicated for each gene in supplementary table 3

      • Fig.1f- continuous light- please provide a spectrum (If there is no good spectrophotometer available, please provide at least manufacturer information.

      Response: Unfortunately, we don’t have a good spectrophotometer available during the time of the revision. We added to the method the reference of the lamp. We found the light spectrum provided by the supplier. However, we did not add it to the revised manuscript.

      Author response image 1.

      Spectrum of the Aquastar t8

      Also, it would be easier for the reader, if the measurements of light intensity are provided in photons, because this is what the light receptors ultimately measure.

      Response: Modified.

      • Fig.2E- please add the consensus sequence used for circadian E-box vs. E-box to the figure.

      Response: In the revised manuscript Fig.4c, we show which E-box motifs we extracted for our promoter analysis. We as well changed our analysis and did no longer use HOMER, but we directly extracted promoter sequences and looked for canonical Ebox CANNTG and Circadian Ebox CACGTG and generate a Circadian Ebox enrichment output per gene promoter.

      (11) There has been some discussion about the evolutionary statement as stated by the authors. It appears that depending on the background of the reader, this can be misunderstood. We thus suggest to more clearly point out where the author thinks there is evolutionary conservation (a function for clk in the circadian oscillator under constant light or dark conditions) versus where there is no apparent evolutionary conservation (the situation under light-dark conditions).

      Response: In the revised manuscript we proposed a conserved function of NvCLK in constant darkness, and a light-response pathway compensating in LD conditions in the mutant.

      Please also consider the major comments 8 and 9 of the common review from reviewer 2.

      Reviewer #1 (Recommendations For The Authors):

      The hybridization chain-reaction ISH is OK but, I'm not sure I understand the control condition-this should be clarified. I would also welcome the use of Clock-/- animals in HCR as another, more direct level of control. In addition, the authors state that the Myh7 probes hybridise in anatomical regions resembling those for Clock (Fig 3e). It would be better to duplex these two probe sets with different fluors for a better representation of the relative spatial distributions of each transcript.

      Response: We agree that checking for co-expression should be straightforward by HCR. However, due to time constraints during the revision period, we are unable to conduct the double in-situ experiment. Additionally, upon careful consideration, we recognize that including myhc-st (mistakenly named myh7) staining and co-expression analysis would not significantly contribute to the main conclusions of our study. Therefore, we have decided to remove this analysis from the revised manuscript.

      We clarified in the methods the control probes design.

      Minor points:

      Figure legends do not all convey sufficient detail. For instance, Figure 1c needs a better explanation. Figure 3e- are these images both WT? Fig 3f doesn't exist and other figure text references do not align with figures and need an overhaul.

      Response: All errors have been fixed.

      Reviewer #2 (Recommendations For The Authors):

      Major issues:

      (1) The authors need to introduce their model system better for a broad audience. What are the tissues/cells that express Clock at a higher level? What is their function, does this provide a potential explanation for their specific Clock expression, and how CLOCK might regulate behavior? Terms such as "tentacle endodermis and mesenteries" (line 132), "late planula stage" (line 133), "bisected physa" (line 149) would need some explanation.

      Response: We modified term such as planula to larvae, and bisected physa to tissue samples.

      2) Some of the terminology used is quite confusing, because of the double-meaning of the word "clock" (i.e the pacemaker and the transcription factor). The authors use terms such as "clock-controlled genes", "core clock genes", "CLOCK-dependent clock-controlled genes", "neo-clock-controlled genes". Is there any way to help the reader? Here are several suggestions: "core pacemaker genes," "CLOCK-dependent rhythmic genes" and "CLOCK-independent rhythmic genes".

      Response: all the terminology has been clarified, see previous comments

      3) Also in the abstract, there is mention of "hierarchal light- and Clock-signaling" (52-3) - is this related to the statement on line 219 that light is epistatic to Clock? I do not quite understand what epistatic would mean here. Who is upstream of whom? LD modifies rhythmicity in Clock mutant animals, but Clock mutations also impact rhythmicity in LD. Also, as epistasis is defined as the effect of gene interactions on phenotypes - what is the secondary gene impacting the phenotype of the Clock mutants? I am not sure the term epistatic is appropriate in the present context.

      Response: Indeed, Epistatic is a genetic term which might be unclear in this context. We removed it.

      4) The control for the in situ hybridization is unclear. In the methods, there are two Clock probes described (B3 & B5) and two control probes (B1 & B3), however, in the negative control image, a combination of one Clock (B1) and one control (B3) probe is used, I am not sure what "redundant detection" means in the legend of figure S2. Also, the sequences of each Clock probe should be provided. It might be worth testing the Clock mutant the authors generated. Clock mRNA could be reduced due to non-sense, mediated RNA decay, since the mutation causes a premature stop codon. This would be a great additional control for the in situ hybridization. Even better would be if, by chance, the probes target the mutated sequence. The signal should then be completely lost.

      Response: HCR is a tilling probe. Which means the target transcript is covered by dozens of successive DNA sequence “primer-like” which allow the HCRv.3 technology. We cannot design a mutant probe specific with this technology.

      (5) I have concerns with rhythmic-expression calls, particularly as there is so little overlap between LD and DD, and that a completely different set of rhythmic genes is observed in Clock mutant and wild-type animals. I am not an expert in whole-genome expression studies, so I hope one of my colleague reviewers can weigh in.

      When describing rhythmicity analysis in the Methods, it states that Benjamini-Hochberg corrections were applied to account for multiple comparisons. However, the false discovery rate for each analysis should be included (see Hughes et al.,: "Guidelines for Genome-Scale Analysis of Biological Rhythms," 2017).

      Response: As explained before we cannot used Benjamini-Hochberg corrections as only few genes (mostly oscillator gene pass the threshold). As such we combined two different algorithms (RAIN and JTK) with a p<0.01 to detect confidently rhythmic genes while reducing the risk of false-positives.

      Minor issues:

      (1) Environmental inputs are not "circadian", as written in the title.

      Response: Title modified

      (2) In the abstract, the description of the Clock mutant behavioral phenotypes is hard to follow, with no mention of whether or not Clock mutant animals are behaviorally rhythmic or arrhythmic in constant conditions.

      Response: corrected

      (3) Abstract: A 6/6 h LD cycle is not a compressed tidal cycle as written in the abstract. Light is not an input to tidal rhythms.

      Response: corrected

      (4) Line 101: timeout is not a core clock gene in animals.

      Response: we removed it from the candidate pacemaker genes.

      (5) What is the evidence for the role of PAR-Zip proteins in the Nematostella clock? The reference provided does not mention those.

      Response: There is no functional data in Nematostella yet to support their role within the pacemaker. However based on their rhythmicity in LD and protein conservation, we included them within the candidate pacemaker genes list. The refences have been corrected.

      (6) Line 125. should refer to Fig 1C when describing the Clock protein.

      Response: corrected

      (7) Line 143-4. based on the figure, the region targeted by gRNA was not "close to the 5' end" as stated, it is closer to the middle of the gene sequence as shown in Figure 1C. A more accurate description would be a region in between the PAS domains.

      Response: Indeed we modified the figure and the text.

      (8) Line 150. The mutant allele is described as Clock1 initially, then for the rest of the paper as Clock-. SInce it is not clear that the allele is a null (see major comment #8), Clock1 should be used throughout the manuscript.

      Response: the allele is named NvClk1 in the revised manuscript

      (9) Figure 2A, the second CT/ZT0 is misplaced.

      Response: Fig. 2 modified in the revised manuscript

      (10) Figure legend for 2E and 3B. "The 1000bp upstream ATG" is unclear. I guess it means that 1000bp upstream of the putative initiation codon was used.

      Response: Right, and in the revised version we analyzed 5kb upstream the putative ATG.

      (11) Line 164. The authors write "We discovered..." , but wasn't it already known that these animals are behaviorally rhythmic?

      Response: Fixed

      (12) It would be worth mentioning in the results section the reduced amplitude of rhythms in LL compared to DD (in WT and seemingly also in Clock mutants).

      Response: Indeed, we observed a significant reduction in the mean amplitude in the NvClk1-/- in DD and LL compared WT and NvClk1-/- in LD, DD and LL. However, as rhythmicity is lost by virtually all mutants in LL and DD we do not think these results add to the current interpretation of the gene function.

      (13) Please correct the figure numbers in the main text, there are several mistakes.

      Response: Done

      (14) Line 196, most genes in the quoted study did not cycle on day 2, so whether they are truly clock controlled is questionable.

      Response: We agree, identifying free-running cycling genes in cnidarian remains a challenge to overcome. One of the limitations of this study was to detect rhythmic genes in LD which conserved rhythmicity in DD. However, considering different transcriptomic studies (cited in the discussion) it seems that in the cnidaria phyla rhythmic genes in LD are not necessarily the one we identified rhythmic in DD.

      (15) Line 204-206 needs to be rephrased. It is confusing.

      Response: rephrased

      (16) Line 216. Rephrase to something like: "A similar finding was made for."

      Response: rephrased

      (17) "Clock regulates genetic pathways" sounds quite odd. Do you mean it regulates preferentially specific genetic (or maybe better, molecular) pathways?

      Response: rephrased

      (18) Figure 4 and legend: Dashed lines indicating threshold are missing. Do the black and red dots represent WT and Clock-/-, as indicated in the legend, or up/down, as indicated in the figures?

      Response: Fig.5 modified accordingly. Colors in the Volcano plot indicate Up- (black) versus Down- (red) regulated. It is now coherent within the figure.

      (19) Legend for Extended figure 1. "Immature peptide sequence" is incorrect.

      Response: rephrased

      (20) Extended data Figure 4. What the asterisks labels is unclear.

      Response: EDF4 was modified and become EDF2 with different content. The * indicates NvClk mRNA

      (21) Line 228. Gene "isoforms". I guess the authors mean "paralogs".

      Response: corrected.

      (22) Line 232-3/Figure 3e. Please include a comparable image of the Clk ISH to facilitate the comparison of the spatial expression pattern. In addition, where and what is the "analysis" referred to - "the spatial expression pattern of Myh7 closely resembled that of Clock, as evidenced by our analysis"?

      Response: the analysis has been removed from the revised manuscript because we currently cannot perform the double ish.

      (23) Line 282-3. As mentioned above, it is difficult to be sure that circadian behavior is lost, if only looking at a population of animals.

      Response: Fig.1 corrected

      (24) Line 301-5. Rephrase.

      Response: Rephrased

      (25) Line 325. I am not convinced that the author can say that their mutant is amorphic. See Major comment 8.

      Response: corrected.

      (26) Line 351 "simplifying interactions with the environment". Please explain what is meant here.

      Response: this confusing sentence has been removed from the revised manuscript

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this work, the authors investigate the functional difference between the most commonly expressed form of PTH, and a novel point mutation in PTH identified in a patient with chronic hypocalcemia and hyperphosphatemia. The value of this mutant form of PTH as a potential anabolic agent for bone is investigated alongside PTH(1-84), which is a current anabolic therapy. The authors have achieved the aims of the study. Their conclusion that this suggests a "new path of therapeutic PTH analog development" seems unfounded; the benefit of this PTH variant is not clear, but the work is still interesting.

      The work does not identify why the patient with this mutation has hypocalcemia and hyperphosphatemia; this was not the goal of the study, but the data is useful for helping to understand it.

      Thank you for your valuable feedback. In this study, we confirmed that <sup>R25C</sup>PTH can form a dimer, and our in vivo experiments in the mouse model demonstrated that dimeric <sup>R25C</sup>PTH can stimulate bone formation similarly to normal PTH. Furthermore, patients with the <sup>R25C</sup>PTH mutation, who have been exposed to high levels of this variant over an extended period, were reported to have high bone mineral density. Based on these observations, we hypothesized that dimeric <sup>R25C</sup>PTH might have potential as a new therapeutic PTH analog, particularly as a bone anabolic agent. However, we acknowledge that it is premature to make definitive claims regarding its therapeutic utility. Thus, we are currently conducting follow-up research to further investigate the subsignaling pathway changes induced by dimeric <sup>R25C</sup>PTH and their impact on bone metabolism.

      Moreover, to fully understand the patient’s symptoms, it is crucial to determine the form in which <sup>R25C</sup>PTH exists in vivo. While our in vitro experiments demonstrated that <sup>R25C</sup>PTH is secreted primarily in its dimeric form, we do not yet know whether this dimeric structure is maintained in vivo. We are actively conducting experiments to analyze the circulating form of <sup>R25C</sup>PTH in patients through blood sample collection (Andersen et al., 2022; Lee et al., 2015). Should the mutation predominantly exist in its monomeric form in vivo, this would align with clinical findings reported by Lee et al. (2015), which could help explain the patient’s hypocalcemia and hyperphosphatemia. However, if <sup>R25C</sup>PTH primarily exists in its dimeric form, additional research will be necessary to uncover the underlying mechanisms. Based on our experimental results, the dimeric <sup>R25C</sup>PTH exhibits a reduced binding affinity to PTH1R compared to the monomeric form. Furthermore, our in vitro experiments revealed that dimeric <sup>R25C</sup>PTH induces lower levels of cAMP production upon PTH1R activation. Accordingly, we can assume that this reduction in receptor signaling is likely to account for the impaired regulation of calcium and phosphate in patients with the mutation. However, despite this diminished signaling in calcium and phosphate homeostasis, dimeric <sup>R25C</sup>PTH was still capable of promoting bone formation at levels comparable to wild-type PTH. This apparent paradox warrants further investigation, and we are actively pursuing studies to elucidate how the dimeric form exerts its effects on bone metabolism.

      References

      Andersen, S. L., Frederiksen, A. L., Rasmussen, A. B., Madsen, M., & Christensen, A. R. (2022). Homozygous missense variant of PTH (c.166C>T, p.(Arg56Cys)) as the cause of familial isolated hypoparathyroidism in a three-year-old child. J Pediatr Endocrinol Metab, 35(5), 691-694. https://doi.org/10.1515/jpem-2021-0752

      Lee, S., Mannstadt, M., Guo, J., Kim, S. M., Yi, H. S., Khatri, A., Dean, T., Okazaki, M., Gardella, T. J., & Juppner, H. (2015). A Homozygous [Cys25]PTH(1-84) Mutation That Impairs PTH/PTHrP Receptor Activation Defines a Novel Form of Hypoparathyroidism. J Bone Miner Res, 30(10), 1803-1813. https://doi.org/10.1002/jbmr.2532

      Strengths:

      The work is novel, as it describes the function of a novel, naturally occurring, variant of PTH in terms of its ability to dimerise, to lead to cAMP activation, to increase serum calcium, and its pharmacological action compared to normal PTH.

      Weaknesses:

      (1) The use of very young, 10 week old, mice as a model of postmenopausal osteoporosis remains a limitation of this study, but this is now quite clearly described as a limitation, including justifying the use of the primary spongiosa as a measurement site.

      We appreciate the reviewer’s comment.

      (2) Methods have been clarified. It is still necessary to properly define the micro-CT threshold in mm HA/cc^3. I think it might be acat about 200mg HA/cc^3 in this study.

      Thank you for your insightful comment. To address this, we utilized hydroxyapatite (HA) phantom with HA content ranging from 0 to 1200 mg/cm<sup>3</sup>, with calibration points at 0, 50, 200, 800, 1000, and 1200 mg CaHA/cm<sup>3</sup>, to measure grayscale values via µ-CT. Based on these measurements, the trabecular bone BMD in our study was determined to range from 100 to 200 mg/cm<sup>3</sup>.

      Author response image 1.

      (3) The apparent contradiction between the cortical thickness data (where there is no difference between the two PTH formulations) and the mechanical testing data (where there is a difference) remains unresolved. It is still not clear whether there is a material defect in the bone, which can be partially assessed by reporting the 3-point bending test, corrected for the diameters of the bone (i.e. as stress / strain curves).

      Thank you for your comment. First, we ensured that the bones sampled during the experiment showed no defects, and we carefully separated the femur bones from the mice to preserve their integrity. In the 3-point bending test, PTH treatment significantly increased the maximum load of the femur bone compared to the OVX-control group. Additionally, the maximum load in the PTH treatment group was significantly greater than that observed in the PTH dimer group. Furthermore, structural factors influencing bone strength, such as the perosteal perimeter and the endocortical bone perimeter, were also increased in the PTH treatment group compared to the PTH dimer group (data only for reviewer).

      Author response image 2.

      (4) It is also puzzling that both dimeric and monomeric PTH lead to a reduction in total bone area (cross sectional area?). This would suggest a reduction in bone growth. This should be discussed in the work.

      In our experiment, the data showed an increase in cortical bone area in the PTH treatment group, but not in the PTH dimer treatment group. However, both dimeric and monomeric PTH treatments resulted in a reduction in total tissue area. We added revised sentence in page 13 line 317 and page 14 line 333 as follows:

      “In addition, the data showed an increase in cortical bone area (Ct.Ar) in the PTH treatment group but not in the PTH dimer treatment group. However, both dimeric and monomeric PTH treatments reduced total tissue area (Tt.Ar), suggesting potential effects on bone growth in the width of mice or humans.”

      “This study has several limitations. First, it is urgently necessary to determine whether dimeric <sup>R25C</sup>PTH is present in human patient serum. Second, TRAP staining showed an inhibitory effect of PTH treatment on the primary spongiosa area. However, the secondary spongiosa, which more accurately reflects bone remodeling (55), was not examined due to the barely detectable bone in this area in OVX-induced osteoporosis mouse models. Third, it is unclear whether similar bone phenotypes exist between human <sup>R25C</sup>PTH patients and dimeric <sup>R25C</sup>PTH-treated mice, particularly regarding low bone strength. Although the dimeric <sup>R25C</sup>PTH-treated group showed higher cortical BMD compared to WT-Sham or PTH groups, there was no difference in bone strength compared to the osteoporotic mouse model. Fourth, our study demonstrated that PTH or <sup>R25C</sup>PTH treatment decreased circumferential length, which could affect bone growth in width. However, whether this phenotype is also observed in patients treated with PTH or <sup>R25C</sup>PTH remains uncertain.”

    1. Author Response

      The following is the authors’ response to the original reviews.

      We would like to thank the reviewers for their insightful comments and recommendations. We have extensively revised the manuscript in response to the valuable feedback. We believe the results is a more rigorous and thoughtful analysis of the data. Furthermore, our interpretation and discussion of the findings is more focused and highlights the importance of the circuit and its role in the response to stress. Thank you for helping to improve the presented science.

      Key changes made in response to the reviewers comments include:

      • Revision of statistical analyses for nearly all figures, with the addition of a new table of summary statistics to include F and/or t values alongside p-values.

      • Addition of statistical analyses for all fiber photometry data.

      • Examination of data for possible sex dependent effects.

      • Clarification of breeding strategies and genotype differences, with added details to methods to improve clarity.

      • Addressing concerns about the specificity of virus injections and the spread, with additional details added to methods.

      • Modification of terminology related to goal-directed behavior based on reviewer feedback, including removal of the term from the manuscript.

      • Clarification and additional data on the use of photostimulation and its effects, including efforts to inactivate neurons for further insight, despite technical challenges.

      • Correction of grammatical errors throughout the manuscript.

      Reviewer 1:

      Despite the manuscript being generally well-written and easy to follow, there are several grammatical errors throughout that need to be addressed.

      Thank you for highlighting this issue. Grammatical errors have been fixed in the revised version of the manuscript.

      Only p values are given in the text to support statistical differences. This is not sufficient. F and/or t values should be given as well.

      In response to this critique and similar comments from Reviewer 2, we re-evaluated our approach to statistical analyses and extensively revised analyses for nearly all figures. We also added a new table of summary statistics (Supplemental Table 1) containing the type of analysis, statistic, comparison, multiple comparisons, and p value(s). For Figures 4C-E, 5C, 6C-E, 7H-I, and 8H we analyzed these data using two-way repeated measures (RM) ANOVA that examined the main effect of time (either number of sessions or stimulation period) in the same animal and compared that to the main effect of genotype of the animal (Cre+ vs Cre-), and if there was an interaction. For Supplemental Figure 7A we also conducted a two-way RM ANOVA with time as a factor and activity state (number of port activations in active vs inactive nose port) as the other in Cre+ mice. For Figures 5D-E we conducted a two-way mixed model ANOVA that accounted and corrected for missing data. In figures that only compared two groups of data (Figures 5F-L, 6F, 8C-D, 8I, and Supp 6F-G) we used two-tailed t-test for the analysis. If our question and/or hypothesis required us to conduct multiple comparisons between or within treatments, we conducted Bonferroni’s multiple comparisons test for post hoc analysis (we note which groups we compared in Supplemental Table 1). For figures that did or did not show a change in calcium activity (Figure 3G, 3I-K, 7B, 7D-E, 8E-F), we compared waveform confidence intervals (Jean-Richard-Dit-Bressel, Clifford, McNally, 2020). The time windows we used as comparison are noted in Supplemental Table 1, and if the comparisons were significant at 95%, 99%, and 99.9% thresholds.

      None of prior comparisons in prior analyses that were significant were found to have fallen below thresh holds for significance. Of those found to be not significantly different, only one change was noted. In Figure 6E there was now a significant baseline difference between Cre+ and Cre- mice with Cre- mice taking longer to first engage the port compared to Cre+ mice (p=0.045). Although the more rigorous approach the statistical analyses did not change our interpretations we feel the enhanced the paper and thank the reviewer for pushing this improvement.

      Moreover, the fibre photometry data does not appear to have any statistical analyses reported - only confidence intervals represented in the figures without any mention of whether the null hypothesis that the elevations in activity observed are different from the baseline.

      This is particularly important where there is ambiguity, such as in Figure 3K, where the spontaneous activity of the animal appears to correlate with a spike in activity but the text mentions that there is no such difference. Without statistics, this is difficult to judge.

      Thank you for highlighting this critical point and providing an opportunity to strengthen our manuscript. We added statistical analyses of all fiber photometry data using a recently described approach based on waveform confidence intervals (Jean-Richard-Dit-Bressel, Clifford, McNally, 2020). In the statistical summary (Supplemental Table 1) we note the time window that we used for comparison in each analysis and if the comparisons were significant at 95%, 99%, and 99.9% thresholds. Thank you from highlighting this and helping make the manuscript stronger.

      With respect to Figure 3K, we are not certain we understood the spike in activity the reviewer referred to. Figure 3J and K include both velocity data (gold) and Ca2+ dependent signal (blue). We used episodes of velocity that were comparable to the avoidance respond during the ambush test and no significant differences in the Ca2+ signal when gating around changes in velocity in the absence of stressor (Supplemental Table1). This is in contrast to the significant change in Ca2+ signal following a mock predator ambush (Figure 3J). We interpret these data together to indicate that locomotion does not correlate with an increase in calcium activity in SuMVGLUT2+::POA neurons, but that coping to a stressor does. This conclusion is further examined in supplemental Figure 5, including examining cross-correlation to test for temporally offset relationship between velocity and Ca2+ signal in SUMVGLUT2+::POA neurons.

      The use of photostimulation only is unfortunate, it would have been really nice to see some inactivation of these neurons as well. This is because of the well-documented issues with being able to determine whether photostimulation is occurring in a physiological manner, and therefore makes certain data difficult to interpret. For instance, with regards to the 'active coping' behaviours - is this really the correct characterisation of what's going on? I wonder if the mice simply had developed immobile responding as a coping strategy but when they experience stimulation of these neurons that they find aversive, immobility is not sufficient to deal with the summative effects of the aversion from the swimming task as well as from the neuronal activation? An inactivation study would be more convincing.

      We agree with the point of the reviewer, experiments demonstrating necessity of SUMVGLUT2+::POA neurons would have added to the story here. We carried out multiple experiments aimed at addressing questions about necessity of SuMVGLUT2+::POA neurons in stress coping behaviors, specifically the forced swim assay. Efforts included employing chemogenetic, optogenetic, and tetanus toxin-based methods. We observed no effects on locomotor activity or stress coping. These experiments are both technically difficult and challenging to interpret. Interpretation of negative results, as we obtained, is particularly difficult because of potential technical confounds. Selective targeting of SuMVGLUT2+::POA neurons for inhibition requires a process requiring three viral injections and two recombination steps, increasing variability and reducing the number of neurons impacted. Alternatively, photoinhibition targeting SuMVGLUT2+::POA cells can be done using Retro-AAV injected into POA and a fiber implant over SuM. We tried both approaches. Data obtained were difficult to interpret because of questions about adequate coverage of SuMVGLUT2+::POA population by virally expressed constructs and/or light spread arose. The challenge of adequate coverage to effectively prevent output from the targeted population is further confounded by challenges inherent in neural inhibition, specifically determining if the inhibition created at the cellular level is adequate to block output in the context of excitatory inputs or if neurons must be first engaged in a particular manner for inhibition to be effective. Baseline neural activity, release probability, and post-synaptic effects could all be relevant, which photo-inhibition will potentially not resolve. So, while the trend is to always show “necessary and sufficient” effects, we’ve tried nearly everything, and we simply cannot conclude much from our mixed results. There are also wellestablished problems with existing photo-inhibition methods, which while people use them and tout them, are often ignored. We have a lot of expertise in photo-inhibition optogenetics, and indeed have used it with some success, developed new methods, yet in this particular case we are unable to draw conclusions related to inhibition. People have experienced similar challenges in locus coeruleus neurons, which have very low basal activity, and inhibition with chemogenetics is very hard, as well as with optogenetic pump-based approaches, because the neurons fire robust rebound APs. We have spent almost 2.5 years trying to get this to work in this circuit because reviews have been insistent on this result for the paper to be conclusive. Unfortunately, it simply isn’t possible in our view until we know more about the cell types involved. This is all in spite of experience using the approach in many other publications.

      We also employed less selective approaches, such as injecting AAV-DIO-tetanus toxin light chain (Tettox) constructs directly into SuM VGLUT2-Cre mice but found off target effects impacting animal wellbeing and impeding behavioral testing due viral spread to surrounding areas.

      While we are disappointed for being unable to directly address questions about necessity of SuMVGLUT2+::POA neurons in active coping with experimental data, we were unable to obtain results allowing for clear interpretation across numerous other domains the reviewers requested. We also feel strongly that until we have a clear picture of the molecular cell type architecture in the SuM, and Cre-drivers to target subsets of neurons, this question will be difficult to resolve for any group. We are working now on RNAseq and related spatial transcriptomics efforts in the SuM and examining additional behavioral paradigm to resolve these issues, so stay tuned for future publications.

      Accordingly, we avoid making statements relating to necessity in the manuscript. In spite of having several lines of physiological data with strong robust correlations behavior related to the SuMVGLUT2+::POA circuit.

      Nose poke is only nominally instrumental as it cannot be shown to have a unique relationship with the outcome that is independent of the stimuli-outcome relationships (in the same way that a lever press can, for example). Moreover, there is nothing here to show that the behaviours are goal-directed.

      Thank you for highlighting this point. Regarding goal-direct terminology, we removed this terminology from the manuscript. Since the mice perform highly selective (active vs inactive) port activation robustly across multiple days of training the behavior likely transitions to habitual behavior. We only tested the valuation of stimuli termination of the final day of training with time limited progressive ratio test. With respect to lever press versus active port activation, we are unclear how using a lever in this context would offer a different interpretation. Lever pressing may be more sensitive to changes in valuation when compared to nose poke port activation (Atalayer and Rowland 2008); however, in this study the focus of the operant behavior is separating innate behaviors for learned action–outcome instrumental learned behaviors for threat response (LeDoux and Daw 2018). The robust highly selective activation of the active port illustrated in Figure 6 fits as an action–outcome instrumental behavior wherein mice learn to engage the active but not inactive port to terminate photostimulation. The first activation of the port occurs through exploration of the arena but as demonstrated by the number of active port activations and the decline in time of the first active port engagement, mice expressing ChR2eYFP learn to engage the port to terminate the stimulation. To aid in illustrating this point we have added Supplemental Figure 7 showing active and inactive port activations for both Cre+ and Cre- mice. This adds clarity to high rate of selective port activation driven my stimulation of SUMVGLUT2+::POA neurons compared to controls. The elimination of goal directed and providing additional data narrows and supports one of the key points of the operant experiment.

      With regards to Figure 1: This is a nice figure, but I wonder if some quantification of the pathways and their density might be helpful, perhaps by measuring the intensity of fluorescence in image J (as these are processes, not cell bodies that can be counted)? Mind you, they all look pretty dense so perhaps this is not necessary! However, because the authors are looking at projections in so-called 'stress-engaged regions', the amygdala seems conspicuous by its absence. Did the authors look in the amygdala and find no projections? If so it seems that this would be worth noting.

      This is an interesting question but has proven to be a very technically challenging question. We consulted with several leaders who routinely use complimentary viral tracing methods in the field. We were unable to devise a method to provide a satisfactorily meaningful quantitative (as opposed to qualitative) approach to compare SUMVGLUT2+::POA to SuMVGLUT2+ projections. A few limitations are present that hinder a meaningful quantitative approach. One limitation was the need for different viral strategies to label the two populations. Labeling SuMVGLUT2+::POA neurons requires using VGLUT2-Flp mice with two injections into the POA and one into SuM. Two recombinase steps were required, reducing efficiency of overlap. This combination of viral injections, particularly the injections of RetroAAVs in the POA, can induce significant quantitative variability due to tropism, efficacy, and variability of retro-viral methods, and viral infection generally. These issues are often totally ignored in similar studies across the “neural circuit” landscape, but it doesn’t make them less relevant here.

      Although people do this in the field, and show quantification, we actually believe that it can be a quite misleading read-out of functionally relevant circuitry, given that neurotransmitter release ultimately is amplified by receptors post-synaptically, and many examples of robust behavioral effects have been observed with low fiber tracing complimentary methods (McCall, Siuda et al. 2017). In contrast, the broader SuMVGLUT2+ population was labeled using a single injection into the SuM. This means there like more efficient expression of the fluorophore. Additionally, in areas that contain terminals and passing fibers understanding and interpreting fluorescent signal is challenging. Together, these factors limit a meaningful quantitative comparison and make an interpretation difficult to make. In this context, we focused on a conservative qualitative presentation to demonstrate two central points. That 1) SuMVGLUT2+::POA neurons are subset of SuMVGLUT2+ neurons that project to specific areas and that exclude dentate gyrus, and they 2) arborize extensively to multiple areas which have be linked to threat responses. We agree that there is much to be learned about how different populations in SuM connect to targets in different regions of the brain and to continue to examine this question with different techniques. A meaningful quantitative study comparing projections is technically complex and, we feel, beyond our ability for this study.

      Also, for the reasons above we do not believe that quantification provides exceptional clarity with respect to the putative function of the circuit, glutamate released, or other cotransmitters given known amplification at the post-synaptic side of the circuit.

      With regard to the amygdala, other studies on SuM projections have found efferent projections to amygdala (Ottersen, 1980; Vertes, 1992). In our study we were unable to definitively determine projections from SuMVGLUT2+::POA neurons to amygdala, which if present are not particularly dense. For this reason we were conservative and do not comment on this particular structure.

      I would suggest removing the term goal-directed from the manuscript and just focusing on the active vs. passive distinction.

      We removed the use of goal-directed. Thank you for helping us clarify our terminology.

      The effect observed in Figure 7I is interesting, and I'm wondering if a rebound effect is the most likely explanation for this. Did the authors inhibit the VGAT neurons in this region at any other times and observe a similar rebound? If such a rebound was not observed it would suggest that it is something specific about this task that is producing the behaviour. I would like it if the authors could comment on this.

      We agree that results showing the change in coping strategy (passive to active) in forced swim after but not during stimulation of SuMVGAT+ neurons is quite interesting (Figure 7I). This experiment activated SuMVGAT+ neurons during a section of the forced swim assay and mice showed a robust shift to mobility after the stimulation of SuMVGAT+ neurons stopped. We did not carry out inhibition of SuMVGAT+ neurons in this manuscript. As the reviewer suggested, strong inhibition of local SuM neurons, including SUMVGLUT2+::POA neurons, could lead to rebound activity that may shift coping behaviors in confusing ways. We agree this is an interesting idea but do not have data to support the hypothesis further at this time.

      Reviewer 2

      (1) These are very difficult, small brain regions to hit, and it is commendable to take on the circuit under investigation here. However, there is no evidence throughout the manuscript that the authors are reliably hitting the targets and the spread is comparable across experiments, groups, etc., decreasing the significance of the current findings. There are no hit/virus spread maps presented for any data, and the representative images are cropped to avoid showing the brain regions lateral and dorsal to the target regions. In images where you can see the adjacent regions, there appears expression of cell bodies (such as Supp 6B), suggesting a lack of SuM specificity to the injections.

      We agree with the reviewer that the areas studied are small and technically challenging to hit. This was one of driving motivations for using multiple tools in tandem to restrict the area targeted for stimulation. Approaches included using a retrograde AAVs to express ChR2eFYP in SUMVGLUT2+::POA neurons; thereby, restricting expression to VGLUT2+ neurons that project to the POA. Targeting was further limited by placement of the optic fiber over cell bodies on SuM. Thus, only neurons that are VGLUT2+, project to the POA, and were close enough to the fiber were active by photostimulation. Regrettably, we were not able to compile images from mice where the fiber was misplaced leading to loss of behavioral effects. We would have liked to provide that here to address this comment. Unfortunately, generating heat maps for injections is not possible for anatomic studies that use unlabeled recombinase as part of an intersectional approach. Also determining the point of injection of a retroAAV can be difficult to accurately determine its location because neurons remote to injection site and their processes are labeled.

      Experiments described in Supplemental Figure 6B on VGAT neurons in SuM were designed and interpreted to support the point that SUMVGLUT2+::POA neurons are a distinct population that does not overlap with GABAergic neurons. For this point it is important that we targeted SuM, but highly confined targeting is not needed to support the central interpretation of the data. We do see labeling in SuM in VGAT-Cre mice but photo stimulation of SuMVGAT+ neurons does not generate the behavioral changes seen with activation of SUMVGLUT2+::POA neurons. As the reviewer points out, SuM is small target and viral injection is likely to spread beyond the anatomic boundaries to other VGAT+ neurons in the region, which are not the focus here. The activation would be restricted by the spread of light from the fiber over SuM (estimated to be about a 200um sphere in all directions). We did not further examine projections or localization of VGAT+ neurons in this study but focused on the differential behavioral effects of SUMVGLUT2+::POA neurons.

      (2) In addition, the whole brain tracing is very valuable, but there is very little quantification of the tracing. As the tracing is the first several figures and supp figure and the basis for the interpretation of the behavior results, it is important to understand things including how robust the POA projection is compared to the collateral regions, etc. Just a rep image for each of the first two figures is insufficient, especially given the above issue raised. The combination of validation of the restricted expression of viruses, rep images, and quantified tracing would add rigor that made the behavioral effects have more significance.

      For example, in Fig 2, how can one be sure that the nature of the difference between the nonspecific anterograde glutamate neuron tracing and the Sum-POA glutamate neuron tracing is real when there is no quantification or validation of the hits and expression, nor any quantification showing the effects replicate across mice? It could be due to many factors, such as the spread up the tract of the injection in the nonspecific experiment resulting in the labeling of additional regions, etc.

      Relatedly, in Supp 4, why isn’t C normalized to DAPI, which they show, or area? Similar for G what is the mcherry coverage/expression, and why isn’t Fos normalized to that?

      Thank you for highlighting the importance of anatomy and the value of anatomy. Two points based on the anatomic studies are central to our interpretation of the experimental data. First, SUMVGLUT2+::POA are a distinct population within the SuM. We show this by demonstrating they are not GABAergic and that they do not project to dentate gyrus. Projections from SuM to dentate gyrus have been described in multiple studies (Boulland et al., 2009; Haglund et al., 1987; Hashimotodani et al., 2018; Vertes, 1992) and we demonstrate them here for SuMVGLUT2+ cells. Using an intersectional approach in VGLUT2-Flp mice we show SUMVGLUT2+::POA neurons do not project to dentate gyrus. We show cell bodies of SUMVGLUT2+::POA neurons located in SuM across multiple figures including clear brain images. Thus, SUMVGLUT2+::POA neurons are SuM neurons that do not project to dentate gyrus, are not GABAergic, send projections to a distinct subset of targets, most notably excluding dentate gyrus. Second, SUMVGLUT2+::POA neurons arborize sending projections to multiple regions. We show this using a combinatorial genetic and viral approach to restrict expression of eYFP to only neurons that are in SuM (based on viral injection), project to the POA (based on retrograde AAV injection in POA), and VGLUT2+ (VGLUT2-Flp mice). Thus, any eYFP labeled projection comes from SUMVGLUT2+::POA neurons. We further confirmed projections using retroAAV injection into areas identified using anterograde approaches (Supplemental Figure 2). As discussed above in replies to Reviewer 1, we feel limitations are present that preclude meaningful quantitative analysis. We thus opted for a conservative interpretation as outlined.

      Prior studies have shown efferent projections from SuM to many areas, and projections to dentate gyrus have received substantial attention (Bouland et al., 2009; Haglund, Swanson, and Kohler, 1984; Hashimotodani et al., 2018; Soussi et al., 2010; Vertes, 1992; Pan and McNaugton, 2004). We saw many of the same projections from SuMVGLUT2+ neurons. We found no projections from SUMVGLUT2+::POA neurons to dentate gyrus (Figure 2). Our description of SuM projection to dentate gyrus is not new but finding a population of neurons in SuM that does not project to dentate gyrus but does project to other regions in hippocampus is new. This finding cannot be explained by spread of the virus in the tract or non-selective labeling.

      (3) The authors state that they use male and female mice, but they do not describe the n’s for each experiment or address sex as a biological variable in the design here. As there are baseline sex differences in locomotion, stress responses, etc., these could easily factor into behavioral effects observed here.

      Sex specific effects are possible; however, the studies presented here were not designed or powered to directly examine them. A point about experimental design that helps mitigate against strong sex dependent effect is that often the paradigm we used examined baseline (pre-stimulation) behavior, how behavior changed during stimulation, and how behavior returned (or not) to baseline after stimulation. Thus, we test changes in individual behaviors. Although we had limited statistical power, we conducted analyses to examine the effects of sex as variable in the experiments and found no differences among males and females.

      (4) In a similar vein as the above, the authors appear to use mice of different genotypes (however the exact genotypes and breeding strategy are not described) for their circuit manipulation studies without first validating that baseline behavioral expression, habituation, stress responses are not different. Therefore, it is unclear how to interpret the behavioral effects of circuit manipulation. For example in 7H, what would the VGLUT2-Cre mouse with control virus look like over time? Time is a confound for these behaviors, as mice often habituate to the task, and this varies from genotype to genotype. In Fig 8H, it looks like there may be some baseline differences between genotypes- what is normal food consumption like in these mice compared to each other? Do Cre+ mice just locomote and/or eat less? This issue exists across the figures and is related to issues of statistics, potential genotype differences, and other experimental design issues as described, as well as the question about the possibility of a general locomotor difference (vs only stress-induced). In addition, the authors use a control virus for the control groups in VGAT-Cre manipulation studies but do not explain the reasoning for the difference in approach.

      Thank you for highlighting the need for greater clarity about the breeding strategies used and for these related questions. We address the breeding strategy and then move to address the additional concerns raised. We have added details to the methods section to address this point. For VGLUT2-Cre mice we use litter mates controls from Cre/WT x WT/WT cross. The VGLUT2-Cre line (RRID:IMSR_JAX:028863) (Vong L , et al. 2011) used here been used in many other reports. We are not aware of any reports indicating a phenotype associated with the addition of the IRES-Cre to the Slc17a6 loci and there is no expected impact of expression of VGLUT2. Also, we see in many of the experiments here that the baseline (Figures 4, 5, and 7) behaviors are not different between the Cre+ and Cre- mice. For VGAT-Cre mice we used a different breeding strategy that allowed us to achieve greater control of the composition of litters and more efficient cohorts cohort. A Cre/Cre x WT/WT cross yielded all Cre/WT litters. The AAV injected, ChR2eYFP or eYFP, allowed us to balance the cohort.

      Regarding Figure 7H, which shows time immobile on the second day of a swim test, data from the Cre- mice demonstrate the natural course of progression during the second day of the test. The control mice in the VGAT-Cre cohort (Figure 7I) have similar trend. The change in behavior during the stimulation period in the Cre+ mice is caused by the activation of SUMVGLUT2+::POA neurons. The behavioral shift largely, but not completely, returns to baseline when the photostimulation stops. We have no reason to believe a VGLUT2-Cre+ mouse injected with control AAV to express eYFP would be different from WT littermate injected with AVV expressing ChR2eYFP in a Cre dependent manner.

      Turning to concerns related to 8H, which shows data from fasted mice quantify time spent interacting with food pellet immediately after presentation of a chow pellet, we found no significant difference between the control and Cre+ mice. We unaware of any evidence indicating that the two groups should have a different baseline since the Cre insertion is not expected to alter gene expression and we are unaware of reports of a phenotype relating to feeding and the presence of the transgene in this mouse line. Even if there were a small baseline shift this would not explain the large abrupt shift induced by the photostimulation. As noted above, we saw shifts in behavior abruptly induced by the initiation of photostimulation when compared to baseline in multiple experiments. This shift would not be explained by a hypothetical difference in the baseline behaviors of litter mates.

      (5) The statistics used throughout are inappropriate. The authors use serial Mann-Whitney U tests without a description of data distributions within and across groups. Further, they do not use any overall F tests even though most of the data are presented with more than two bars on the same graph. Stats should be employed according to how the data are presented together on a graph. For example, stats for pre-stim, stim, and post-stim behavior X between Cre+ and Cre- groups should employ something like a two-way repeated measures ANOVA, with post-hoc comparisons following up on those effects and interactions. There are many instances in which one group changes over time or there could be overall main effects of genotype. Not only is serially using Mann-Whitney tests within the same panel misleading and statistically inaccurate, but it cherry-picks the comparisons to be made to avoid more complex results. It is difficult to comprehend the effects of the manipulations presented without more careful consideration of the appropriate options for statistical analysis.

      We thank the reviewer for pointing this out and suggesting alterative analyses, we agree with the assessment on this topic. Therefore, we have extensively revised the statical approach to our data using the suggested approach. Reviewer 1 also made a similar comment, and we would like to point to our reply to reviewer 1’s second point in regard to what we changed and added to the new statistical analyses. Further, we have added a full table detailing the statical values for each figure to the paper.

      Conceptual:

      (6) What does the signal look like at the terminals in the POA? Any suggestion from the data that the projection to the POA is important?

      This is an interesting question that we will pursue in future investigations into the roles of the POA. We used the projection to the POA from SuM to identify a subpopulation in SuM and we were surprised to find the extensive arborization of these neurons to many areas associated with threat responses. We focused on the cell bodies as “hubs” with many “spokes”. Extensive studies are needed to understand the roles of individual projections and their targets. There is also the hypothetical technical challenge of manipulating one projection without activating retrograde propagation of action potentials to the soma. At the current time we have no specific insights into the roles of the isolated projection to POA. Interpretation of experiments activating only “spoke” of the hub would be challenging. Simple terminal stimulation experiments are challenged by the need to separate POA projections from activation of passing fibers targeting more anterior structures of the accumbens and septum.

      (7) Is this distinguishing active coping behavior without a locomotor phenotype? For example, Fig. 5I and other figure panels show a distance effect of stimulation (but see issues raised about the genotype of comparison groups). In addition, locomotor behavior is not included for many behaviors, so it is hard to completely buy the interpretation presented.

      We agree with the reviewer and thank them for highlighting this fundamental challenge in studies examining active coping behaviors in rodents, which requires movement. Additionally, actively responding to threatening stressors would include increased locomotor activity. Separation of movement alone from active coping can be challenging. Because of these concerns we undertook experiments using diverse behavioral paradigms to examine the elicited behaviors and the recruitment of SuMVGLUT2+::POA neurons to stressors. We conducted experiments to directly examine behaviors evoked by photoactivation of SuMVGLUT2+::POA. In these experiments we observed a diversity of behaviors including increased locomotion and jumping but also treading/digging (Figure 4). These are behaviors elicited in mice by threatening and noxious stimuli. An Increase of running or only jumping could signify a specific locomotor effect, but this is not what was observed. Based on these behaviors, we expected to find evidence of increase movement in open field (Figure 5G-I) and light dark choice (Figure 5J-L) assays. For many of the assays, reporting distance traveled is not practical. An important set of experiments that argues against a generic increase in locomotion is the operant behavior experiments, which require the animal to engage in a learned behavior while receiving photostimulation of SuMVGLUT2+::POA neurons (Figure 6). This is particularly true for testing using a progressive ratio when the time of ongoing photostimulation is longer, yet animals actively and selectively engage the active port (Figure 6G-H). Further, we saw a shift in behavioral strategy induce by photoactivation in forced swim test (Figure 7H). Thus, activation of SUMVGLUT2+::POA neurons elicited a range of behaviors that included swimming, jumping, treading, and learned response, not just increased movement. Together these data strongly argue that SuMVGLUT2+::POA neurons do not only promote increased locomotor behavior. We interpret these data together with the data from fiber photometry studies to show SuMVGLUT2+::POA neurons are recruited during acute stressors, contribute to aversive affective component of stress, and promote active behaviors without constraining the behavioral pattern.

      Regarding genotype, we address this in comments above as well but believe that clarifying the use of litter mates, the extensive use of the VGLUT2-Cre line by multiple groups, and experimental design allowing for comparison to baseline, stimulation evoked, and post stimulation behaviors within and across genotypes mitigate possible concerns relating to the genotype.

      (8) What is the role of GABA neurons in the SuM and how does this relate to their function and interaction with glutamate neurons? In Supp 8, GABA neuron activation also modulates locomotion and in Fig 7 there is an effect on immobility, so this seems pretty important for the overall interpretation and should probably be mentioned in the abstract.

      Thank you for noting these interesting findings. We added text to highlight these findings to the abstract. Possible roles of GABAergic neurons in SuM extend beyond the scope of the current study particularly since SuM neurons have been shown to release both GABA and glutamate (Li Y, Bao H, Luo Y, et al. 2020, Root DH, Zhang S, Barker DJ et al. 2018). GABAergic neurons regulate dentate gyrus (Ajibola MI, Wu JW, Abdulmajeed WI, Lien CC 2021), REM sleep (Billwiller F, Renouard L, Clement O, Fort P, Luppi PH 2017), and novelty processing Chen S, He L, Huang AJY, Boehringer R et al. 2020). The population of exclusively GABAergic vs dual neurotransmitter neurons in SuM requires further dissection to be understood. How they may relate to SUMVGLUT2+::POA neurons require further investigation.

      Questions about figure presentation:

      (9) In Fig 3, why are heat maps shown as a single animal for the first couple and a group average for the others?

      Thank you for highlighting this point for further clarification. We modified the labels in the figure to help make clear which figures are from one animal across multiple trials and those that are from multiple animals. In the ambush assay each animal one had one trial, to avoid habituation to the mock predator. Accordingly, we do not have multiple trials for each animal in this test. In contrast, the dunk assay (10 trial/animal) and the shock (5 trials/animal) had multiple trials for each animal. We present data from a representative animal when there are multiple trials per animal and the aggerate data.

      Why is the temporal resolution for J and K different even though the time scale shown is the same?

      Thank you for noticing this error carried forward from a prior draft of the figure so we could correct it. We replaced the image in 3J with a more correctly scaled heatmap.

      What is the evidence that these signal changes are not due to movement per se?

      Thank you for the question. There are two points of evidence. First, all the 465 nm excitation (Ca2+ dependent) data was collected in interleaved fashion with 415 nm (isosbestic) excitation data. The isosbestic signal is derived from GCaMP emission but is independent of Ca2+ binding (Martianova E, Aronson S, Proulx CD. 2019). This approach, time-division multiplexing, can correct calcium-dependent for changes in signal most often due to mechanical change. The second piece of evidence is experimental. Using multiple cohorts of mice, we examined if the change in Ca2+ signal was correlated with movement. We used the threshold of velocity of movement seen following the ambush. We found no correlation between high velocity movements and Ca2+ signal (Figure 3K) including cross correlational analysis (Supplemental figure 5). Based on these points together we conclude the change in the Ca2+ signal in SUMVGLUT2+::POA neurons is not due to movement induced mechanical changes and we find no correlation to movement unless a stressor is present, i.e. mock predator ambush or forced swim. Further, the stressors evoke very different locomotor responses fleeing, jumping, or swimming.

      (10) In Fig 4, the authors carefully code various behaviors in mice. While they pick a few and show them as bars, they do not show the distribution of behaviors in Cre- vs Cre+ mice before manipulation (to show they have similar behaviors) or how these behaviors shift categories in each group with stimulation. Which behaviors in each group are shifting to others across the stim and post-stim periods compared to pre-stim?

      This is an important point. We selected behaviors to highlight in Figure4 C-E because these behaviors are exhibited in response to stress (De Boer & Koolhaas, 2003; van Erp et al., 1994). For the highlighted behaviors, jumping, treading/digging, grooming, we show baseline (pre photostimulation), stimulation, and post stimulation for Cre+ and Cre- mice with the values for each animal plotted. We show all nine behaviors as a heat map in Figure 4B. The panels show changes that may occur as a function of time and show changes induced by photostimulation.

      The heatmaps demonstrate that photostimulation of SUMVGLUT2+::POA neurons causes a suppression of walking, grooming, and immobile behaviors with an increase in jumping, digging/treading, and rapid locomotion. After stimulation stops, there is an increase in grooming and time immobile. The control mice show a range of behaviors with no shifts noted with the onset or termination of photostimulation.

      Of note, issues of statistics, genotype, and SABV are important here. For example, the hint that treading/digging may have a slightly different pre-stim basal expression, it seems important to first evaluate strain and sex differences before interpreting these data.

      We examined the effects of sex as a biological variable in the experiments reported in the manuscript and found no differences among males and females in any of the experiments where we had enough animals in each sex (minimum of 5 mice) for meaningful comparisons. We did this by comparing means and SEM of males and females within each group (e.g. Cre+ males vs Cre+ female, Cre- males vs Cre- females) and then conducted a t-test to see if there was a difference. For figures that show time as a variable (e.g Figure 6C-E), we compared males and females with time x sex as main factors and compared them (including multiple comparisons if needed). We found no significant main effects or interactions between males and females. Because of this, and to maximize statistical power, we decided to move forward to keep males and females together in all the analyses presented in the manuscript. It is worth noting also that the core of the experimental design employed is a change in behavior caused by photostimulation. The mice are also the same strain with only difference being the modification to add an IRES and sequence for Cre behind the coding sequence of the Slc17A6 (VGLUT2) gene.

      (11) Why do the authors use 10 Hz stimulation primarily? is this a physiologically relevant stim frequency? They show that they get effects with 1 Hz, which can be quite different in terms of plasticity compared to 10 Hz.

      Thank you for the raising this important question. Because tests like open field and forced swim are subject to habituation and cannot be run multiple times per animal a test frequency was needed to use across multiple experiments for consistency. The frequency of 10Hz was selected because it falls within the rate of reported firing rates for SuM neurons (Farrel et al., 2021; Pedersen et al., 2017) and based on the robust but sub maximal effects seen in the real-time place preference assays. Identification of the native firing rates during stress response would be ideal but gathering this data for the identified population remains a dauting task.

      (12) In Fig 5A-F, it is unclear whether locomotion differences are playing a role. Entrances (which are low for both groups) are shown but distance traveled or velocity are not.

      In B, there is no color in the lower left panel. where are these mice spending their time? How is the entirety of the upper left panel brighter than the lower left? If the heat map is based on time distribution during the session, there should be more color in between blue and red in the lower left when you start to lose the red hot spots in the upper left, for example. That is, the mice have to be somewhere in apparatus. If the heat map is based on distance, it would seem the Cre- mice move less during the stim.

      We appreciate the opportunity to address this question, and the attention to detail the reviewer applied to our paper. In the real time place preference test (RTPP) stimulation would only be provided while the animal was on the stimulation side. Mice quickly leave the stimulation side of the arena, as seen in the supplemental video, particularly at the higher frequencies. Thus, the time stimulation is applied is quite low. The mice often retreat to a corner from entering the stimulation side during trials using higher frequency stimulation. Changing locomotor activity along could drive changes in the number entrances but we did not find this. In regard to the heat map, the color scale is dynamically set for each of the paired examples that are pulled from a single trial. To maximize the visibility between the paired examples the color scale does not transfer between the trials. As a result, in the example for 10 Hz the mouse spent a larger amount of time in the in the area corresponding to the lower right corner of the image and the maximum value of the color scale is assigned to that region. As seen in the supplemental video, mice often retreated to the corner of the non-stimulation side after entering the stimulation side. The control animal did not spend a concentrated amount of time in any one region, thus there is a lack of warmer colors. In contrast the baseline condition both Cre+ and Cre- mice spent time in areas disturbed on both sides of arena, as expected. As a result, the maximum value in the heat map is lower and more area are coded in warmer colors allowing for easier visual comparison between the pair. Using the scale for the 10 Hz pair across all leads to mostly dark images. We considered ways to optimized visualization across and within pairs and focused on the within pair comparison for visualization.

      (13) By starting with 1 hz, are the experimenters inducing LTD in the circuit? what would happen if you stop stimming after the first epoch? Would the behavioral effect continue? What does the heat map for the 1 hz stim look like?

      Relatedly, it is a lot of consistent stimulation over time and you likely would get glutamate depletion without a break in the stim for that long.

      Thank you for the opportunity to add clarity around this point regarding the trials in RTPP testing. Importantly, the trials were not carried out in order of increasing frequency of stimulation, as plotted. Rather, the order of trials was, to the extent possible with the number of mice, counterbalanced across the five conditions. Thus, possible contribution of effects of one trial on the next were minimized by altering the order of the trials.

      We have added a heat map for the 1 Hz condition to figure 5B.

      For experiments on RTPP the average stimulation time at 10Hz was less than 10 seconds per event. As a result, the data are unlikely to be affected by possible depletion of synaptic glutamate. For experiments using sustained stimulation (open field or light dark choice assays) we have no clear data to address if this might be a factor where 10Hz stimulation was applied for the entire trial.

      (14) In Fig 6, the authors show that the Cre- mice just don't do the task, so it is unclear what the utility of the rest of the figure is (such as the PR part). Relatedly, the pause is dependent on the activation, so isn't C just the same as D? In G and H, why ids a subset of Cre+ mice shown?

      Why not all mice, including Cre- mice?

      Thank you for the opportunity to improve the clarity of this section. A central aspect of the experiments in Figure 6 is the aversiveness of SUMVGLUT2+::POA neuron photostimulation, as shown in Figure 5B-F. The aversion to photostimulation drives task performance in the negative reinforcer paradigm. The mice perform a task (active port activation) to terminate the negative reinforcer (photostimulation of SuMVGLUT2+::POA neurons). Accordingly, control mice are not expected to perform the task because SuMVGLUT2+::POA neurons are not activated and, thus the mice are not motivated to perform the task.

      A central point we aim to covey in this figure is that while SuMVGLUT2+::POA neurons are being stimulated, mice perform the operant task. They selectively activated the active port (Supplemental Figure 7). As expected, control mice activate the active port at a low level in the process of exploring the arena. This diminishes on subsequent trials as mice habituate to the arena (Figure 6D). The data in Figures 6 C and D are related but can be divergent. Each pause in stimulation requires a port activation of a FR1 test but the number of port activations can exceed the pauses, which are 10 seconds long, if the animal continues to activate the port. Comparing data in Figures 6 C and D revels that mice generally activated the port two to three times for each pause earned with a trend towards greater efficiency on day 4 with more rewards and fewer activations.

      The purpose of the progressive ratio test is to examine if photostimulation of SuMVGLUT2+::POA continues to drive behavior as the effort required to terminate the negative stimuli increases. As seen in Figures 6 G and H, the stimulation of SuMVGLUT2+::POA neurons remains highly motivating. In the 20-minute trial we did not find a break point even as the number of port activations required to pause the stimulation exceed 50. We do not show the Cre- mice is Figure 6G and H because they did not perform the task, as seen in Figure 6F. For technical reasons in early trials, we have fully timely time stamped data for rewards and port activations from a subset of the Cre+ mice. Of note, this contains both the highest and lowest performing mice from the entire data set.

      Taken together, we interpret the results of the operant behavioral testing as demonstrating that SuMVGLUT2+::POA neuron activation is aversive, can drive performance of an operant tasks (as opposed to fixed escape behaviors), and is highly motivating.

      (15) In Fig 7, what does the GCaMP signal look like if aligned to the onset of immobility? It looks like since the hindpaw swimming is short and seems to precede immobility, and the increase in the signal is ramping up at the onset of hindpaw swimming, it may be that the calcium signal is aligned with the onset of immobility.

      What does it look like for swimming onset?

      In I, what is the temporal resolution for the decrease in immobility? Does it start prior to the termination of the stim, or does it require some elapsed time after the termination, etc?

      Thank for the opportunity to addresses these points and improve that clarity of our interpretation of the data. Regarding aligning the Ca2+ signal from fiber photometry recordings to swimming onset and offset, it is important to note that the swimming bouts are not the same length. As a result, in the time prior to alignment to offset of behaviors animals will have been swimming for different lengths of time. In Figure 7 C, we use the behavioral heat map to convey the behavioral average. Below we show the Ca2+ dependent signal aligned at the offset of hindpaw swim for an individual mouse (A) and for the total cohort (B). This alignment shows that the Ca2+ dependent signal declines corresponding to the termination of hindpaw swimming. Because these bouts last less than the total the widow shown, the data is largely included in Figure 7 C and D, which is aligned to onset. Due to the nuance of the difference is the alignment and the partial redundancy, we elected to include the requested alignment to swimming offset in the reply rather in primary figure.

      Author response image 1.

      Turning to the question regarding swimming onset, the animals started swimming immediately when placed in the water and maintained swimming and climbing behaviors until shifting behaviors as illustrated in Figure 7A and B. During this time the Ca2+-dependent signal was elevated but there is only one trial per animal. This question can perhaps be better addressed in the dunk assay presented in Figure 3C, F and G and Supplemental Figure 4 H and I. Here swimming started with each dunk and the Ca2+ signal increased.

      Regarding the question for about figure 7I. We scored for entire periods (2 mins) in aggerate. We noted in videos of the behavior test that there was an abrupt decrease in immobility tightly corresponding to the end of stimulation. In a few animals this shift occurred approximately 15-20s before the end of stimulation. This may relate to the depletion of neurotransmitter as suggested by the reviewer.

      Reviewer 3

      Major points

      (1) Results in Figure 1 suggested that SuM-Vglu2::POA projected not only POA but also to the diverse brain regions. We can think of two models which account for this. One is that homogeneous populations of neurons in SuM-Vglu2::POA have collaterals and innervated all the efferent targets shown in Figure 1. Another is to think of distinct subpopulations of neurons projecting subsets of efferent targets shown in Figure 1 as well as POA. It is suggested to address this by combining approaches taken in experiments for Figure 1 and Supplemental Figure 2.

      Thank you for raising this interesting point. We have attempted combining retroAAV injections into multiple areas that receive projections from SUMVGLUT2+::POA neurons. However, we have found the results unsatisfactory for separating the two models proposed. Using eYFP and tdTomato expressing we saw some overlapping expressing in SuM. We are not able to conclude if this indicates separate populations or partial labeling of a homogenous populations. A third option seems possible as well. There could be a mix of neurons projecting to different combinations of downstream targets. This seems particularly difficult to address using fluorophores. We are preparing to apply additional methodologies to this question, but it extends beyond the scope of this manuscript.

      (2) Since the authors drew a hypothetical model in which the diverse brain regions mediate the effect of SuM-Vglu2::POA activation in behavioral alterations at least in part, examination of the concurrent activation of those brain regions upon photoactivation of SuM-Vglu2::POA. This must help the readers to understand which neural circuits act upon the induction of active coping behavior under stress.

      Thank you for raising this important point. We agree that activating glutamatergic neurons should lead to activation of post synaptic neurons in the target regions. Delineating this in vivo is less straight forward. Doing so requires much greater knowledge of post synaptic partners of SUMVGLUT2+::POA neurons. There are a number of issues that would need to be accounted for. Undertaking two color photo stimulation plus fiber photometry is possible but not a technical triviality. Further, it is possible that we would measure Ca2+ signals in neurons that have no relevant input or that local circuits in a region may shape the signal. We would also lack temporal resolution to identify mono-postsynaptic vs polysynaptic connections. Thus, we would struggle to know if the change in signal was due to the excitatory input from SuM or from a second region. At present, we remain unclear on how to pursue this question experimentally in a manner that is likely to generate clearly interpretable results.

      (3) In Figure 4, "active coping behaviors" must be called "behaviors relevant to the active behaviors" or "active coping-like behaviors", since those behaviors were in the absence of stressors to cope with.

      Thank you for the suggestion on how to clarify our terminology. We have adopted the active coping-like term.

      (4) For the Dunk test, it is suggested to describe the results and methods more in detail, since the readers would be new to it. In particular, the mice could change their behavior between dunks under this test, although they still showed immobility across trials as in Supplemental Figure 4I. Since neural activity during the test was summarized across trials as in Figure 3, it is critical to examine whether the behavior changes according to time.

      Thank you for identifying this opportunity to improve our manuscript. We have expanded and added a detailed description of the dunk test in the methods section.

      As for Supplemental Figure 4I, we apologize for the confusion because the purpose of this figure is to show that mice remained mobile for the entire 30-second dunk trial. This did not appreciably change over the 10 trials. We have revised this figure to plot both immobile and mobile time to achieve greater clarity on this point.

      Minor points

      Typos

      In Figure 1, please add a serotype of AAVs to make it compatible with other figures and their legends.

      In the main text and Figure 2K, the authors used MHb/LHb and mHb/lHb in a mixed fashion. Please make them unified.

      In the figure legend of Figure 6, change "SuMVGLUT2+::POA neurons drive" to "SuMVGLUT2+::POA neurons " in the title.

      In line 86, please change "Retro-AAV2-Nuc-flox(mCherry)-eGFP" to "AAV5-Nuc-flox(mCherry)eGFP".

      In line 80, please change "Positive controls" to "As positive controls, ".

      Thank you for taking the time and making the effort to identify and call these out. We have corrected them.

    1. Author Response

      The following is the authors’ response to the previous reviews

      The revised manuscript is much improved - many unclear points are now better explained. However, in our opinion, some issues could still be significantly improved.

      1. Statistics: none of us are experts in statistics but several things remain questionable in our opinion and if it were our study, we would consult with an expert:

      a) while we understand the authors note about N-chasing and p-hacking, we wonder how the number of N's was premeditated before obtaining the results. Why in 4M an N of 3 is sufficient while in 3E the N is >20 (and not mentioned). At the very least, we think it would be wise to be cautious when stating something as not-significant when it is clear (as in 4M) that the likelihood of it actually being statistically significant is quite large.

      b) In most analyses, the data is not only normalized by actin or some other measure but also to the first (i.e left side on the graph) condition, resulting in identical data points that equal '1' (in Figure 4 alone - C; I; K; M; and O) - while this might be scientifically sound, it should be mentioned (the specific normalization) and also note that this technique shadows any real variance that exists in the original data in this condition. consider exploring techniques to overcome this issue.

      c) In 3C, - if we understand the experiment, you want to convince us that the DIFFERENCE between eB2-FC compared to FC is larger in the control compared to the experiment. We are not absolutely sure that the statistical tools employed here are sufficient - which is why we would consult an expert.

      A) We are aware that many studies do not consistently quantify such experiments. For example, there are essentially no published examples of the signalling timelines of EphB2 receptors as in Fig. 5. By striving to quantifying such biochemical effects, an unquantified experiment stands out, and so perhaps we were too strict by trying to quantify as many experiments as possible, resulting in low n’s for some of them. We acknowledge that additional experiments on EPHB1 protein stability may reach significance. We have adjusted our text on line 332-335 to point to this interesting trend, and slightly changed the conclusion to this section. Similarly, we commented on similar trends when describing Figs. 1E and 4G on lines 901 and 952.

      B) For the Western blot band intensity normalisation, we believe that our method is scientifically sound. Normally, when the replicate samples are loaded on one gel and blotted on the same membrane, the experimenter only needs to normalise the target band intensity to its cognate loading control band intensity for quantitation. However, we usually have a large number of samples from multiple experiments, carried out on different dates. For example, in Fig. 4B,C there are 7 biological replicates collected from 7 experiments and in Fig. 4D there are 10 protein samples. It is not possible for us to run all samples on the same gel. In addition, due to the combined effects of variance in transfer efficiency, the potency of antibodies, detection efficiency and the developing time for each blot, it is practically impossible to generate similar band intensity for each batch. Thus, we use normalisation of test bands to the loading control for individual experiments, and this analysis method is widely accepted by reputable journals with a focus on biochemical experiments (for example: PMID 37695914: Fig. 3 A,B,C; PMID 36282215: Fig. 3 B,C,D,E; PMID 33843588: Fig. 3 C,D,E,F,G,H). Since the value of the first sample on the plot is 1, which is a hypothetical value and does not meet the parametric test requirement, we performed one-sample t-test for statistics when other samples are compared with the first sample (PMID 35243233 Fig. 6 A,B,C,D; https://www.graphpad.com/quickcalcs/oneSampleT1/, “A one sample t-test compares the mean with a hypothetical value. In most cases, the hypothetical value comes from theory. For example, if you express your data as 'percent of control', you can test whether the average differs significantly from 100.”). Thus, we believe that our normalisation and statistical methods are both correct with a large number of precedents.

      C) This comment refers to the cell collapse experiment shown in Fig. 3C for which the data are plotted in Fig. 3D. We stand by the statistical method used. There are two groups of cells (CTRLCRISPR and MYCBP2 CRISPR) and two treatments for each cell group (Fc control and eB2), thus we should use two-way ANOVA. Since we compared the cell retraction effects of Fc and eB2 on the two groups of cells, Sidak post hoc comparison is the right method to avoid errors introduced by multiple comparisons. Here is an example of an eLife article that used the same statistical method for similar comparisons: PMID 37830910, Fig. 1 H,I. To make the comparison easier, we grouped the experiments by cell type (CTRLCRISPR and MYCBP2 CRISPR) as opposed to by treatment. Below, the old version is on the right, and the new version is on the left. The conclusion is that eB2 induces less cell collapse in cells depleted of MYCBP2, when compared to the control cells. However, eB2 is still able to collapse cells lacking MYCBP2.

      Author response image 1.

      Revisiting these data, we noticed an error introduced when CC compiled the data used to generate Fig. 3D. The data were acquired from nine biological replicates per condition. CC used a mix of two methods for cell collapse rate calculation: the first method involved the sum of collapsed cells and all cells from multiple regions of one coverslip (biological replicate). The second method involved computing a collapse rate in each region which then was used to calculate the average collapse rate for the entire coverslip (technical replicate). Given the small cell numbers due to sparse culture conditions, we believe that the first method is a more conservative approach. We hence re-plotted all replicate data using the first method. This resulted in slightly different % collapse and p values. These were changed accordingly in the text and plot and do not affect the conclusion of this experiment.

      2) thanks for the clarification that the interaction between the extracellular domain of EPHB2 and MYCBP2 might not occur directly - however, unless we missed this it was not clearly stated in the text. It is an important point and also a cool direction for the future - to find the elusive co-receptor that actually helps EPHB2 and MYCBP2 form a complex.

      We now also refer to this in the results section on line 215.

      “Since EPHB2 is a transmembrane protein and MYCBP2 is localised in the cytosol, these experiments suggest that the interaction between the extracellular domain of EPHB2 and MYCBP2 might be indirect and mediated by other unknown transmembrane proteins.”

      3) The Hela CRISPR cell line is better explained in the response letter but still not sufficiently explained in the text for a non-expert reader. If the authors want any reader to comprehend this, we would strongly recommend adding a scheme.

      We now include a schematic outlining the CRISPR cell generation as Fig. 3A and its description on line 926.

      Author response image 2.

      4) To clarify some of our previous (and persisting) concerns about Figure 3D/E - it is true that a reduction in 25% of cell size is dramatic. But (if we understand correctly) your claim is that a reduction in 22% (this is a guess, as the actual numbers are not supplies) is significantly less than 25%. Even if it is, statistically speaking, significant, what is the physiological relevance of this very slight effect? In this experiment, the N was quite large, and we wonder if the images in D are representative - it would be nice to label the data points in E to highlight which images you used.

      We now mention the average cell area contraction measurements in the legend to Fig. 3F on line 935. We also tracked down the individual cells shown in Fig. 3E and they are now labelled as data points in blue in Fig. 3F. HeLa cell collapse is a simplified model of EPHB2 function and we do not know whether the difference between the behaviour of CTRLCRISPR and MYCBP2 CRISPR cells is physiologically significant and thus we prefer not to speculate on this.

      5) Figure 3F and other stripe assays - In the end, it is your choice how to quantify. We believe that quantifying area of overlap is a more informative and objective measurement that might actually benefit your analyses. That said, if you do keep the quantification as it is now, you have to define the threshold of what you mean by "cell/s (or an axon in 7A, where it is even more complicated as are you eluding to primary, secondary, or even smaller branches) are RESIDING within the stripe". Is 1% overlap sufficient or do you need 10 or 50% overlap?

      We now added this statement to the methods on line 745: “A cell was considered to be on an ephrin-B2 stripe when more than 50% of its nucleus was located on that stripe”. For chick explant stripe assay, when measuring the length of an axon on a stripe, we only measured the main axons originated from the explants.

      For explant/stripe experiments in Fig. 7 AB, we now use the term “GFP-expressing neurite” rather than “branch”. This was already present in the results of the previous version, but the methods and legend needed to be brought up to date (lines 786 and 1008. We think that “branch” was a confusing term that was supposed to mean the same thing as “neurite” but came across as some indication of branching. We do not know whether the GFP+ neurites were primary or secondary extensions of explants, or in fact, whether some of them contained more than one axon. We also adjusted the method to reflect the fact that some stripes were used in conjunction with a single explant and added a reference to a previous study extensively using this method (Poliak et al., 2015) on line 778.

      6) We still don't get the link to the lysosomal degradation. Your data suggests that in your cells EPHB2 is primarily degraded by the lysosomal pathway and not proteasome. Any statement about MYCBP2 is not strongly supported by the data, in our opinion - Unless you develop some statistical measurement that shows that the effect of BafA1 is statistically different in MYCBP2 cells than in control cells. Currently, this is not the case and the link is therefore not warranted in our opinion.

      We generated a new version of Fig. 4K with average increase in EPHB2 levels in the presence of BafA1 and CoQ, compared to DMSO treated controls (see below). BafA1 and CoQ restored EPHB2 protein levels by 19% and 14% respectively in CtrlCRISPR cells, while the inhibitors restored EPHB2 protein levels by 40% and 35% respectively in MYCBP2 CRISPR cells.

      Author response image 3.

      For each of the 4 replicates, the increase in EPHB2 levels by BafA1 compared to DMSO is as follows:

      Author response table 1.

      These values are not significantly different between CtrlCRISPR cells versus MYCBP2 CRISPR cells (p= 0.08, student’s t test). Similarly for the CoQ experiment. We now temper our conclusion for this experiment: Although the difference in percentage increase between CTRLCRISPR cells and MYCBP2CRISPR cells is not significant, this trend raises the possibility that the loss of MYCBP2 promotes EPHB2 receptor degradation through the lysosomal pathway (line 319). We also adjusted the section title (line 306).

      7) While the C. elegans part is now MUCH better explained - we are not sure we understand the additional insight. The fact that vab-1 and glo4 double mutants are additive as are vab1 and fsn1, suggest they act in parallel (if the mutants are NULL, and not if they are hypomorphs, if one wants to be accurate) - how this relates to your story is unclear. The vab1/rpm1 double mutant is still uninformative and incomplete. rpm1 phenotype is so severe that nothing would make it more severe. We read the Jin paper that the authors directed to - nothing makes the rpm1 phenotype more severe. Yes, some DOWNSTREAM elements make the rpm1 phenotype LESS severe - this is not something you were testing, to the best of our knowledge. Rather, you wanted to see if rpm1 mutant resulted in stabilization of vab1 and thus suppression of vab1 phenotype - we are just not sure the system is amenable to test (actually reject) your hypothesis that Vab1 is degraded by rpm1. Also, assuming we are talking about NULLs, the fact that the rpm1 phenotype is WAY stronger than the vab1 mutant, suggests that rpm1 functions via multiple routes, adding even more complexity to the system. Given these results, despite the much improved clarity, we are still not sure that the worm data adds new insight, rather than potentially confusing the reader.

      We realise that the genetic interactions between vab-1 and the RPM-1/MYCBP2 signalling network are complicated. However, we insist on keeping the data for the sake of its availability for future studies and completeness. We also think it is important for readers and the community to see these data, even if the authors and reviewers are not entirely in agreement about the importance/interpretation of experimental outcomes. It is our hope that the community will examine the results and draw their own conclusions.

      A few points of clarification:

      The C. elegans experiments were designed to test genetically if the vertebrate interactions between EPHB2 and MYCBP2 and its signalling network are conserved. We studied two kinds of interactions: (1) between vab-1 and RPM-1/MYCBP2 downstream proteins (GLO-4 and FSN-1) and (2) between vab-1 and rpm-1. For these studies, we used null alleles for vab-1, glo-4 and fsn-1 which is now noted on lines 440, 453, 475 and 859. Our findings are consistent with the VAB-1 Ephrin receptor functioning in parallel to known RPM-1 binding proteins. This is further supported by new data: vab-1; fsn-1 double mutants showed enhanced incidence of axon overextension defects using a second transgenic background, zdIs5 (Pmec-4::GFP), to visualize axon termination (Fig. 8F).

      This second transgenic background also allowed us to generate new data to address your concerns about phenotypic saturation in rpm-1 mutants. To do this, we used the zdIs5 (Pmec4::GFP) genetic background, in which axon termination defects are not saturated in rpm-1 mutants (Fig. 8F) because they can be enhanced by other mutants such as cdc-42 and unc-33 (Fig. 7C, D, in Borgen et al. Development 144, 4658–4672 (2017), PMID 29084805). In this new background, we found that vab-1 loss of function fails to enhance the incidence of severe “hook” defects in rpm-1 mutants which is an indication that the two genes function in the same pathway. Importantly, prior studies in this background, also showed that mutants in the RPM-1 signalling network (e.g. fsn-1, glo-4 and ppm-2) do not enhance the incidence of severe “hook” defects as double mutants with rpm-1 compared to rpm-1 single mutants (Fig. 7B, ibid.).

      To reflect these ideas more clearly, we revised the Results section pertaining to C. elegans genetics (starting on line 418) and tempered our discussion (lines 517). Basically, this section now says that we studied genetic interactions between vab-1 and the RPM-1/MYCBP2 signalling network. From these experiments we conclude that: (1) The enhancement of overextension defects in vab-1; glo-4 and vab-1; fsn-1 double mutants compared to single mutants indicates that VAB-1/EPHR functions in parallel to known RPM-1 binding proteins to facilitate axon termination, and (2) Since the vab-1; rpm-1 double mutants do not display an increased frequency or severity of overextension defects compared to rpm-1 single mutants, VAB-1 /EPHR functions in the same genetic pathway as RPM-1/MYCBP2.

      The new genetic data included in this version were generated by Karla J. Opperman who is now included as a co-author.

      Further corrections:

      Author response image 4.

      Because of the errors associated with quantifications in Fig. 3D (see above), we reviewed other quantification methodologies and noticed another discrepancy that required a correction. In the hippocampal neuron growth cone collapse assay shown in the previous version of Fig. 7 D (left), the growth cones were classified into three groups: 1, fully collapsed; 2, hard to tell, but not fully collapsed; 3, fan-shape cones. Two different quantifications were performed as follows: (1), number of fully collapsed cones divided by the numbers of all growth cones; (2), number of fully collapsed cones divided by [number of fully collapsed cones + fan-shape cones]. CC erroneously used the second method to generate Fig. 7D.

      We think that the first method is more appropriate. Furthermore, since n=5 for the Fc and eB1-Fc conditions, but n=3 for the eB2-Fc condition, we decided to omit it. The final plot for figure 7D is the following:

      Author response image 5.

      Our conclusion still stands that exogenous FBD1 WT overexpression impaired the growth cone collapse mediated by EphB.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this paper, Steinemann et al. characterized the nature of stochastic signals underlying the trial-averaged responses observed in the lateral intraparietal cortex (LIP) of non-human primates (NHPs), while these performed the widely used random dot direction discrimination task. Ramp-up dynamics in the trial averaged LIP responses were reported in numerous papers before. However, the temporal dynamics of these signals at the single-trial level have been subject to debate. Using large-scale neuronal recordings with Neuropixels in NHPs, allows the authors to settle this debate rather compellingly. They show that drift-diffusion-like computations account well for the observed dynamics in LIP.

      Strengths:

      This work uses innovative technical approaches (Neuropixel recordings in behaving macaque monkeys). The authors tackle a vexing question that requires measurements of simultaneous neuronal population activity and hence leverage this advanced recording technique in a convincing way

      They use different population decoding strategies to help interpret the results.

      They also compare how decoders relying on the data-driven approach using dimensionality reduction of the full neural population space compare to decoders relying on more traditional ways to categorize neurons that are based on hypotheses about their function. Intriguingly, although the functionally identified neurons are a modest fraction of the population, decoders that only rely on this fraction achieve comparable decoding performance to those relying on the full population. Moreover, decoding weights for the full population did not allow the authors to reliably identify the functionally identified subpopulation.

      Weaknesses:

      No major weaknesses beyond a few, largely clarification issues, detailed below.

      We thank Reviewer 1 (R1) for this summary. The revised manuscript incorporates R1’s suggestions, as detailed below.

      Reviewer #2 (Public Review):

      Steinemann, Stine, and their co-authors studied the noisy accumulation of sensory evidence during perceptual decision-making using Neuropixels recordings in awake, behaving monkeys. Previous work has largely focused on describing the neural underpinnings through which sensory evidence accumulates to inform decisions, a process which on average resembles the systematic drift of a scalar decision variable toward an evidence threshold. The additional order of magnitude in recording throughput permitted by the methodology adopted in this work offers two opportunities to extend this understanding. First, larger-scale recordings allow for the study of relationships between the population activity state and behavior without averaging across trials. The authors’ observation here of covariation between the trial-to-trial fluctuations of activity and behavior (choice, reaction time) constitutes interesting new evidence for the claim that neural populations in LIP encode the behaviorally-relevant internal decision variable. Second, using Neuropixels allows the authors to sample LIP neurons with more diverse response properties (e.g. spatial RF location, motion direction selectivity), making the important question of how decision-related computations are structured in LIP amenable to study. For these reasons, the dataset collected in this study is unique and potentially quite valuable.

      However, the analyses at present do not convincingly support two of the manuscript’s key claims: (1) that ”sophisticated analyses of the full neuronal state space” and ”a simple average of Tconin neurons’ yield roughly equivalent representations of the decision variable; and (2) that direction-selective units in LIP provide the samples of instantaneous evidence that these Tconin neurons integrate. Supporting claim (1) would require results from sophisticated population analyses leveraging the full neuronal state space; however, the current analyses instead focus almost exclusively on 1D projections of the data. Supporting claim (2) convincingly would require larger samples of units overlapping the motion stimulus, as well as additional control analyses.

      We thank the reviewer (R2) for their careful reading of our paper and the many useful suggestions.

      As detailed below, the revised manuscript incorporates new control analyses, improved quantification, and statistical rigor, which now provide compelling support for key claim #1. We do not regard claim #2 as a key claim of the paper. It is an intriguing finding with solid support, worthy of dissemination and further investigation. We have clarified the writing on this matter.

      Specific shortcomings are addressed in further detail below:

      (1) The key analysis-correlation between trial-by-trial activity fluctuations and behavior, presented in Figure 5 is opaque, and would be more convincing with negative controls. To strengthen the claim that the relationship between fluctuations in (a projection of) activity and fluctuations in behavior is significant/meaningful, some evidence should be brought that this relationship is specific - e.g. do all projections of activity give rise to this relationship (or not), or what level of leverage is achieved with respect to choice/RT when the trial-by-trial correspondence with activity is broken by shuffling.

      We do not understand why R2 finds the analysis opaque, but we are grateful for the lucid recommendations. The relationships between fluctuations in neural activity and behavior are indeed “specific” in the sense that R2 uses this term. In addition to the shuffle control, which destroys both relationships (Reviewer Figure 1), we performed additional control analyses that preserve the correspondence of neural signals and behavior on the same trial. We generated random coding directions (CDs) by establishing weight vectors that were either chosen from a standard normal distribution or by permuting the weights assigned to PC-1 in each session. The latter is the more conservative measure. Projections of the neural responses onto these random coding directions render 𝑆rand(𝑡). Specifically, the degree of leverage is effectively zero or greatly reduced. These analyses are summarized in a new Supplementary Figure S10. The bottom row of Figure S10 also addresses the question, “What degree of leverage and mediation would be expected for a theoretical decision variable?” This is accomplished by simulating decision variables using the drift-diffusion model fits in Figure 1c. The simulation is consistent with the leverage and (incomplete) mediation observed for the populations of Tcon neurons. For details see Methods, Simulated decision variables and Leverage of single-trial activity on behavior.

      (2) The choice to perform most analysis on 1D projections of population activity is not wholly appropriate for this unique type of dataset, limiting the novelty of the findings, and the interpretation of similarity between results across choices of projection appears circular:

      We disagree with the characterization of our argument as circular, but R2 raises several important points that will probably occur to other careful readers. We address them as subpoints 2.1–2.4, below. Importantly, we are neither claiming nor assuming that the LIP population activity is one-dimensional. We have revised the paper to avoid giving this impression. We are also not claiming that the average of Tin neurons (or the 1D projections) explains all features of the LIP population, nor would we expect it to, given the diversity of response fields across the population. Our objective is to identify the specific dimension within population activity that captures the decision variable (DV), which has been characterized successfully as a one-dimensional stochastic process—that is, a scalar function of time. We have endeavored to clarify our thinking on this point in the revised manuscript (e.g., lines 97–98, 103–104).

      (2.1) The bulk of the analyses (Figure 2, Figure 3, part of Figure 4, Figure 5, Figure 6) operate on one of several 1D projections of simultaneously recorded activity. Unless the embedding dimension of these datasets really does not exceed 1 (dimensionality using e.g. participation ratio in each session is not quantified), it is likely that these projections elide meaningful features of LIP population activity.

      We now report the participation ratio (4.4 ± 0.4, mean ± s.e. across sessions), and we state that the first 3 PCs explain 67.1±3.1% of the variance of time- and coherence-dependent signals used for the PCA. We agree that the 1D projections may elide meaningful features of LIP population activity. Indeed, we make this point through our analysis of the Min neurons. We do not claim that the 1D projections explain all of the meaningful features of LIP population activity. They do, however, reveal the decision variable, which is our main focus. These 1D signals contain features that correlate with events in the superior colliculus, summarized in Stine et al. (2023), attesting to their biological relevance.

      (2.2) Further, the observed similarity of results across these 1D projections may not be meaningful/interpretable. First, the rationale behind deriving Sramp was based on the ramping historically observed in Tin neurons during this task, so should be expected to resemble Tin.

      The Reviewer is correct that we would expect 𝑆ramp to resemble the ramping observed in Tin neurons. We refer to this approach as hypothesis-driven. It captures the drift component of drift-diffusion. It is true that the Tcon neurons exhibit such ramps in their trial average firing rates, but this does not guarantee in

      that the single-trial population firing rates would manifest as drift-diffusion. Indeed Latimer et al. (2015) concluded that the ramp-like averages comprise stepping from a low to a high firing rate on each trial at a random time. Therefore, while R2 is right to characterize the similarity of Tcon to the ramp direction in in trial-averaged activity as unsurprising, their similarity on single trials is not guaranteed.

      (2.3) Second, Tin comprises the largest fraction of the neuron groups sampled during most sessions, so SPC1 should resemble Tin too. The finding that decision variables derived from the whole population’s activity reduce essentially to the average of Tin neurons is thus at least in part ’baked in’ to the approach used for deriving the decision variables.

      This is incorrect. The Tcon in neurons constitute only 14.5% of the population, on average, across the sessions (see Table 1). This misunderstanding might contribute to R2’s concern about the importance of these neurons in shaping PC1. It is not simply because they are over-represented. Also, addressing R2’s concern about circularity, we would like to remind R2 that the selection of Tin neurons was based only on their spatial selectivity in the delayed saccade task. We do not see how it could be baked-in/guaranteed that a simple average of these neurons (i.e. zero degrees of freedom) yields dynamics and behavioral correlations that match those produced by dimensionality-reduction techniques that (𝑖) have degrees of freedom equal to the number of neurons and (𝑖𝑖) are blind to the neurons’ spatial selectivity. We have additionally modified what is now Supplementary Figure S13 (old Supplementary Figure S8), which portrays the mean accuracy of choice decoders trained on the neural activity of all neurons, only Tin neurons, all but the Tin neurons, and all but Tin and Min neurons, respectively. Figure S13 now highlights how much more readily choice can be decoded from the small population of Tin neurons than the remainder of the population.

      (2.4) The analysis presented in Figure S6 looks like an attempt to demonstrate that this isn’t the case, but is opaque. Are the magnitudes of weights assigned to units in Tin larger than in the other groups of units with preselected response properties? What is their mean weighting magnitude, in comparison with the mean weight magnitude assigned to other groups? What is the null level of correspondence observed between weight magnitude and assignment to Tin (e.g. a negative control, where the identities of units are scrambled)?

      The revised Figure S6—what is now Figure S9—displays more clearly that the weights assigned to Tcon and Tips neurons (purple & yellow, respectively) are larger in magnitude than those assigned in in to other neurons (gray). Author response table 1 shows a more detailed breakdown of the groups. Note that the length of the vector of weights is one. We are unsure what R2 means by “the null level of correspondence.” Perhaps it helps to know that the mean weight of the “other neurons” is close to zero for all four coding directions. However, it is the overlap of the weights and the relative abundance of non-Tin neurons that is more germane to the point we are making. To wit, knowing the weight (or percentile) of a neuron is a poor predictor that it belongs to the Tin category. This point is most clearly supported by the logistic regression (Fig. S9, bottom row). In other words, the large group of non-Tin neurons contribute substantially to all four coding directions examined in Figure S9. Thus, the similarity between Tin neurons and PC1 is not simply due to an over-representation of Tin neurons as suggested in item 2.3.

      Author response table 1.

      Mean weights assigned to neuron classes in four coding directions.

      (3) The principal components analysis normalization procedure is unclear, and potentially incorrect and misleading: Why use the chosen normalization window (±25ms around 100ms after motion stimulus onset) for standardizing activity for PCA, rather than the typical choice of mean/standard deviation of activity in the full data window? This choice would specifically squash responses for units with a strong visual response, which distorts the covariance matrix, and thus the principal components that result. This kind of departure from the standard procedure should be clearly justified: what do the principal components look like when a standard procedure is used, and why was this insufficient/incorrect/unsuitable for this setting?

      We used the early window because it is a robust measure of overall excitability, but we now use a more conventional window that spans the main epoch of our analyses, 200–600 ms after motion onset. This method yields results qualitatively similar to the original method. We are persuaded that this is the more sensible choice. We thank R2 for raising this concern.

      (4) Analysis conclusions would generally be stronger with estimates of variability and control analyses: This applies broadly to Figures 2-6.

      We have added estimates of variability and control analyses where appropriate.

      Figure 2 shows examples of single-trial signals. The variability is addressed in Figure 3a and the new Supplementary Figure S5.

      Figure 3 now contains error bars derived by bootstrapping (see Methods, Variance and autocorrelation of smoothed diffusion signals). We have also added Supplementary Figure S5, which substantiates the sublinearity claim using simulations.

      Figure 4 (i) We now indicate the s.e.m. of decoding accuracy (across sessions) by the shading in Figure 4a. (ii) The black symbols in new Supplementary Figure S8 show the mean±s.e.m. for all pairwise comparisons shown in Figure 4d & e. (iii) Supplementary Figure S8 also summarizes two control analyses that deploy random coding directions (CDs) in neuronal state space. The upper row of Fig S9 compares the observed cosine similarity (CoSim)—between the CD identified by the graph title and the other four CDs labeled along the abscissa—with values obtained with 1000 random CDs established by random permutations of the weight assignments. The brown symbols are the mean±sdev of the CoSim (N=1000). The error bars are smaller than the symbols. We use the cumulative distribution of CoSim under permutation to estimate p-values (p<0.001 for all comparisons). We used a similar approach to estimate the distribution of the analogous correlation statistics between signals rendered by random directions in state space (Figure S8, lower row). For additional details, please see Methods, Similarity of single-trial signals.

      Figure 5: The rigor of all claims associated with this figure is adduced from two control analyses and a simulation. The first control breaks the trial-by-trial correspondence between neural signals and behavior (Reviewer Figure 1). The second control shows that neural activity does not have substantial leverage on behavior when projected onto random directions in state space (Supplementary Figure S10, top). Simulations of decision variables using parameters derived from the fits to the behavioral data (Figure 1) support a degree of leverage and mediation comparable to the values observed for 𝑆Tincon (Supplementary Figure S10, bottom). For additional details, please see Methods (Leverage of single-trial activity on behavior) and the reply to item 1, above.

      Figure 6: Panels c&d show estimates of variability across neurons and experimental sessions, respectively. The reported p-value is based on a permutation test (see Methods, Correlations between Min and Tconin ). The correlations shown in panel e (heatmap) are derived from pooled data across sessions. The reported p-value is based on a permutation test (see Methods, Correlations between Min and Tconin ).

      Reviewer #3 (Public Review):

      Summary:

      The paper investigates which aspects of neural activity in LIP of the macaque give rise to individual decisions

      (specificity of choice and reaction times) in single trials, by recording simultaneously from hundreds of neurons. Using a variety of dimensionality reduction and decoding techniques, they demonstrate that a population-based drift-diffusion signal, which relies on a small subset of neurons that overlap choice targets, is responsible for the choice and reaction time variability. Analysis of direction-selective neurons in LIP and their correlation with decision-related neurons (T con in [Tconin ] neurons ) suggests that evidence integration occurs within area LIP.

      Strengths:

      This is an important and interesting paper, which resolves conflicting hypotheses regarding the mechanisms that underlie decision-making in single trials. This is made possible by exploiting novel technology (Primatepixels recordings), in conjunction with state-of-the-art analyses and well-established dynamic random dot motion discrimination tasks.

      General recommendations:

      (1) Please tone down causal language. You presentcompelling correlativeevidencefor the idea thatLIP population activity encodes the drift-diffusion DV. We feel that claims beyond that (e.g., ”Single-trial drift-diffusion signals control the choice and decision time”) would require direct interventions, and are only partially supported by the current evidence. Further examples are provided in point 1) of Reviewer 1 below.

      We have adopted the recommendation to “tone down the causal language.” Throughout the manuscript, we strive to avoid conveying the false impression that the present findings provide causal support for the decision mechanism. However, other causal studies of LIP support causality in the random dot motion task (Hanks et al., 2006; Jeurissen et al., 2022). It is therefore justifiable to use terms that imply causality in statements intended to convey hypotheses about mechanism. We agree that we should not give the false impression that the present support for said mechanism is adduced from causal perturbations in this study, as there were none.

      (2) Please provide a commonly used, data-driven quantification of the dimensionality of the population activity – for example, using participation ratio or the number of PCs explaining 90 % of the variance. This will help readers evaluate the conclusions about the dimensionality of the data.

      Principal component analysis reveals a participation ratio of 4.4 ± 0.4 (mean ±s.e., across sessions), and the first 3 PCs explain 67.1 ± 3.1 percent of the variance. The dimensionality of the data is low, but greater than one. We state this in Methods (Principal Component Analysis) and in Results (Single-trial drift-diffusion signals approximate the decision variable, lines 200–201).

      (3) Please justify the normalization procedure used for PCA: Why use the chosen normalization window (±25ms around 100ms after motion stimulus onset) for standardizing activity for PCA, rather than the more common quantification of mean/standard deviation across the full data window? What do the first principal components look like when the latter procedure is used?

      We now use a more conventional window that spans the main epoch of our analyses, 200–600 ms after motion onset. This method yields results qualitatively similar to the original method. We are persuaded that this is the more sensible choice.

      (4) Please provide estimates of variability for variance and autocorrelation in Fig. 3 (e.g., through bootstrapping). Further, simulations could substantiate the claim about the expected sub-linearity at later time points (Fig. 3a) due to the upper stopping bound and limited firing rate range.

      We thank the reviewers for these helpful recommendations. The revised Fig. 3 now contains error bars derived by bootstrapping (see Methods, Variance and autocorrelation of smoothed diffusion signals). We have also added Supplementary Figure S5, which substantiates the sub-linearity claim using simulations.

      (5) Please add controls and estimates of variability for decoding across sessions in Fig. 4: what are the levels of within-trial correlation/cosine similarity for random coding directions? What is the variability in the estimates of values shown in a/d/e?

      We have addressed each of these items. (1) Figure 4a now shows the s.e.m. of decoding accuracy (across sessions). (2) Regarding the variability of estimates shown in Figure 4d & e, the standard errors are displayed in the new supplementary Figure S8. It makes sense to show them there because there is no natural way to represent error on the heat maps in Figure 4, and Figure S8 concerns the comparison of the values in Figure 4d&e to values derived from random coding directions. (3) Random coding directions lead to values of cosine similarity and within-trial correlation that do not differ significantly from zero. We show this in several ways, summarized in our reply to Public Review item 4. Additional details are in the revised manuscript (Methods, Similarity of single-trial signals) and the new Supplementary Figure S8.

      (6) Please perform additional analysis to strengthen the claim from Fig. 6, that Min represents the integrand and not the integral. The analysis in Fig. 6d could be repeated with the integral (cumulative sum) of the single-trial Min signals. Does this yield an increase in leverage over time?

      The short answer is, yes in part. Reviewer Figure 2a provides support for leverage of the integral on choice, and this leverage, like 𝑆Tincon (t), increases as a function of time. The effect is present in all seven sessions that have both Mleftin and Mrightin neurons (all 𝑝 < 1𝑒 − 10). However, as shown in panel b, the same integral fails to demonstrate more than a hint of leverage on RT. All correlations are barely negative, and the magnitude does not increase as a function of time. We suspect—but cannot prove—that this failure arises because of limited power and the expected weak effect. Recall that the mediation analysis of RT is restricted to longer trials. Moreover, the correlation between the Min difference and the Tin signal is less than 0.1 (heatmap, Fig. 6e), implying that the Min difference explains less than 1% of the variance of 𝑆Tin(𝑡). We considered including Reviewer Figure 2 in the paper, but we feel it would be disingenuous (cherry-picking) to report only the positive outcome of the leverage on choice. If the editors feel strongly about it, we would be open to including it, but leaving these analyses out of the revised manuscript seems more consistent with our effort to deëmphasize this finding. In the future, we plan to record simultaneously from populations MT and LIP neurons (Min and Tin, of course) and optimize Min neuron yield by placing the RDM stimulus in the periphery.

      (7) Please describe the complete procedure for determining spatially-selective activity. E.g.: What response epoch was used, what was the spatial layout of the response targets, were responses to all ipsi- vs contralateral targets pooled, what was the spatial distribution of response fields relative to the choice targets across the population?

      We thank the reviewers for pointing out this oversight. We now explain this procedure in the Methods (lines 629–644):

      Neurons were classified post hoc as Tin by visual-inspection of spatial heatmaps of neural activity acquired in the delayed saccade task. We inspected activity in the visual, delay, and perisaccadic epochs of the task. The distribution of target locations was guided by the spatial selectivity of simultaneously recorded neurons in the superior colliculus (see Stine 2023 for details). Briefly, after identifying the location of the SC response fields, we randomly presented saccade targets within this location and seven other, equally spaced locations at the same eccentricity. In monkey J we also included 1–3 additional eccentricities, spanning 5–16 degrees. Neurons were classified as Tin if they displayed a clear, spatially-selective response in at least one epoch to one of the two locations occupied by the choice targets in the main task. Neurons that switched their spatial selectivity in different epochs were not classified as Tin. The classification was conducted before the analyses of activity in the motion discrimination task. The procedure was meant to mimic those used in earlier single-neuron studies of LIP (e.g., Roitman & Shadlen 2002) in which the location of the choice targets was determined online by the qualitative spatial selectivity of the neuron under study. The Tcon neurons in the in present study were highly selective for either the contralateral or ipislateral choice target used in the RDM task (AUC = 0.89±0.01; 𝑝 < 0.05 for 97% of neurons, Wilcoxon rank sum test). Given the sparse sampling of saccade target locations, we are unable to supply a quantitative estimate of the center and spatial extent of the RFs.

      (8) Please clarify if a neuron could be classified as both Tin and Min. Or were these categories mutually exclusive?

      These categories are mutually exclusive. If a neuron has spatially-selective persistent activity, as defined by the method described above, it is classified as a Tin neuron and not as an Min neuron even if it also shows motion-selective activity during passive motion viewing. We now specify this in the Methods (lines 831–832).

      Reviewer #1 (Recommendations For The Authors):

      𝑅∗1.1a Causal language (Line 23-24): “population activity represents […] drift” and “we provide direct support for the hypothesis that drift-diffusion signal is the quantity responsible for the variability in choice and RT” reads at first sight as if the authors claim that they present evidence for a causal effect of LIP activity on choice. The authors areotherwisenuanced and carefultopointout thattheir evidence is correlational. What seems to be meant is that the population activity/drift-diffusion signal ”approximates the DV that gives rise to the choices […]” (cf. line 399). I would recommend using such alternative phrasing to avoid confusion (and the typically strong reactions by readers against misleading causal statements).

      We have adopted the reviewer’s recommendation and have modified the text throughout to reduce causal language. See our response to General Recommendation 1.

      𝑅∗1.1b Relatedly, any discussion about the possibility of LIP being causally involved in evidence integration (e.g. lines 429-445 [Au: now 462–478]) should also comment on the possibility of a distributed representation of the decision variable given that neural correlates of the DV have been reported in several areas including PFC, caudate and FEF.

      We believe this is possible. However, we hope to avoid discussions about causality given that it is not a focus of the paper. Although it is somewhat tangential, we have shown elsewhere that LIP is causal in the sense that causal manipulations affect behavior, but it is also true that causality does not imply necessity, and similarly, lack of necessity does not imply “only correlation.” Regarding distributed representations, it is worth keeping in mind the cautionary counter-example furnished by the SC study (Stine et al., 2023). The firing rates measured by averaging over trials are similar in SC and LIP; both manifest as coherence and direction-dependent ramps, leading to the suggestion that they form a distributed representation of the decision variable. With single-trial resolution, we now know that LIP and SC exhibit distinct dynamics—drift-diffusion and bursting, respectively. It remains to be seen if single-trial resolution achievable by simultaneous Neuropixels recordings from prefrontal areas and LIP reveal shared or distinct dynamics.

      𝑅∗1.2 How was the spatially selective activity determined? The classification of Tin neurons is critical to this study - how was their spatial selectivity determined? Please describe this in similar detail as the description of direction selectivity on lines 681-690 [Au: now 824–832]. E.g.: what response epoch was used, what was the spatial layout of the response targets, were responses to all ipsi- vs contralateral targets pooled, and what was the spatial distribution of response fields relative to the choice targets across the population?

      We now explain the selection procedure in Methods (lines 629–644). Please see our reply to General Recommendation 7, above.

      𝑅∗1.3 Could a neuron be classified as both Tin and Min, or were these categories mutually exclusive? Please clarify. (This goes beyond the scope of the current study: but did the authors find evidence for topographic organization or clustering of these categories of neurons?)

      These categories are mutually exclusive. Please see our response to General Recommendation 8, above.

      𝑅∗1.4 Contrary to the statement on line 121, the trial averages in Fig. 2a, 2b show coherence dependency at the time of the saccade in saccade-aligned traces for the coding strategies, except for STin (fig. 2c). Is this a result of the choice for t1 (= 0.1s)? (The authors may want to change their statement on line 121.) Relatedly, do the population responses for the two coding strategies Sramp and SPC1 depend on the epoch used to derive weights for individual neurons?

      We have revised the description to accommodate R2’s observation. 𝑆ramp retains weak coherence-dependence before saccades towards the choice target contralateral to the recording site. This was true in four of the eight sessions. For 𝑆PC1, there is no longer a coherence dependency for the Tin choices, owing to the change in normalization method (see revised Figure 2b).

      We also corrected an error in the Methods section. Specifically, the ramp ends at 𝑡1 \= 0.05 s before the time of the saccade, not 𝑡1 \= 0.1 s. While we no longer emphasize the similarity of traces aligned to saccade, it is reasonable to find issue with the observation that they retain a dependency on coherence (𝑆ramp only) because, according to theory, traces associated with Tin choices should reach a common positive threshold at decision termination. That said, for the Ramp direction there may be a reason to expect this discrepancy from theory. The deterministic part of drift-diffusion includes an urgency signal that confers positive convexity to the deterministic drift. This accelerating nonlinearity is not captured by the ramp, and it is more prominent at longer decision times, thus low coherences. We do not share this interpretation in the revised manuscript, in part because retention of coherence dependency is present in only half the sessions (see Reviewer Figure 3) The correction to the definition of 𝑡1 also provides an opportunity to address R2’s final question (“Relatedly,…?”). For 𝑆ramp this particular variation in 𝑡1 does not affect 𝑆ramp, and 𝑆PC1 no longer retains coherence dependency for Tin choices. Note that our choice of 𝑡0 and 𝑡1 is based on the empirical observation that the ramping activity in response averages of Tin neurons typically begins 200 ms after motion onset and ends 50–100 ms before initiation of the saccadic choice. The starting time (𝑡0) is also supported by the observation that the decoding accuracy of a choice-decoder begins to diverge from chance at this time (Figure 4a).

      𝑅∗1.5 It is intriguing that Sramp and SPC1 show dynamics that look so similar (fig. 2a, 2b). How do the weights assigned to each neuron in both strategies compare across the population?

      The weights assigned to each neuron are very similar across the two strategies as indicated by a cosine similarity (0.65 ± 0.04, mean ±s.e.m. across sessions).

      𝑅∗1.6 Tin neurons, which show dynamics closely resembling different coding directions (fig. 2) and the decoders do not have weights that can distinguish them from the rest of the population in each of these analyses (fig. S7). Is it fair to interpret these findings as evidence for broad decision-related co-variability in the recorded neural population in LIP?

      Yes, our results are consistent with this interpretation. However, it is worth reiterating that decoding performance drops considerably when Tin neurons are not included (see Supplementary Figure S13). Thus, this broad decision-related co-variability is present but weak.

      𝑅∗1.7 It is intriguing that the decoding weights of the different decoders did not allow the authors to reliably identify Tin neurons. Could this be, in part, due to the low dimensionality of the population activity and task that the animals are presumably overtrained on? Or do the authors expect this finding to hold up if the population activity and task were higher dimensional?

      Great question! We can only speculate, but it seems possible that a more complex, “higher dimensional” task could make it easier to identify Tin neurons. For example, a task with four choices instead of two may decrease correlations among groups of neurons with different response fields. We have added this caveat to the discussion (lines 459-–461). One minor semantic objection: The animal has learned to perform a highly contrived task at low signal-to-noise. The animal is well-trained, not over-trained.

      𝑅∗1.8 Lines 135-137 [Au: now 141–142]: The similarity in the single trial traces from different coding strategies (fig. 2a-2c, left) is not as evident to me as the authors suggest. It might be worthwhile computing the correlation coefficients between individual traces for each pair of strategies and reporting the mean correlation to support the author’s point.

      We report the mean correlation between single-trial signals generated by the chosen dimensionality reduction methods in Figure 4e. We show the variability in this measure in Supplementary Figure S8. We have also adjusted the opacity of the single-trial traces in Figure 2, left.

      𝑅∗1.9 Minor/typos:

      -line 74: consider additionally citing Hyafil et al. 2023.

      -line 588: ”that were strongly correlated”?

      -line 615: ”were the actual drift-diffusion process were...”.

      -line 717: ”a causal influence” -> ”no causal influence”.

      Fig. 6: panel labels e vs d are swapped between the figure and caption.

      Fig. 3c: labels r1,3 & r2,3 are flipped.

      We have addressed all of these items. Thank you.

      Reviewer #2 (Recommendations For The Authors):

      𝑅∗2.1 (Figure 2) Determine whether restricting the analysis to 1D projections of the data is a suitable approach given the actual dimensionality of the datasets being analyzed:

      - Should show some quantification of the dimensionality of the recorded activity; could do this by quantifying the dimensionality of population activity in each session, e.g. with participation ratio or related measures (like # PCs to explain some high proportion of the variance, e.g. 90 %). If much of the variation is not described in 1 dimension, then the paper would benefit from some discussion/analysis of the signals that occupy the other dimensions.

      We now report the participation ratio (4.4 ± 0.4, mean ±s.e. across sessions), and we state that the first 3 PCs explain 67.1 ± 3.1% of the variance of the time- and coherence-dependent signals used for the PCA (mean ±s.e). We agree that the 1D projections may elide meaningful features of LIP population activity. Indeed, we make this point through our analysis of the Min neurons. To reiterate our response above, we do not claim that the 1D projections explain all of the meaningful features of LIP population activity. They do, however, reveal the decision variable, which is our main focus. These 1D signals contain features that correlate with events in the superior colliculus, summarized in Stine et al. (2023), attesting to their biological relevance.

      The Reviewer is correct that our approach presupposes a linear embedding of the 1D decision variable inthepopulationactivity. Inotherwords, anonlinearrepresentationofthe1Ddecisionvariableinpopulation activity could have an embedding dimensionality greater than 1, and there may well be a non-linear method that reveals this representation. To test this possibility, we decoded choice on each trial from population activity using (1) a linear decoder (logistic classifier) or (2) a multi-layer neural network, which can exploit non-linearities. We found that, for each session, the two decoders performed similarly: the neural network outperforms the logistic decoder (barely) in just one session. The analysis suggests that the assumption of linear embedding of the decision variable is justified. We hope this analysis convinces the reviewer that “sophisticated analyses of the full neuronal state space” and “a simple average of [Tcon ] neurons” do in indeed yield roughly equivalent representations of the decision variable. We have included the results of this analysis in Supplementary Figure S12. See also item 2 of the Public response.

      𝑅∗2.2 (Figure 3) Add estimates of variability for variance and autocorrelation through time from single-trial signals:

      –   E.g. by bootstrapping. Would be helpful for making rigorous the discussion of when the deviation from the theory is outside what would be expected by chance, even if it doesn’t change the specific conclusions here.

      –   If possible, it would help (by simulations, or maybe an added reference if it exists) to substantiate the claim about the expected sub-linearity at later time-points (Figure 3a) due to the upper stopping bound and limited firing rate range.

      We thank the reviewer for this helpful comment. The revised Fig. 3 now contains error bars derived by bootstrapping (see Methods, §Variance and autocorrelation of smoothed diffusion signals). We have also added Supplementary Figure S5, which substantiates the sub-linearity claim using simulations.

      𝑅∗2.3 (Figure 4) Add controls and estimates of variability for decoding across sessions:

      –   As a baseline - what is the level of within-trial correlation/cosine similarity when random coding directions are used?

      –   What is the variability in the estimates of values shown in a/d/e?

      We have addressed each of these items. (1) Figure 4a now shows the s.e.m. of decoding accuracy (across sessions). (2) Regarding the variability of estimates shown in Figure 4d & e, the standard errors are displayed in the new Supplementary Figure S8. It makes sense to show them there because (i) there is no natural way to represent error on the heat maps in Figure 4, and (ii) S8 concerns the comparison of the values in Figure 4d & e to values derived from random coding directions. (3) Random coding directions lead to values of cosine similarity and within-trial correlation that do not differ significantly from zero. We show this in several ways, summarized in our reply to Public Review item 4. Additional details are in the revised manuscript (Methods: Similarity of single-trial signals) and the new Supplementary Figure S8. We also provide this information in response to Recommendation 5, above.

      𝑅∗2.4 (Figure 5) Add negative controls and significance tests to support claims about trends in leverage:

      –   What is the level of increase in leverage attained from random 1D projections of the data, or other projections where the prior would be no leverage?

      –   What is the range of leverage values fit for a simulated signal with a ground-truth of no trend?

      We have added two control analyses. In addition to a shuffle control, which destroys the relationship (Review Figure 1) we performed additional analyses that preserve the correspondence of neural signals and behavior on the same trial. We generated random coding directions (CDs) by establishing weight-vectors that were either chosen from a Normal distribution or by permuting the weights assigned to PC-1 in each session. The latter is the more conservative measure. Projections of the neural responses onto these random coding directions render 𝑆rand(𝑡). Specifically, the degree of leverage is effectively zero or very much reduced. These analyses are summarized in a new Supplementary Figure S10. The distributions of our test statistics (e.g., leverage on choice and RT) under the variants of the null hypothesis also support traditional metrics of statistical significance. Figure S10 (bottom row) also provides an approximate answer to the question: What degree of leverage and mediation would be expected for a theoretical decision variable? Briefly, we simulated 60,000 trials using the race model that best fits the behavioral data of monkey M. For any noise-free representation of a Markovian integration process, the leverage of an early sample of the DV on behavior would be mediated completely by later activity as the latter sample—up to the time of commitment—subsumes all variability captured by the earlier sample. We, therefore, generated 𝑆sim(𝑡) by first subsampling the simulated data to match the trial numbers of each session. To evaluate a DV approximated from the activity of 𝑁 Tconin neurons per session rather than the true DV represented by the entire population, we generated 𝑁 noisy instantiations of the signal for each of the subsampled, simulated trials. The noisy decision variable, 𝑆sim (t) is the mean activity of these 𝑁 noise-corrupted signals. The simulation is consistent with the leverage and incomplete mediation observed for the populations of Tcon neurons. For in additional details, see Methods, §Leverage of single-trial activity on behavior) and Supplementary Figure S10, caption. See also our response to item 1 of the Public Response.

      𝑅∗2.5 The analysis is performed across several signed coherence levels, with data detrended for each signed coherence and choice to enable comparison of fluctuations relative to the relevant baseline; are results similar for the different coherences?

      The results are qualitatively similar for individual coherences. There is less power, of course, because there are fewer trials. The analyses cannot be performed for coherences ≥ 12.8% because there are not enough trials that satisfy the inclusion criteria (presence of left and right choice trials with RT ≤ 670 ms). Nonetheless, leverage on choice and RT is statistically significant for 27 of the 30 combinations of motion strengths < 12.8% × three signals (𝑆ramp, 𝑆PC1 and 𝑆Tin) × behavioral measures (RT and choice) (RT: all 𝑝 < 0.008, Fisher-z; choice: all 𝑝 < 0.05, t-test ). The three exceptions are trials with 6.4% coherence rightward motion, which do not correlate significantly with RT on leftward choice trials. Reviewer Figure 4 shows the results of the leverage and mediation analyses, using only the 0% coherence trials.

      𝑅∗2.6 (Figure 6) Additional analysis to strengthen the claim that Min represents the integrand and not the integral:

      a. Repeating the analysis in Figure 6d with the integral (cumulative sum) of the single-trial Min signals and instead observing a significant increase in leverage over time would be strong evidence for this interpretation. If you again see no increase, then it suggests that the activity of these units (while direction selective) may not be strongly yoked to behavior. This scenario (no increasing leverage of the integral of Min on behavior through time) also raises an intriguing alternative possibility: that the noise driving the ’diffusion’ of drift-diffusion here may originate in the integrating circuit, rather than just reflecting the complete integration of noise in the stream of evidence itself.

      b. Repeating the analysis in Figure 6d with the projection of the M subspace onto its own first PC (e.g. take the union of units {Mrightin, Mleftin} [our ], do PCA just on those units’ single

      trial activities, identify the first PC, and project those activities on that dimension to obtain SPC1-M.

      c. Ameliorating the sample-size limitation by relaxing the criteria for inclusion in Min - performing the same analyses shown, but including all units with visual RFs overlapping the motion stimulus, irrespective of their direction selectivity.

      a. Reviewer Figure 2a provides support for leverage of the integral on choice, and this leverage, like , increases as a function of time. The effect is present in all seven sessions that have both and neurons (all 𝑝 < 1𝑒 − 10). However, as shown in panel b, the same integral fails

      to demonstrate more than a hint of leverage on RT (all correlations are negative) and the magnitude does not vary as a function of time. We suspect—but cannot prove—that this failure arises because of limited power and the expected weak effect. Recall that the mediation analysis of RT is restricted to longer trials and that the correlation between the Min difference and the signal is less than 0.1 over the heatmap in Fig. 6e, implying that the Min difference explains less than 1% of the variance of 𝑆Tin(𝑡). We considered including Reviewer Figure 2 in the paper, but we feel it would be disingenuous (cherrypicking) to report only the positive outcome of the leverage on choice. If the editors feel strongly about it, we would be open to including it, but leaving these analyses out of the revised manuscript seems more consistent with our effort to deëmphasize this finding. In the future, we plan to record simultaneously from populations MT and LIP neurons (Min and Tin, of course) and optimize Min neuron yield by placing the RDM stimulus in the periphery. We also provide this information in response to Recommendation (6) above.

      b.  We tried the R’s suggestion to apply PCA to the union of Min neurons , , fully expecting PC1 to comprise weights of opposite sign for the right and left preferring neurons, but that is not what we observed. Instead, the direction selectivity is distributed over at least two PCs. We think this is a reflection of the prominence of other signals, such as the strong visual response and normalization signals (see Shushruth et al., 2018). In the spirit of the R’s suggestion, we also established an “evidence coding direction” using a regression strategy similar to the Ramp CD applied to the union of Min neurons. The strategy produced a coding direction with opposite signed weights dominating the right and left subsets. The projection of the neural data on this evidence CD yields a signal similar to the difference variable used in Fig. 6e (i.e., signals that are approximately constant firing rates vs time and scale as a function of signed coherence). These unintegrated signals exhibit weak leverage on choice and RT, consistent with Figure 6d. However, the integrated signal has leverage on choice but not RT, similar to the integral of the difference signal in Reviewer Figure 2.

      c.   We do not understand the motivation for this analysis. We could apply PCA or dPCA (or the regression approach, described above) to the population of units with RFs that overlap the motion stimulus, but it is hard to see how this would test the hypothesis that direction-selective neurons similar to those in area MT supply the momentary evidence. As mentioned, we have very few Min neurons (as few as two in session 3). Future experiments that place the motion stimulus in the periphery would likely increase the yield of Min neurons and would be better suited to study this question. As such, we do not see the integrand-like responses of Min neurons as a major claim of the paper. Instead, we view it as an intriguing observation that deserves follow-up in future experiments, including simultaneous recordings from populations of MT and LIP neurons (Min and Tin, of course). We have softened the language considerably to make it clear that future work will be needed to make strong claims about the nature of Min neurons.

      𝑅∗2.7 Other questions: Figure 2c is described as showing the average firing rate of units in Tconin on single trials, but must also incorporate some baseline subtraction (as the shown traces dip into negative firing rates). Whatbaselineissubtracted? Aretheseresidualsignals, asdescribedforlaterfigures, orisadifferent method used? (Presumably, a similar procedure is used also for Figure 2a/b, given that all single-trial traces begin at 0.). Is the baseline subtraction justified? If the dataset really does reflect the decision variable with single-trial resolution, eliminating the baseline subtraction when visualizing single-trial activity might actually help to make the point clearer: trials which (for any reason) begin with a higher projection on the particular direction that furnishes the DV would be predicted to reach the decision bound, at any fixed coherence, more quickly than trials with a smaller projection onto this direction.

      We thank the reviewer for this comment. For each trial, the mean activity between 175 ms and 225 ms after motion onset was subtracted when generating the single-trial traces. The baseline subtraction was only applied for visualization to better portray the diffusion component in the signal. Unless otherwise indicated, all analyses are computed on non-baseline corrected data. We now describe in the caption of Figure 2 that “For visualization, single-trial traces were baseline corrected by subtracting the activity in a 50 ms window around 200 ms.” Examples of the raw traces used for all follow-up analyses are displayed in Reviewer Figure 6.

      Reviewer #3 (Recommendations For The Authors):

      I only have a few comments to make the paper more accessible:

      𝑅∗3.1 I struggle to understand how the linear fitting from -1 to 1 was done. More detail about how the single cell single-trial activity was generated to possibly go from -1 to 1 or do I completely misunderstand the approach? I assume the data standardization does that job?

      We have rephrased and added clarifying detail to the section describing the derivation of the ramp signal in the Methods (Ramp direction).

      We applied linear regression to generate a signal that best approximates a linear ramp, on each trial, 𝑖, that terminates with a saccade to the choice-target contralateral to the hemisphere of the LIP recordings. The ramps are defined in the epoch spanning the decision time: each ramp begins at 𝑓𝑖(𝑡0) = −1, where 𝑡0 \= 0.2 s after motion onset, and ends at 𝑓𝑖(𝑡1) = 1, where 𝑡1 \= 𝑡sac − 0.05 s (i.e., 50 ms before saccade initiation). The ramps are sampled every 25 ms and concatenated using all eligible trials to construct a long saw-tooth function (see Supplementary Figure S2). The regression solves for the weights assigned to each neuron such that the weighted sum of the activity of all neurons best approximates the saw-tooth. We constructed a time series of standardized neural activity, sampled identically to the saw-tooth. The spike times from each neuron are represented as delta functions (rasters) and convolved with a non-causal 25 ms boxcar filter. The mean and standard deviation of all sampled values of activity were used to standardize the activity for each neuron (i.e., Z-transform). The coefficients derived by the regression establish the vector of weights that define 𝑆ramp. The algorithm ensures that the population signal 𝑆ramp(𝑡), but not necessarily individual neurons, have amplitudes ranging from approximately −1 to 1.

      𝑅∗3.2 It is difficult to understand how the urgency signal is derived, to then generate fig S4.

      The urgency signal is estimated by averaging 𝑆𝑥(𝑡) at each time point relative to motion onset, using only the 0% coherence trials. We have clarified this in the caption of Supplementary Figure S4.

      Author response image 1.

      Shuffle control for Fig. 5. Breaking the within-trial correspondence between neural signal, 𝑆(𝑡), and choice suppresses leverage to near zero.

      Author response image 2.

      Leverage of the integrated difference signal on choice and RT. Traces are the average leverage across seven sessions. Same conventions as in Figure 5.

      Author response image 3.

      Trial-averaged 𝑆ramp activity during individual sessions. Same as Figure 2b for individual sessions for Monkey M (left) and Monkey J (right). The figure is intended to illustrate the consistency and heterogeneity of the averaged signals. For example, the saccade-aligned averages lose their association with motion strength before left (contra) choices in sessions 1, 2, 5, and 6 but retain the association in sessions 3, 4, 7, and 8.

      Author response image 4.

      Drift-diffusion signals have measurable leverage on choice and RT even when only 0%-coherence trials are included in the analysis.

      Author response image 5.

      Raw single-trial activity for three types of population averages. Representative single-trial activity during the first 300 ms of evidence accumulation using two motion strengths: 0% and 25.6% coherence toward the left (contralateral) choice target. Unlike in Figure 2 in the paper, single-trial traces are not baseline corrected by subtracting the activity in a 50 ms window around 200 ms. We highlight a number of trials with thick traces and these are the same trials in each of the rows.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews: 

      Reviewer #1 (Public review):

      Summary:

      The investigators in this study analyzed the dataset assembly from 540 Salmonella isolates, and those from 45 recent isolates from Zhejiang University of China. The analysis and comparison of the resistome and mobilome of these isolates identified a significantly higher rate of cross-region dissemination compared to localized propagation. This study highlights the key role of the resistome in driving the transition and evolutionary history of S. Gallinarum.

      Strengths:

      The isolates included in this study were from 16 countries in the past century (1920 to 2023). While the study uses S. Gallinarun as the prototype, the conclusion from this work will likely apply to other Salmonella serotypes and other pathogens.

      Thank you very much for your positive feedback. We recognize, as you noted, that emphasizing Salmonella enterica Serovar Gallinarum in the title may lead readers to perceive our methods and conclusions as overly restrictive. In light of your evaluation of our work, we have revised the title to: “Avian-specific Salmonella transition to endemicity is accompanied by localized resistome and mobilome interaction” We believe this final version not only reflects the applicability of our conclusions, as you appreciated, but also addresses your previous suggestion to highlight the resistome and mobilome.

      Revisions in the manuscript Lines: 1-3

      Weaknesses:

      While the isolates came from 16 countries, most strains in this study were originally from China.

      We believe that this issue was discussed in detail in our previous response. Although potential bias exists, we have minimized its impact by constructing the largest global S. Gallinarum genome dataset to date. In addition, we have further emphasized these limitations in the manuscript.

      Comments on revisions:

      This reviewer is happy with the detailed responses from the authors regarding revising this manuscript. I do not have further comments.

      We greatly appreciate your positive feedback and are pleased that our responses have addressed your concerns.

      Reviewer #2 (Public review):

      Summary:

      The authors sequence 45 new samples of S. Gallinarum, a commensal Salmonella found in chickens, which can sometimes cause disease. They combine these sequences with around 500 from public databases, determine the population structure of the pathogen, and coarse relationships of lineages with geography. The authors further investigate known anti-microbial genes found in these genomes, how they associate with each other, whether they have been horizontally transferred, and date the emergence of clades.

      Strengths:

      - It doesn't seem that much is known about this serovar, so publicly available new sequences from a high burden region are a valuable addition to the literature.

      - Combining these sequences with publicly available sequences is a good way to better contextualise any findings.

      - The genomic analyses have been greatly improved since the first version of the manuscript, and appropriately analyse the population and date emergence of clades.

      - The SNP thresholds are contextualised in terms of evolutionary time.

      - The importance and context of the findings are fairly well described.

      Thank you so much for your thorough review and constructive comments on the manuscript.

      Weaknesses:

      -  There are still a few issues with the genomic analyses, although they no longer undermine the main conclusions:

      We are grateful for the valuable time and effort you have dedicated to improving our manuscript. In this revision, we have provided a point-by-point response to each of your concerns. Moreover, with the addition of new supplementary materials and modifications to the figures, we have re-examined and adjusted the numbering of figures and supplementary materials in the text to ensure they appear correctly in the manuscript.

      (1) Although the SNP distance is now considered in terms of time, the 5 SNP distance presented still represents ~7yrs evolution, so it is unlikely to be a transmission event, as described. It would be better to use a much lower threshold or describe the interpretation of these clusters more clearly. Bringing in epidemiological evidence or external references on the likely time interval between transmissions would be helpful.

      We sincerely thank you for highlighting this issue. We appreciate your concern regarding the use of a 5-SNP threshold to define a transmission event, especially given the approximate 7-year evolutionary timeframe. Considering our updated estimate for the evolutionary rate of S. Gallinarum (approximately 0.74 SNPs per year, with a 95% HPD range of 0.42 to 1.06), we have revised the manuscript to use a 2-SNP threshold (approximately representing less than two years of evolution) to better control the temporal span of transmission events. In addition, we have updated the manuscript to reflect this new threshold and demonstrated that the use of a more stringent SNP threshold does not affect the overall conclusions of the study.

      Specifically, we adopted the newly established 2-SNP threshold to update Figure 3a and corresponding Supplementary Figure 8. The heatmap on the far right of New Figure 3a illustrates the SNP distances among 45 newly isolated S. Gallinarum strains from two locations in Zhejiang Province (Taishun and Yueqing). New Supplementary Figure 8 simulates potential transmission events between the bvSP strains isolated from Zhejiang Province (n=95) and those from other regions of China with available provincial information (n=435). These analyses collectively demonstrate the localized transmission patterns of bvSP within China.

      For New Figure 3a, we found that even with the 2-SNP threshold, the number of potential transmission events among the 45 newly isolated S. Gallinarum strains from the two Zhejiang locations (Taishun and Yueqing) remains unchanged. In fact, we observed that the results from SNP tracing using an SNP threshold of less than 5 are consistent (see Author response image 1). 

      Author response image 1.

      Clustering results of 45 newly isolated S. Gallinarum strains using different SNP thresholds of 1, 2, 3, 4, and 5 SNPs. The five subplots represent the clustering results under each threshold. Each point corresponds to an individual strain, and lines connect strains with potential transmission relationships.

      For New Supplementary Figure 8, we employed the 2-SNP threshold and found that the number of transmission events between the bvSP strains isolated from Zhejiang Province (n=95) and those from other Chinese provinces (n=435) decreased from 91 to 53. The names of the strains involved in these potential transmission events are listed in Supplementary Table 5.

      Revisions in the manuscript

      Lines: 352-357

      Figures: Figure 3; Supplementary Figure 8

      Table: Supplementary Table 5

      (2) The HGT definition has not fundamentally been changed and therefore still has some issues, mainly that vertical evolution is still not systematically controlled for. 

      We sincerely thank you for highlighting this issue. We hope the following explanation will help clarify and improve our manuscript, as well as address your concerns.

      In bacteria, mobile genetic elements (MGEs) such as plasmids, transposons, integrons, and prophages, as mentioned in our manuscript, are segments of DNA that encode enzymes and proteins responsible for mediating the movement of genetic material between bacterial genomes (commonly referred to as “jumping genes”). These MGEs contribute to the mechanisms of horizontal gene transfer (HGT) in Salmonella, including transduction (via prophages), conjugation (via plasmids), and transposition (via integrons and transposons) (Nat Rev Microbiol. 2005 Sep;3(9):722-32). These “jumping genes” can enable Salmonella to acquire additional antimicrobial resistance genes (ARGs), which may not only originate from other Salmonella strains but also from distantly related species.

      To further address your concern regarding the systematic control of vertical evolution, we employed the HGTphyloDetect pipeline developed by Le Yuan et al. (Brief Bioinform. 2023 Mar 19;24(2):bbad035) to control for vertical evolution in the ARG sequences mentioned in our manuscript. We chose HGTphyloDetect because, as noted, "jumping genes" often occur among evolutionarily distant species, rendering the use of Gubbins potentially unsuitable for these distant HGT events.

      Using the HGTphyloDetect pipeline, we extracted base sequences for the eight ARGs shown in Figure 6b with an HGT frequency greater than zero (bla<sup>TEM-1B</sup>, sul1, dfrA17, aadA5, sul2, aph(3’’)-Ib, tet(A), aph(6)-Id). For bla<sup>TEM-1B</sup>, sul1, dfrA17, aadA5, and sul2, the HGT frequency reached 100% across different isolates, indicating that these ARG sequences have a unique sequence type. In contrast, due to the ResFinder settings requiring both similarity and coverage to meet a minimum value of 90%, the base sequences for aph(3’’)-Ib, tet(A), and aph(6)-Id are not unique. Consequently, we applied the HGTphyloDetect pipeline individually to each sequence type of ARGs to verify their association with HGT events. Specifically, among 436 bvSP isolates collected in China, we identified two sequence types of aph(3’’)-Ib, four sequence types of tet(A), and three sequence types of aph(6)-Id.

      Subsequently, to identify potential ARGs horizontally acquired from evolutionarily distant organisms, we queried the translated amino acid sequences of each ARG against the National Center for Biotechnology Information (NCBI) non-redundant protein database. We then evaluated whether these sequences were products of HGT by calculating Alien Index (AI) scores and out_perc values.

      The calculation of AI score is as follows:

      In this study, bbhG and bbhO represent the E-values of the best blast hit in ingroup and outgroup lineages, respectively. The outgroup lineage is defined as all species outside of the kingdom, while the ingroup lineage encompasses species within the kingdom but outside of the subphylum. An AI score ≥ 45 is considered a strong indicator that the gene in question is likely derived from an HGT event.

      Regarding the calculation method for out_perc:

      Finally, according to the definition provided by the HGTphyloDetect pipeline, ARGs with AI score ≥ 45 and out_perc ≥ 90% are presumed to be potential candidates for HGT from evolutionarily distant species. We have compiled the calculation results for the aforementioned genes in New Supplementary Table 9. The results indicate that all ARGs presented in Figure 6b, which exhibited a HGT frequency greater than zero, were acquired horizontally by S. Gallinarum. Based on these findings, we have revised the manuscript accordingly.

      Revisions in the manuscript

      Lines: 302-307; 616-650; 955-957

      Table: Supplementary Table 9

      Using a 5kb window is not sufficient, as LD may extend across the entire genome.

      We agree with your point that linkage disequilibrium (LD) could influence the transmission of genes within chromosomal regions. LD can lead to the non-random cooccurrence of alleles at different loci within a population. Considering that horizontal gene transfer (HGT) events involving more distantly related ARGs may be accompanied by vertical propagation on chromosomes, and to simultaneously assess the impact of LD, we conducted two evaluations.

      It is important to note that the following assessments are based on the assumption that plasmid replicons detected by PlasmidsFinder are part of self-replicating, extrachromosomal DNA.

      (1) In the revised pipeline used to calculate ARG HGT frequencies, we categorized a total of 621 ARGs carried by 436 bvSP isolates collected in China and found that 415 of these ARGs were located on MGEs. We further investigated the distribution of these 415 ARGs across different MGEs, taking into account the complex nesting relationships among them. We observed that 90% of the ARGs (372/415) were located on plasmid contigs. It is important to clarify that this finding does not contradict our statement in the manuscript regarding plasmids and transposons as the primary reservoirs for resistome geo-temporal dissemination. This is because transposons, integrons, and prophages carrying ARGs can also be found on plasmids. Additionally, only 25 bvSG isolates from China contained ARGs, which were likely acquired via transposons or integrons located on the chromosome.  

      (2) In our manuscript, we searched for ARGs within a 5kb upstream and downstream region (a total of 10kb) of transposons and integrons (The BLASTn parameters used in the Bacant pipeline to identify transposons and integrons were set to a coverage threshold of 60%, rather than 100%). However, in light of the potential impact of LD on vertical transmission, we expanded our search to include a 10kb upstream and downstream range (a total of 20kb)  for these 25 isolates. The decision to expand the search range to 10kb upstream and downstream range is based on the following two considerations: 1) Based on literature, we determined the overall lengths of the integrons and transposons carried by the 25 isolates (Tn801, Tn6205, Tn1721, In498, In1440, In473, and In282), and found that the maximum length of these elements is ~13.5 kb. Using a 10kb upstream and downstream threshold effectively covers these integrons/transposons. 2) The limitation posed by genomic fragmentation due to next-generation sequencing, which restrict the search range. We present the results of this expanded search for colocalization of ARGs with transposons and integrons at: Figshare:  https://doi.org/10.6084/m9.figshare.28129130.v1

      We found that these results were consistent with those obtained using the previous search range.

      Taken together, these results suggest that although linkage disequilibrium may influence genetic processes within chromosomal regions—particularly for the few chromosomeassociated antibiotic resistance genes linked to integrons and transposons—the overall impact in our study is likely minimal. This conclusion is supported by the observation that 90% of the ARGs in our dataset are located on plasmids, and even an expanded search range does not alter this outcome. Additionally, by incorporating Alien Index scores and calculating out_perc, we can further confirm the occurrence of horizontal gene transfer events.

      However, it is undeniable that other studies using our current pipeline may be affected. As a temporary remedial measure, we have included a note in the "README" file  as below (https://github.com/tjiaa/Cal_HGT_Frequency):

      “Note: Considering that ARGs located on the chromosome and carried by mobile genetic elements—such as integrons and transposons—may introduce potential computational errors, we recommend evaluating the number of ARGs associated with these elements on the chromosome during your analysis. If a majority of ARGs in your dataset fall into this category, we suggest using additional methods to evaluate the potential impact of linkage disequilibrium. Additionally, by modifying the “MGE_start” and “MGE_end” parameters in the “eLife_MGE_ARG_Co_location.ipynb” script, you can assess the distance between different ARGs and integrons or transposons on the chromosome. This approach will further aid in evaluating the impact of linkage disequilibrium on the genetic process.”

      We believe this approach will assist researchers in further assessing the potential impact of vertical evolution and help other users determine whether additional methods are necessary to account for such effects.

      As the authors have now run gubbins correctly, they could use the results from this existing analysis to find recent HGT.

      We sincerely thank you for your valuable suggestion. Utilizing additional methods to predict potential horizontal gene transfer (HGT) events could indeed enhance the robustness of the results. However, "jumping genes" often occur among evolutionarily distant species, rendering the use of Gubbins potentially unsuitable for these distant HGT events.

      Furthermore, the primary focus of our study is to identify HGT of antimicrobial resistance genes (ARGs) in the Salmonella genome driven by mobile genetic elements. Therefore, we employed the HGTphyloDetect pipeline developed by Le Yuan et al. (Brief Bioinform. 2023 Mar 19;24(2):bbad035) to control for vertical evolution in the ARG sequences. The specific computational methods and conclusions have been detailed above.

      To definite mobilisation, perhaps a standard pipeline such (e.g. https://github.com/EBIMetagenomics/mobilome-annotation-pipeline) would be more convincing.

      Thank you for your valuable suggestion. We agree that defining mobilization using a standardized pipeline can add rigor and clarity to our analysis. The pipeline you referenced (https://github.com/EBI-Metagenomics/mobilome-annotation-pipeline) is an excellent resource and provides a robust approach to the identification and annotation of mobile genetic elements.

      We have examined and run this pipeline, which uses “IntegronFinder” and “ICEfinder” to detect integrons, “geNomad” to identify plasmids, and “geNomad” and “VIRify” to detect prophages. Our initial checks revealed that the numbers of integrons, plasmids, and prophages identified using this pipeline were consistent with those detected in our study. However, due to the significantly different output formats, the results from this pipeline could not be integrated with the pipeline we used for calculating HGT frequency.

      We will incorporate the standardized pipeline you suggested in future studies to further improve the reliability of our findings.

      (3) The invasiveness index is better described, but the authors still did not provide convincing evidence that the small difference is actually biologically meaningful (there was no statistical difference between the two strains provided in response Figure 6). What do other Salmonella papers using this approach find, and can their links be brought in? If there is still no good evidence, a better description of this difference would help make the conclusions better supported.

      We sincerely appreciate your thoughtful feedback. The initial introduction of the invasiveness index in our manuscript aimed to quantitatively assess the differences in invasiveness between two geographically distinct strains of S. Gallinarum (isolated from Taishun and Yueqing) by comparing the degradation of 196 top predicted genes associated with invasiveness in their genomes. We found a highly significant statistical difference (P < 0.0001) in the invasiveness index between them.

      Several studies have also employed the invasiveness index to predict biological relevance in Salmonella strains, and we believe these examples provide further context for our approach:

      (1) Caisey V. Pulford et al, Nat Microbiol, 2021, used the same method to calculate the invasiveness index for Salmonella Typhimurium and employed it to characterize the invasiveness of different lineage strains. They found that Salmonella in Lineage-3 exhibited the highest invasiveness index, suggesting an adaptation from an intestinal to a systemic lifestyle. The authors noted, "Although the invasiveness index cannot yet be experimentally validated, Salmonella isolates with different invasiveness indices produce distinct clinical symptoms in a human population (BMC Med. 2020 Jul 17; 18(1):212)". They emphasized the necessity of developing more robust methods to measure Salmonella invasiveness.

      (2) Sandra Van Puyvelde et al, Nat Commun, 2019, reported that Salmonella Typhimurium sequence type 313 (ST313) lineage II.1 exhibited a higher invasiveness index compared to lineage II, suggesting that the two lineages might have distinct adaptations to an invasive lifestyle. Further experiments demonstrated significant differences between these lineages in terms of biofilm formation (A red dry and rough (RDAR) assay) and metabolic capacity for carbon compounds.

      (3) Wim L. Cuypers et al, Nat Commun, 2023, calculated the invasiveness index for 284 global Salmonella Concord strains across different lineages and found that Lineage-4 potentially exhibited the highest invasiveness.

      Given these evidences, we acknowledge that no significant difference in mortality was observed between the L2b and L3b S. Gallinarum strains in 16-day-old SPF chicken embryos. Existing literature suggests that strains with higher invasiveness indices may still exhibit differences in biofilm formation and metabolic capacities, reflecting their adaptation to different host environments. As such, we maintain that the invasiveness index remains a valuable metric for evaluating the genomic differences between S. Gallinarum strains from Taishun and Yueqing. We plan to further investigate these differences through phenotypic experiments in our next research.

      In the revised manuscript, we have added the following discussion along with additional references:

      Lines 358-365: “Moreover, the invasiveness index of bvSP from Taishun and Yueqing suggests that different lineages of S. Gallinarum recovered from distinct regions may exhibit biological differences. Previous studies have shown that strains with higher invasiveness indexes tend to be more virulent in hosts (30, 31), potentially causing neurological or arthritic symptoms in S. Gallinarum infections. Furthermore, strains with varying invasiveness indexes have been confirmed to differ in their biofilm formation abilities and metabolic capacities for carbon compounds (32).”

      Revisions in the manuscript:

      Lines: 358-365, 806-827.

      In summary, the analysis is broadly well described and feels appropriate. Some of the conclusions are still not fully supported, although the main points and context of the paper now appear sound.

      Thank you so much for your positive evaluation of our work. We hope that the revised manuscript meets your expectations and offers a more accurate interpretation of our findings.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      This is a great improvement over the first version and I thank the authors for a thorough response, as well as changing their conclusions in response to their improvements.

      Other small remaining issues:

      Figure 3: Heatmap of SNPs is hard to read in grayscale. It also just represents the between clade distances already shown by the tree. It would be more useful to present intraclade distances only to see the SNP resolution _within_ each lineage. Using a better colour scheme would also help.

      Thank you for your insightful comments and suggestions regarding Figure 3. We agree that the grayscale heatmap may present challenges in terms of visual clarity. To address this, we have updated the heatmap with a more distinct color gradient, ensuring better contrast and easier interpretation (New Figure 3). 

      Regarding your second suggestion: "It would be more useful to present intraclade distances only to see the SNP resolution within each lineage," we believe it is already addressed in the current version of New Figure 3. Specifically, the heatmap on the right side of New Figure 3 illustrates the SNP distances between S. Gallinarum isolates from Taishun and Yueqing, with the goal of demonstrating that genomic variation within isolates from a single region is generally smaller compared to those from different regions. In this figure, 45 newly isolated S. Gallinarum strains are categorized into two lineages: L2b and L3b. The heatmap on the right side of Figure 3 displays the SNP distances between all pairwise combinations of these 45 strains, where the intraclade distances are represented by the red regions (highlighting the pairwise distances within each lineage, specifically L3b and L2b, which are indicated by two triangles). The between-clade distances are shown by the blue regions.

      We also believe in further exploring the intraclade distances across the entire dataset of 580 S. Gallinarum strains, as it could provide additional insights. However, this analysis would extend beyond the scope of the current section.

      Revisions in the manuscript Line: 998

      Figure: Figure 3

      Please remove Figure 6c, it does not add anything to the paper and raises questions about performing this regression.

      Thank you for pointing out this issue. We have removed Figure 6c and the corresponding description in the "Results" section from the manuscript (New Figure 6).

      Revisions in the manuscript Lines: 316, 319, 1035-1041.

      Figure: Figure 6

      Again, thank you all for your time and efforts in reviewing our work. We believe the improved manuscript meets the high standards of the journal.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Weaknesses & incompletely supported claims:

      (1) A central mechanistic claim of the paper is that "DCP1a can regulate DCP2's cellular decapping activity by enhancing DCP2's affinity to RNA, in addition to bridging the interactions of DCP2 with other decapping factors. This represents a pivotal molecular mechanism by which DCP1a exerts its regulatory control over the mRNA decapping process." Similar versions of this claim are repeated in the abstract and discussion sections. However, this appears to be entirely at odds with the observation from in vitro decapping assays with immunoprecipitated DCP2 that showed DCP1 knockout does not significantly affect the enzymatic activity of DCP2 (Figures 2B-D; I note that there may be a very small change in DCP2 activity shown in panel C, but this may be due to slightly different amounts of immunoprecipitated DCP2 used in the assay, as suggested by panel D). If DCP1 pivotally regulates decapping activity by enhancing RNA binding to DCP2, why is no difference in decapping activity observed in the absence of DCP1?

      Furthermore, the authors show only weak changes in relative RNA levels immunoprecipitated by DCP2 with versus without DCP1 (~2-3 fold change; consistent with the Valkov 2016 NSMB paper, which shows what looks like only modest changes in RNA binding affinity for yeast Dcp2 +/- Dcp1). Is the argument that only a 2-3 fold change in RNA binding affinity is responsible for the sizable decapping defects and significant accumulation of deadenylated intermediates observed in cells upon Dcp1 depletion? (and if so, why is this the case for in-cell data, but not the immunoprecipitated in vitro data?)

      We appreciate the reviewer's thoughtful comments on our paper. The reviewer points out an apparent contradiction between the claim that DCP1a regulates DCP2's cellular decapping activity and the observation that knocking out DCP1a does not significantly affect DCP2's enzymatic activity in vitro. However, it is important to underscore the challenge of reconciling differences between in vitro and in vivo experiments in scientific research. Although in vitro systems provide a controlled environment, they have inherent limitations that often fail to capture the complexities of cellular processes. Our in vitro experiments used immunoprecipitated proteins to ensure the presence of relevant factors, but these experiments cannot fully replicate the precise stoichiometry and dynamic interactions present in a cellular environment. Furthermore, the limited volume in vitro can actually facilitate reactions that may not occur as readily in the complex and heterogeneous environment of a cell. Therefore, the lack of a significant difference in decapping activity observed in vitro does not necessarily negate the regulatory role of DCP1 in the cellular context. Rather, it underscores our previous oversight of DCP1's importance in the decapping process under in vitro conditions. The conclusions regarding DCP1's regulatory mechanisms remain valid and supported by the presented evidence, especially when considering the inherent differences between in vitro and in vivo experimental conditions. It is precisely because of these differences that we recognized our previous underestimation of DCP1's significance. Therefore, our subsequent experiments focused on elucidating DCP1's regulatory mechanisms in the decapping process

      The authors acknowledge this apparent discrepancy between the in vitro DCP2 decapping assays and in-cell decapping data, writing: "this observation could be attributed to the inherent constraints of in vitro assays, which often fall short of faithfully replicating the complexity of the cellular environment where multiple factors and cofactors are at play. To determine the underlying cause, we postulated that the observed cellular decapping defect in DCP1a/b knockout cells might be attributed to DCP1 functioning as a scaffold." This is fair. They next show that DCP1 acts as a scaffold to recruit multiple factors to DCP2 in cells (EDC3, DDX6, PatL1, and PNRC1 and 2). However, while DCP1 is shown to recruit multiple cofactors to DCP2 (consistent with other studies in the decapping field, and primarily through motifs in the Dcp1 C-terminal tail), the authors ultimately show that *none* of these cofactors are actually essential for DCP2-mediated decapping in cells (Figures 3A-F). More specifically, the authors showed that the EVH1 domain was sufficient to rescue decapping defects in DCP1a/b knockout cells, that PNRC1 and PNRC2 were the only cofactors that interact with the EVH1 domain, and finally that shRNA-mediated PNRC1 or PNCR2 knockdown has no effect on in-cell decapping (Figures 3E and F). Therefore, based on the presented data, while DCP1 certainly does act as a scaffold, it doesn't seem to be the case that the major cellular decapping defect observed in DCP1a/b knockout is due to DCP1's ability to recruit specific cofactors to DCP2.

      The findings that none of the decapping cofactors recruited by DCP1 to DCP2 are essential for decapping in cells further underscore the complexity of the decapping process in vivo. This observation suggests that while DCP1's scaffolding function is crucial for recruiting cofactors, the decapping process likely involves additional layers of regulation that are not fully captured by our current understanding of DCP1. Furthermore, the reviewer mentions that the observed changes in RNA binding affinity (approximately 2-3 fold) in our in vitro experiments seem relatively modest. While these changes may appear insignificant in vitro, their cumulative impact in the dynamic cellular environment could be substantial. Even minor perturbations in RNA binding affinity can trigger cascading effects, leading to significant changes in decapping activity and the accumulation of deadenylated intermediates upon Dcp1 depletion. Cellular processes involve complex networks of interrelated events, and small molecular changes can result in amplified biological outcomes. The subtle molecular variations observed in vitro may translate into significant phenotypic outcomes within the complex cellular environment, underscoring the importance of DCP1a's regulatory role in the cellular decapping process.

      So as far as I can tell, the discrepancy between the in vitro (DCP1 not required) and in-cell (DCP1 required) decapping data, remains entirely unresolved. Therefore, I don't think that the conclusions that DCP1 regulates decapping by (a) changing RNA binding affinity (authors show this doesn't matter in vitro, and that the change in RNA binding affinity is very small) or (b) by bridging interactions of cofactors with DCP2 (authors show all tested cofactors are dispensable for robust in-cell decapping activity), are supported by the evidence presented in the paper (or convincingly supported by previous structural and functional studies of the decapping complex).

      We have addressed the reconciliation of differences between in vitro and in vivo experiments in the revised manuscript and emphasized the importance of considering cellular interactions when interpreting our findings.

      (2) Related to the RNA binding claims mentioned above, are the differences shown in Figure 3H statistically significant? Why are there no error bars shown for the MBP control? (I understand this was normalized to 1, but presumably, there were 3 biological replicates here that have some spread of values?). The individual data points for each replicate should be displayed for each bar so that readers can better assess the spread of data and the significance of the observed differences. I've listed these points as major because of the key mechanistic claim that DCP1 enhances RNA binding to DCP2 hinges in large part on this data.

      Thank you for your feedback. Regarding your comments on the statistical significance of the differences shown in Figure 3H and the absence of error bars for the MBP control, we will address these concerns in the revised manuscript. We’ll include individual data points for the three biological replicates and corresponding statistical analysis to more clearly demonstrate the data spread and significance of the observed differences.

      (3) Also related to point (1) above, the kinetic analysis presented in Figure 2C shows that the large majority of transcript is mostly decapped at the first 5-minute timepoint; it may be that DCP2-mediated decapping activity is actually different in vitro with or without DCP1, but that this is being missed because the reaction is basically done in less than 5 minutes under the conditions being assayed (i.e. these are basically endpoint assays under these conditions). It may be that if kinetics were done under conditions to slow down the reaction somewhat (e.g. lower Dcp2 concentration, lower temperatures), so that more of the kinetic behavior is captured, the apparent discrepancy between in vitro and in-cell data would be much less. Indeed, previous studies have shown that in yeast, Dcp1 strongly activates the catalytic step (kcat) of decapping by ~10-fold, and reduces the KM by only ~2 fold (Floor et al, NSMB 2010). It might be beneficial to use purified proteins here (only a Western blot is used in Figure 2D to show the presence of DCP2 and/or DCP1, but do these complexes have other, and different, components immunoprecipitated along with them?), if possible, to better control reaction conditions.

      This contradiction between the in vitro and in-cell decapping data undercuts one of the main mechanistic takeaways from the first half of the paper. This needs to be addressed/resolved with further experiments to better define the role of DCP1-mediated activation, or the mechanistic conclusions significantly changed or removed.

      We genuinely appreciate the reviewer’s insightful comments on the kinetic analysis presented in Figure 2C. Your astute observation regarding the potential influence of reaction duration on the interpretation of in vitro decapping activity, especially in the absence of DCP1, is well-received. The time-sensitive nature of our experiments, as you rightly pointed out, might not fully capture the nuanced kinetic behaviors. In addition, the DCP2 complex purified from cells could not be precisely quantified. In response to your suggestion, we attempted to purify human DCP2 protein from E. coli; however, regrettably, the purified protein failed to exhibit any enzymatic activity. This disparity may be attributed to species differences.

      Considering the reviewer’s valuable insights, our revised manuscript emphasized that purified DCP2 from cells exhibits activity regardless of the presence of DCP1. This adjustment aims to provide a clearer perspective on our findings and to better align with the nuances of our experimental design and the meticulous consideration of the results.

      (4) The second half of the paper compares the transcriptomic and metabolic profiles of DCP1a versus DCP1b knockouts to reveal that these target a different subset of mRNAs for degradation and have different levels of cellular metabolites. This is a great application of the DCP1a/b KO cells developed in this paper and provides new information about DCP1a vs b function in metazoans, which to my knowledge has not really been explored at all. However, the analysis of DCP1 function/expression levels in human cancer seems superficial and inconclusive: for example, the authors conclude that "...these findings indicate that DCP1a and DCP1b likely have distinct and non-redundant roles in the development and progression of cancer", but what is the evidence for this? I see that DCP1a and b levels vary in different cancer cell types, but is there any evidence that these changes are actually linked to cancer development, progression, or tumorigenesis? If not, these broader conclusions should be removed.

      Thank you to the reviewer for pointing out that such a description may be misleading. We have removed our previous broader conclusion and revised our sentences. To further explore the potential impact of DCP1a and DCP1b on cancer progression, we examined the association between the expression levels of DCP1a and DCP1b and progression-free interval (PFI). We have incorporated this information into our revised manuscript.

      (5) The authors used CRISPR-Cas9 to introduce frameshift mutations that result in premature termination codons in DCP1a/b knockout cells (verified by Sanger sequencing). They then use Western blotting with DCP1a or DCP1b antibodies to confirm the absence of DCP1 in the knockout cell lines. However, the DCP1a antibody used in this study (Sigma D5444) is targeted to the C-terminal end of DCP1a. Can the authors conclusively rule out that the CRISPR/Cas-generated mutations do not result in the production of truncated DCP1a that is just unable to be detected by the C-terminally targeted antibody? While it is likely the introduced premature termination codon in the DCP1a gene results in nonsense-mediated decay of the resulting transcript, this outcome is indeed supported by the knockout results showing large defects in cellular decapping which can be rescued by the addition of the EVH1 domain, it would be better to carefully validate the success of the DCP1a knockout and conclusively show no truncated DCP1a is produced by using N-terminally targeted DCP1a antibodies (as was the case for DCP1b).

      Thank you for your insightful comment regarding the validation of our DCP1a/b knockout cell line. We acknowledge your point about the DCP1a C-terminal targeting of the Sigma D5444 antibody used in our Western blot analysis. We agree that we cannot definitively rule out the possibility of truncated DCP1a protein production solely based on the lack of full-length protein detection. To address this limitation, we utilized a commercial information available N-terminally targeted DCP1a antibody (aviva ARP39353_T100) in a Western blot analysis. This will allow us to comprehensively detect any truncated protein fragments remaining after the CRISPR-Cas9-generated frameshift mutation.

      Some additional minor comments:

      • More information would be helpful on the choice of DCP1 truncation boundaries; why was 1-254 chosen as one of the truncations?

      Thank you for the reviewer's comment and suggestion. Regarding the choice of DCP1 1-254 truncation boundaries based on the predicted structure from AlphaFoldDB (A0A087WT55). We will include this information in the revised manuscript.

      • Figure S2D is a pretty important experiment because it suggests that the observed deadenylated intermediates are in fact still capped; can a positive control be added to these experiments to show that removal of cap results in rapid terminator-mediated degradation?

      Unfortunately, due to our institution's current laboratory safety policies, we are unable to perform experiments involving the use of radioactive isotopes such as 32P. Therefore, while adding the suggested positive control experiment to demonstrate rapid RNA degradation upon decapping would further validate our interpretation, we regret that we cannot carry out this experiment at the moment. However, the observed deadenylated intermediates in Figure S2D match the predicted size of capped RNA fragments, and not the expected sizes of degradation products after decapping. Furthermore, previous literature has well-established that for these types of RNAs, decapping leads directly to rapid 5' to 3' exonuclease-mediated degradation, without producing stable deadenylated intermediates. Thus, we believe that the current data is sufficient to support our conclusion that the deadenylated intermediates retain the 5' cap structure.

      Reviewer #2 (Public Review):

      Weaknesses:

      The direct targets of DCP1a and/or DCP1b were not determined as the analysis was restricted to RNA-seq to assess RNA abundance, which can be a result of direct or indirect regulation by DCP1a/b.

      Thank you for raising this important point. In our study, we acknowledge that the use of RNA-seq to assess RNA abundance provides a broad overview of the regulatory impacts of DCP1a and DCP1b. This method captures changes in RNA levels that may arise from both direct and indirect regulatory actions of these proteins. While we did not directly determine the targets of DCP1a and DCP1b, the data obtained from our RNA-seq analysis serve as a foundational step for future targeted experiments, which could include techniques such as RIP-seq, to delineate the direct targets of DCP1a and DCP1b more precisely. We believe that our current findings contribute valuable information to the field and pave the way for these subsequent analyses.

      P-bodies appear to be larger in human cells lacking DCP1a and DCP1b but a lack of image quantification prevents this conclusion from being drawn.

      Thank you for the reviewer’s valuable feedback. We have addressed the reviewer’s concern regarding P-bodies' size in human cells lacking DCP1a and DCP1b. We have now performed image quantification and can confirm that P-bodies are indeed larger in these cells.

      The lack of details in the methodology and figure legends limit reader understanding.

      We acknowledge the reviewer's concerns regarding the level of detail provided in the methodology and figure legends. To address this, we are committed to enhancing both sections with additional details and clarifications in our revised manuscript. Thank you for bringing this to our attention.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) To me, the second half of the paper comparing DCP1a and DCP1b is in many ways distinct from the first half and could stand on its own as an interesting paper if this comparative analysis is explored a little deeper (maybe by validating some of the differences in decay observed for individual mRNAs targeted by DCP1a versus DCP1b, by measuring and comparing the decay rates of some individual transcripts under differential control by DCP1a vs b?), and revising the conclusions about links to cancer as mentioned above. I think these later comparative results in the paper present the most new and interesting data concerning DCP1 function in humans (especially since I think the mechanistic conclusions from the first half aren't well supported yet or are at least inconsistent), but when I read these later sections of the paper I struggle to understand the key takeaways from the transcriptomic and metabolomic data.

      Thank you for the reviewer's suggestions. Estimating the decay rates of individual transcripts within the transcriptomes of DCP1a_KO, DCP1b_KO, and wild type can provide insight into the direct targets of DCP1a or DCP1b. However, this requires either time-series RNA-seq or specialized sequencing technologies such as Precision Run-On sequencing (PRO-seq) or RNA Approach to Equilibrium Sequencing (RATE-Seq). Unfortunately, we lack the necessary dataset in our project to estimate the decay rates for the potential targets identified in our RNA-seq data. Despite this limitation, we acknowledge the potential of this approach in identifying the true targets of DCP1a and DCP1b and have included this idea in our discussion.

      (2) I think it would be helpful to add a little more descriptive or narrative language to the figure legends (I know some of them are already quite long!) so that readers can follow the general idea of the experiment through the figure legend as well as the main text; as written, the figure legends are mostly exclusively technical details, so it can be hard to parse what experiment is being carried out in some cases.

      Thank you for the reviewer’s suggestion, we will strive to improve the language of the figure legends to include technical details while clearly conveying the main idea of the experiment. We will ensure that the language of the figure legends is more readable and comprehensible so that readers can more easily parse what experiment is being carried out.

      Reviewer #2 (Recommendations For The Authors):

      Suggestions for improved or additional experiments, data, or analyses:

      The use of RNA-seq to measure RNA abundance in DCP1a and/or b knockout cells can give some insight into both the indirect and direct effects of DCP1a/b on gene expression but cannot identify the direct targets of these genes. Rather, global analysis of RNA stability or capturing uncapped RNA decay intermediates would allow the authors to conclude they have identified direct targets of DCP1a and/or b. Without such analyses, the interpretation of these data should be scaled back to clearly state that RNA levels can be altered through indirect effects of DCP1a/b absence throughout the text.

      We appreciate the reviewer's suggestion. We have modified our sentences to emphasize that the dysregulated genes could be caused by both direct and indirect effects.

      A control/randomly generated gene list should be analyzed for GO terms to determine whether the enrichment of cancer-related pathways in the differentially expressed genes in the DCP1a/b knockout cells is meaningful.

      Thank you for the reviewer's comment. We shuffled our gene list and reperformed the pathway enrichment analysis in Figure 4C and 4D 1,000 times. We focused on the following cancer-related pathways: E2F targets, MTORC1 signaling, G2M checkpoint, MYC target V1, EMT transition, KRAS signaling DN, P53 pathway, and NOTCH signaling pathways. We then calculated how many times the q-values obtained from the shuffled gene list were more significant than the q-value obtained from our real data. In four of the eight pathways (E2F targets, MTORC1 signaling, G2M checkpoint, and MYC target v1), none of the shuffled gene lists resulted in a q-value smaller than the real one. In the other four pathways (EMT transition, KRAS signaling DN, P53 pathway, and NOTCH signaling pathways), the q-values were smaller than the real q-value 2, 11, 4, and 4 times out of the 1000 shuffles. Based on the shuffled results, we conclude that the transcriptome of DCP1a/b knockout cells is statistically enriched in these cancer-related pathways.

      Author response image 1.

      Distribution of q-values resulting from the Gene Set Enrichment Analysis (GSEA) conducted on 1,000 shuffled gene lists for eight cancer-related pathways. The q-values derived from Figure 4C and 4D are indicated by red (DCP1a_KO) and blue (DCP1b_KO) dashed lines, respectively. Some q-values derived from Figure 4C are too small to be labeled on the plots, such as in E2F targets (q value: 5.87E-07), MTORC1 signaling (q values: 6.59E-07 and 1.58E-06 for DCP1a_KO and DCP1b_KO, respectively), MYC target V1 (q value: 0.004644174 for DCP1a_KO), etc. The numbers x/1000 indicate how often the shuffled q-values were smaller than the real q-value out of 1,000 permutations.

      Comparisons of the DCP1a and/or b knockout RNA-seq results should be done to published datasets such as those published by Luo et al., Cell Chemical Biology (2021) to determine whether there are common targets with DCP2 and validate the reported findings.

      Thank you for reviewer’s suggestion. We compared the upregulated genes from DCP1a_KO, DCP1b_KO, and DCP1a/b_KO cell lines with the 91 targets of DPC2 identified by Luo et al. in Cell Chemical Biology (2021). Only EPPK1 was found to be overlapped between the potential DCP1b_KO targets and the targets of DCP2. No genes were found to be overlapped between the potential DCP1a_KO targets and the targets of DCP2. However, three genes, TES, PAX6, and C18orf21, were found to be overlapped between the significantly upregulated DEGs of DCP1a/b_KO and the targets of DCP2. We have included this information in the discussion section.

      The RNA tethering assays are not clear and are difficult to interpret without further controls to delineate the polyadenylated and deadenylated species.

      Thank you for the reviewer’s feedback. We acknowledge that the reviewer might harbor some doubts regarding the outcomes of the RNA tethering assays. Nonetheless, this methodology is well-established and has also found extensive application across many studies. We are committed to enhancing the clarity of our experiment’s details and results within the figure legends and textual descriptions.

      The representative images of p-bodies clearly show that DCP1a/b KO cells have larger p-bodies than the wild-type cells. The authors should quantify p-body size in each image set as the current interpretation of the data is that there is no difference in size or number of p-bodies, but the data suggest otherwise.

      Thank you very much for the reviewer’s insightful comments and for drawing our attention to the need to quantify p-body sizes in DCP1a/b KO and wild-type cells. We agree with the reviewer’s assessment that the representative images suggest a difference in p-body size between DCP1a/b KO cells and wild-type cells, which we initially overlooked. We will revise our manuscript accordingly to include these findings, ensuring that our interpretation of the data aligns with the observed differences.

      Statistical analysis of the Figure 2C results should be included because the difference between the wild-type and Dco1a/b KO cells with GFP-DCP2 looks significantly different but is interpreted in the text as not significant.

      Thank you for pointing out the need for a statistical analysis of the results shown in Figure 2C. We acknowledge that the visual difference between the wild-type and Dco1a/b KO cells with GFP-DCP2 suggests a significant variation, which may not have been clearly communicated in our text. We will conduct the necessary statistical analysis to substantiate the observations made in Figure 2C. Furthermore, we would like to emphasize that our primary focus was to demonstrate that purified DCP2 within cells retains its activity even in the absence of DCP1. This critical point will be highlighted and clarified in the revised version of our manuscript to prevent any misunderstanding.

      Recommendations for improving the writing and presentation:

      Additional context including what is known about the role of dcp1 in decapping from the decades of work in yeast and other model organisms should be incorporated into the introduction and discussion sections.

      Thank you for the reviewer’s suggestion. We will incorporate additional context about the function and significance of DCP1 in decapping processes within our revised manuscript's introduction and discussion sections.

      Details should be provided within the figure legends and methods section on experimental approaches and the number of replicates and statistical analyses used throughout the manuscript. For example, it is not clear whether western blots or RNA-IP experiments were performed more than once as representative images are shown.

      Thank you for the reviewer’s suggestion. In the figure legends and methods section, we will provide more details about the experimental methods, number of replicates, and statistical analyses. Regarding the Western blots and RNA-IP experiments the reviewer mentioned, we performed multiple experiments and presented representative images in the manuscript. We will clarify this in the revised manuscript to eliminate potential confusion.

      The rationale for performing metabolic profiling is not clear.

      We appreciate the reviewer's thoughtful feedback. The rationale behind conducting metabolic profiling in our study is rooted in its efficacy as a valuable tool for deciphering the consequences of specific gene mutations, particularly those closely associated with phenotypic changes or final metabolic pathways. Our objective is to utilize metabolic profiling to unravel the distinct biofunctions of DCP1a and DCP1b. By employing this approach, we aim to gain insights into the intricate metabolic alterations that result from the absence of these genes, thereby enhancing our understanding of their roles in cellular processes. We recognize the necessity of clearly presenting this rationale and promise to bolster the articulation of these points in the revised version of our manuscript to ensure the clarity and transparency of our research motivation.

      Details in the methods section should be included for the CRISPR/Cas9-mediated gene editing validation. The Sangar sequencing results presented in Figure S1b should be explained. The entire western blot(s) should be shown in Figure S1A to give confidence the Dcp1a/b KO cells are not expressing truncated proteins and the epitopes of the antibodies used to detect Dcp1a/b should be described. The northern blot probes should be described and sequences included. The transcriptomics method should be detailed.

      Thank you for your feedback, in the revised manuscript we will detail the CRISPR/Cas9 gene editing validation, explain the Sanger sequencing results in Figure S1b, show the full Western blot in Figure S1A to confirm that the Dcp1a/b knockout cells are not expressing truncated proteins, describe the Northern blot probes used, and detail the transcriptomics method, all to ensure clarity and comprehensiveness in our experimental procedures and results.

      A diagram showing the RNA tethering assays with labels corresponding to all blots/gels should be provided.

      Thank you for your suggestion. We will provide a diagram showing the RNA tethering assays with labels corresponding to all blots/gels in our revised manuscript. This will help readers better understand our experimental design and results.

      The statement, "This suggests that the disruption of the decapping process in DCP1a/b-knockout cells results in the accumulation of unprocessed mRNA intermediates" regarding the results of the RNA-seq assay is not supported by the evidence as RNA-seq does not measure RNA decay intermediates or RNA decay rates.

      Thank you for the reviewer’s comment. We agree with that RNA-seq experiments indeed do not directly measure RNA decay intermediates or RNA decay rates. Our statement could have caused confusion, and we have therefore removed this sentence from the manuscript.

      Minor corrections to the text and figures:

      Figure S6A is uninterpretable as presented.

      Thank you for the reviewer’s valuable feedback. We have taken note and made improvements. We have simplified Figure S6A to enhance its interpretability, hoping that the current version will make it easier for the readers to understand.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      Original comment: There is no explanation for how this work could be a breakthrough in simulation gregarious feeding as is stated in the manuscript.

      Reviewer response: I think I understand where the authors are trying to take this next step. If the authors were to follow up on this study with the proposed implementation of inhalant/exhalent velocities profiles (or more preferably velocity/pressure fields), then that study would be a breakthrough in simulating such gregarious feeding. Based on what has been done within the present study, I think the term "breakthrough" is instead overly emphatic. An additional note on this. The authors are correct that incorporating additional models could be used to simulation a population (as has been successfully done for several Ediacaran taxa despite computational limitations), but it's not the only way. The authors 1 might explore using periodic boundary conditions on the external faces of the flow domain. This could require only a single Olivooid model to assess gregarious impacts - see the abundant literature of modeling flow through solar array fields.

      We appreciate the reviewer 1 for the suggestion. Modeling gregarious feeding via periodic boundary conditions is surely a practical way with limited computational resources. Modeling flow through solar array fields can also be an inspiring case. However, to realism the simulation of gregarious feeding behavior on an uneven seabed and with irregular organism spatial distribution, just using periodic boundary conditions may not be sufficient (see Author response image 1 for a simple example). We will go on exploring the way of realizing the simulations of large-scale gregarious feeding.

      Author response image 1.

      An example of modeling gregarious feeding behavior on an uneven seabed.

      Original comment: The claim that olivooid-type feeding was most likely a prerequisite transitional form to jet-propelled swimming needs much more support or needs to be tailored to olivooids. This suggests that such behavior is absent (or must be convergent) before olivooids, which is at odds with the increasing quantities of pelagic life (whose modes of swimming are admittedly unconstrained) documented from Cambrian and Neoproterozoic deposits. Even among just medusozoans, ancestral 1 state reconstruction suggests that they would have been swimming during the Neoproterozoic (Kayal et al., 2018; BMC Evolutionary Biology) with no knowledge of the mechanics due to absent preservation. Author response: Thanks for your suggestions. Yes, we agree with you that the ancestral swimming medusae may appear before the early Cambrian, even at the Neoproterozoic deposits. However, discussions on the affinities of Ediacaran cnidarians are severely limited because of the lack of information concerning their soft anatomy. So, it is hard to detect the mechanics due to absent preservation. Olivooids found from the basal Cambrian Kuanchuanpu Formation can be reasonably considered as cnidarians based on their radial symmetry, external features, and especially the internal anatomies (Bengtson and Yue 1997; Dong et al. 2013; 2016; Han et al. 2013; 2016; Liu et al. 2014; Wang et al. 2017; 2020; 2022). The valid simulation experiment here was based on the soft tissue preserved in olivooids.

      Reviewer response: This response does not sufficiently address my earlier comment. While the authors are correct that individual Ediacaran affinities are an area of active research and that Olivooids can reasonably be considered cnidarians, this doesn't address the actual critique in my comment. Most (not all) Ediacaran soft-bodied fossils are considered to have been benthic, but pelagic cnidarian life is widely acknowledged to at least be present during later White Sea and Nama assemblages (and earlier depending on molecular clock interpretations). The authors have certainly provided support for the mechanics of this type of feeding being co-opted for eventual jet propulsion swimming in Olivooids. They have not provided sufficient justifications within the manuscript for this to be broadened beyond this group.

      Thanks for your sincere commentary. We of course agree with the possibility of the emergence of swimming cnidarians before the lowermost Cambrian Fortunian Stage. See lines 16-129: “Ediacaran fossil assemblages with complex ecosystems consist of exceptionally preserved soft-bodied eukaryotes of enigmatic morphology, which their affinities are mostly unresolved (Tarhan et al., 2018, Integrative and Comparative Biology, 58 (4), 688–702; Evans et al., 2022, PNAS, 11(46), e220747511).” Undoubtedly Olivooids belong to cnidarians charactered by their external and internal biological structures. Limited by the fossil records, we could only speculate on the transition from the benthic to the swimming of ancestral cnidarians via the valid fossil preservation, e.g. olivooids. The transition may require processes such as increasing body size, thickening the mesoglea, and degenerating the periderm, etc. And these processes may also evolve independently or comprehensively. Moreover, the ecological behaviors of the ancestral cnidarians may evolve independently at different stages from Ediacaran to Cambrian. We therefore could not provide more sufficient justifications beyond olivooids.

      Original comment: L446: two layers of hexahedral elements is a very low number for meshing boundary layer flow

      Reviewer response: As the authors point out in the main text, these organisms are small (millimeters in scale) and certainly lived within the boundary layer range of the ocean. While the boundary layer is not the main point, it still needs to be accurately resolved as it should certainly affect the flow further towards the far field at this scale. I'm not suggesting the authors need to perfectly resolve the boundary layer or focus on using turbulence models more tailored to boundary layer flows (such as k-w), but the flow field still needs sufficient realism for a boundary bounded flow. The authors really should consider quantitatively assessing the number of hexahedral elements within their mesh refinement study.

      To address this concern, we run another four simulations based on mesh4 within our mesh refinement study to assess the number of hexahedral elements (five layers and eight layers of hexahedral elements with different thickness of boundary layer mesh (controlled by thickness adjustment factor), respectively). the results had been supplemented to Table supplement 2. As shown in the results, the number of layers of hexahedral elements seems does not significant influence the result, but the thickness of boundary layer mesh can influence the maximum flow velocity of the contraction phase. However, the results of all the simulations were generally consistent, as shown in Author response image 2. The description of the results above were added to section “Mesh sensitivity analysis”.

      Author response image 2.

      Results of mesh refinement study of different boundary layer mesh parameters.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Overview of reviewer's concerns after peer review: 

      As for the initial submission, the reviewers' unanimous opinion is that the authors should perform additional controls to show that their key findings may not be affected by experimental or analysis artefacts, and clarify key aspects of their core methods, chiefly:  

      (1) The fact that their extremely high decoding accuracy is driven by frequency bands that would reflect the key press movements and that these are located bilaterally in frontal brain regions (with the task being unilateral) are seen as key concerns, 

      The above statement that decoding was driven by bilateral frontal brain regions is not entirely consistent with our results. The confusion was likely caused by the way we originally presented our data in Figure 2. We have revised that figure to make it more clear that decoding performance at both the parcel- (Figure 2B) and voxel-space (Figure 2C) level is predominantly driven by contralateral (as opposed to ipsilateral) sensorimotor regions. Figure 2D, which highlights bilateral sensorimotor and premotor regions, displays accuracy of individual regional voxel-space decoders assessed independently. This was the criteria used to determine which regional voxel-spaces were included in the hybridspace decoder. This result is not surprising given that motor and premotor regions are known to display adaptive interhemispheric interactions during motor sequence learning [1, 2], and particularly so when the skill is performed with the non-dominant hand [3-5]. We now discuss this important detail in the revised manuscript:

      Discussion (lines 348-353)

      “The whole-brain parcel-space decoder likely emphasized more stable activity patterns in contralateral frontoparietal regions that differed between individual finger movements [21,35], while the regional voxel-space decoder likely incorporated information related to adaptive interhemispheric interactions operating during motor sequence learning [32,36,37], particularly pertinent when the skill is performed with the non-dominant hand [38-40].”

      We now also include new control analyses that directly address the potential contribution of movement-related artefact to the results.  These changes are reported in the revised manuscript as follows:

      Results (lines 207-211):

      “An alternate decoder trained on ICA components labeled as movement or physiological artefacts (e.g. – head movement, ECG, eye movements and blinks; Figure 3 – figure supplement 3A, D) and removed from the original input feature set during the pre-processing stage approached chance-level performance (Figure 4 – figure supplement 3), indicating that the 4-class hybrid decoder results were not driven by task-related artefacts.”

      Results (lines 261-268):

      “As expected, the 5-class hybrid-space decoder performance approached chance levels when tested with randomly shuffled keypress labels (18.41%± SD 7.4% for Day 1 data; Figure 4 – figure supplement 3C). Task-related eye movements did not explain these results since an alternate 5-class hybrid decoder constructed from three eye movement features (gaze position at the KeyDown event, gaze position 200ms later, and peak eye movement velocity within this window; Figure 4 – figure supplement 3A) performed at chance levels (cross-validated test accuracy = 0.2181; Figure 4 – figure supplement 3B, C). “

      Discussion (Lines 362-368):

      “Task-related movements—which also express in lower frequency ranges—did not explain these results given the near chance-level performance of alternative decoders trained on (a) artefact-related ICA components removed during MEG preprocessing (Figure 3 – figure supplement 3A-C) and on (b) task-related eye movement features (Figure 4 – figure supplement 3B, C). This explanation is also inconsistent with the minimal average head motion of 1.159 mm (± 1.077 SD) across the MEG recording (Figure 3 – figure supplement 3D).“

      (2) Relatedly, the use of a wide time window (~200 ms) for a 250-330 ms typing speed makes it hard to pinpoint the changes underpinning learning, 

      The revised manuscript now includes analyses carried out with decoding time windows ranging from 50 to 250ms in duration. These additional results are now reported in:

      Results (lines 258-261):

      “The improved decoding accuracy is supported by greater differentiation in neural representations of the index finger keypresses performed at positions 1 and 5 of the sequence (Figure 4A), and by the trial-by-trial increase in 2-class decoding accuracy over early learning (Figure 4C) across different decoder window durations (Figure 4 – figure supplement 2).”

      Results (lines 310-312):

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C).“

      Discussion (lines 382-385):

      “This was further supported by the progressive differentiation of neural representations of the index finger keypress (Figure 4A) and by the robust trial-bytrial increase in 2-class decoding accuracy across time windows ranging between 50 and 250ms (Figure 4C; Figure 4 – figure supplement 2).”

      Discussion (lines 408-9):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1).”

      (3) These concerns make it hard to conclude from their data that learning is mediated by "contextualisation" ---a key claim in the manuscript; 

      We believe the revised manuscript now addresses all concerns raised in Editor points 1 and 2.

      (4) The hybrid voxel + parcel space decoder ---a key contribution of the paper--- is not clearly explained; 

      We now provide additional details regarding the hybrid-space decoder approach in the following sections of the revised manuscript:

      Results (lines 158-172):

      “Next, given that the brain simultaneously processes information more efficiently across multiple spatial and temporal scales [28, 32, 33], we asked if the combination of lower resolution whole-brain and higher resolution regional brain activity patterns further improve keypress prediction accuracy. We constructed hybrid-space decoders (N = 1295 ± 20 features; Figure 3A) combining whole-brain parcel-space activity (n = 148 features; Figure 2B) with regional voxel-space activity from a datadriven subset of brain areas (n = 1147 ± 20 features; Figure 2D). This subset covers brain regions showing the highest regional voxel-space decoding performances (top regions across all subjects shown in Figure 2D; Methods – Hybrid Spatial Approach). 

      […]

      Note that while features from contralateral brain regions were more important for whole-brain decoding (in both parcel- and voxel-spaces), regional voxel-space decoders performed best for bilateral sensorimotor areas on average across the group. Thus, a multi-scale hybrid-space representation best characterizes the keypress action manifolds.”

      Results (lines 275-282):

      “We used a Euclidian distance measure to evaluate the differentiation of the neural representation manifold of the same action (i.e. - an index-finger keypress) executed within different local sequence contexts (i.e. - ordinal position 1 vs. ordinal position 5; Figure 5). To make these distance measures comparable across participants, a new set of classifiers was then trained with group-optimal parameters (i.e. – broadband hybrid-space MEG data with subsequent manifold extraction (Figure 3 – figure supplements 2) and LDA classifiers (Figure 3 – figure supplements 7) trained on 200ms duration windows aligned to the KeyDown event (see Methods, Figure 3 – figure supplements 5). “

      Discussion (lines 341-360):

      “The initial phase of the study focused on optimizing the accuracy of decoding individual finger keypresses from MEG brain activity. Recent work showed that the brain simultaneously processes information more efficiently across multiple—rather than a single—spatial scale(s) [28, 32]. To this effect, we developed a novel hybridspace approach designed to integrate neural representation dynamics over two different spatial scales: (1) whole-brain parcel-space (i.e. – spatial activity patterns across all cortical brain regions) and (2) regional voxel-space (i.e. – spatial activity patterns within select brain regions) activity. We found consistent spatial differences between whole-brain parcel-space feature importance (predominantly contralateral frontoparietal, Figure 2B) and regional voxel-space decoder accuracy (bilateral sensorimotor regions, Figure 2D). The whole-brain parcel-space decoder likely emphasized more stable activity patterns in contralateral frontoparietal regions that differed between individual finger movements [21, 35], while the regional voxelspace decoder likely incorporated information related to adaptive interhemispheric interactions operating during motor sequence learning [32, 36, 37], particularly pertinent when the skill is performed with the non-dominant hand [38-40]. The observation of increased cross-validated test accuracy (as shown in Figure 3 – Figure Supplement 6) indicates that the spatially overlapping information in parcel- and voxel-space time-series in the hybrid decoder was complementary, rather than redundant [41].  The hybrid-space decoder which achieved an accuracy exceeding 90%—and robustly generalized to Day 2 across trained and untrained sequences— surpassed the performance of both parcel-space and voxel-space decoders and compared favorably to other neuroimaging-based finger movement decoding strategies [6, 24, 42-44].”

      Methods (lines 636-647):

      “Hybrid Spatial Approach.  First, we evaluated the decoding performance of each individual brain region in accurately labeling finger keypresses from regional voxelspace (i.e. - all voxels within a brain region as defined by the Desikan-Killiany Atlas) activity. Brain regions were then ranked from 1 to 148 based on their decoding accuracy at the group level. In a stepwise manner, we then constructed a “hybridspace” decoder by incrementally concatenating regional voxel-space activity of brain regions—starting with the top-ranked region—with whole-brain parcel-level features and assessed decoding accuracy. Subsequently, we added the regional voxel-space features of the second-ranked brain region and continued this process until decoding accuracy reached saturation. The optimal “hybrid-space” input feature set over the group included the 148 parcel-space features and regional voxelspace features from a total of 8 brain regions (bilateral superior frontal, middle frontal, pre-central and post-central; N = 1295 ± 20 features).”

      (5) More controls are needed to show that their decoder approach is capturing a neural representation dedicated to context rather than independent representations of consecutive keypresses; 

      These controls have been implemented and are now reported in the manuscript:

      Results (lines 318-328):

      “Within-subject correlations were consistent with these group-level findings. The average correlation between offline contextualization and micro-offline gains within individuals was significantly greater than zero (Figure 5 – figure supplement 4, left; t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76) and stronger than correlations between online contextualization and either micro-online (Figure 5 – figure supplement 4, middle; t = 3.28, p = 0.0015, df = 25, Cohen's d = 1.2) or micro-offline gains (Figure 5 – figure supplement 4, right; t = 3.7021, p = 5.3013e-04, df = 25, Cohen's d = 0.69). These findings were not explained by behavioral changes of typing rhythm (t = -0.03, p = 0.976; Figure 5 – figure supplement 5), adjacent keypress transition times (R2 = 0.00507, F[1,3202] = 16.3; Figure 5 – figure supplement 6), or overall typing speed (between-subject; R2 = 0.028, p \= 0.41; Figure 5 – figure supplement 7).”

      Results (lines 385-390):

      “Further, the 5-class classifier—which directly incorporated information about the sequence location context of each keypress into the decoding pipeline—improved decoding accuracy relative to the 4-class classifier (Figure 4C). Importantly, testing on Day 2 revealed specificity of this representational differentiation for the trained skill but not for the same keypresses performed during various unpracticed control sequences (Figure 5C).”

      Discussion (lines 408-423):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1). This result remained unchanged when measuring offline contextualization between the last and second sequence of consecutive trials, inconsistent with a possible confounding effect of pre-planning [30] (Figure 5 – figure supplement 2A). On the other hand, online contextualization did not predict learning (Figure 5 – figure supplement 3). Consistent with these results the average within-subject correlation between offline contextualization and micro-offline gains was significantly stronger than withinsubject correlations between online contextualization and either micro-online or micro-offline gains (Figure 5 – figure supplement 4). 

      Offline contextualization was not driven by trial-by-trial behavioral differences, including typing rhythm (Figure 5 – figure supplement 5) and adjacent keypress transition times (Figure 5 – figure supplement 6) nor by between-subject differences in overall typing speed (Figure 5 – figure supplement 7)—ruling out a reliance on differences in the temporal overlap of keypresses. Importantly, offline contextualization documented on Day 1 stabilized once a performance plateau was reached (trials 11-36), and was retained on Day 2, documenting overnight consolidation of the differentiated neural representations.”

      (6) The need to show more convincingly that their data is not affected by head movements, e.g., by regressing out signal components that are correlated with the fiducial signal;  

      We now include data in Figure 3 – figure supplement 3D showing that head movement was minimal in all participants (mean of 1.159 mm ± 1.077 SD).  Further, the requested additional control analyses have been carried out and are reported in the revised manuscript:

      Results (lines 204-211):

      “Testing the keypress state (4-class) hybrid decoder performance on Day 1 after randomly shupling keypress labels for held-out test data resulted in a performance drop approaching expected chance levels (22.12%± SD 9.1%; Figure 3 – figure supplement 3C). An alternate decoder trained on ICA components labeled as movement or physiological artefacts (e.g. – head movement, ECG, eye movements and blinks; Figure 3 – figure supplement 3A, D) and removed from the original input feature set during the pre-processing stage approached chance-level performance (Figure 4 – figure supplement 3), indicating that the 4-class hybrid decoder results were not driven by task-related artefacts.” Results (lines 261-268):

      “As expected, the 5-class hybrid-space decoder performance approached chance levels when tested with randomly shuffled keypress labels (18.41%± SD 7.4% for Day 1 data; Figure 4 – figure supplement 3C). Task-related eye movements did not explain these results since an alternate 5-class hybrid decoder constructed from three eye movement features (gaze position at the KeyDown event, gaze position 200ms later, and peak eye movement velocity within this window; Figure 4 – figure supplement 3A) performed at chance levels (cross-validated test accuracy = 0.2181; Figure 4 – figure supplement 3B, C). “

      Discussion (Lines 362-368):

      “Task-related movements—which also express in lower frequency ranges—did not explain these results given the near chance-level performance of alternative decoders trained on (a) artefact-related ICA components removed during MEG preprocessing (Figure 3 – figure supplement 3A-C) and on (b) task-related eye movement features (Figure 4 – figure supplement 3B, C). This explanation is also inconsistent with the minimal average head motion of 1.159 mm (± 1.077 SD) across the MEG recording (Figure 3 – figure supplement 3D). “

      (7) The offline neural representation analysis as executed is a bit odd, since it seems to be based on comparing the last key press to the first key press of the next sequence, rather than focus on the inter-sequence interval

      While we previously evaluated replay of skill sequences during rest intervals, identification of how offline reactivation patterns of a single keypress state representation evolve with learning presents non-trivial challenges. First, replay events tend to occur in clusters with irregular temporal spacing as previously shown by our group and others.  Second, replay of experienced sequences is intermixed with replay of sequences that have never been experienced but are possible. Finally, and perhaps the most significant issue, replay is temporally compressed up to 20x with respect to the behavior [6]. That means our decoders would need to accurately evaluate spatial pattern changes related to individual keypresses over much smaller time windows (i.e. - less than 10 ms) than evaluated here. This future work, which is undoubtably of great interest to our research group, will require more substantial tool development before we can apply them to this question. We now articulate this future direction in the Discussion:

      Discussion (lines 423-427):

      “A possible neural mechanism supporting contextualization could be the emergence and stabilization of conjunctive “what–where” representations of procedural memories [64] with the corresponding modulation of neuronal population dynamics [65, 66] during early learning. Exploring the link between contextualization and neural replay could provide additional insights into this issue [6, 12, 13, 15].”

      (8) And this analysis could be confounded by the fact that they are comparing the last element in a sequence vs the first movement in a new one. 

      We have now addressed this control analysis in the revised manuscript:

      Results (Lines 310-316)

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches).”

      Discussion (lines 408-416):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1). This result remained unchanged when measuring offline contextualization between the last and second sequence of consecutive trials, inconsistent with a possible confounding effect of pre-planning [30] (Figure 5 – figure supplement 2A). On the other hand, online contextualization did not predict learning (Figure 5 – figure supplement 3). Consistent with these results the average within-subject correlation between offline contextualization and micro-offline gains was significantly stronger than within-subject correlations between online contextualization and either micro-online or micro-offline gains (Figure 5 – figure supplement 4).”

      It also seems to be the case that many analyses suggested by the reviewers in the first round of revisions that could have helped strengthen the manuscript have not been included (they are only in the rebuttal). Moreover, some of the control analyses mentioned in the rebuttal seem not to be described anywhere, neither in the manuscript, nor in the rebuttal itself; please double check that. 

      All suggested analyses carried out and mentioned are now in the revised manuscript.

      eLife Assessment 

      This valuable study investigates how the neural representation of individual finger movements changes during the early period of sequence learning. By combining a new method for extracting features from human magnetoencephalography data and decoding analyses, the authors provide incomplete evidence of an early, swift change in the brain regions correlated with sequence learning…

      We have now included all the requested control analyses supporting “an early, swift change in the brain regions correlated with sequence learning”:

      The addition of more control analyses to rule out that head movement artefacts influence the findings, 

      We now include data in Figure 3 – figure supplement 3D showing that head movement was minimal in all participants (mean of 1.159 mm ± 1.077 SD).  Further, we have implemented the requested additional control analyses addressing this issue:

      Results (lines 207-211):

      “An alternate decoder trained on ICA components labeled as movement or physiological artefacts (e.g. – head movement, ECG, eye movements and blinks; Figure 3 – figure supplement 3A, D) and removed from the original input feature set during the pre-processing stage approached chance-level performance (Figure 4 – figure supplement 3), indicating that the 4-class hybrid decoder results were not driven by task-related artefacts.”

      Results (lines 261-268):

      “As expected, the 5-class hybrid-space decoder performance approached chance levels when tested with randomly shuffled keypress labels (18.41%± SD 7.4% for Day 1 data; Figure 4 – figure supplement 3C). Task-related eye movements did not explain these results since an alternate 5-class hybrid decoder constructed from three eye movement features (gaze position at the KeyDown event, gaze position 200ms later, and peak eye movement velocity within this window; Figure 4 – figure supplement 3A) performed at chance levels (cross-validated test accuracy = 0.2181; Figure 4 – figure supplement 3B, C). “

      Discussion (Lines 362-368):

      “Task-related movements—which also express in lower frequency ranges—did not explain these results given the near chance-level performance of alternative decoders trained on (a) artefact-related ICA components removed during MEG preprocessing (Figure 3 – figure supplement 3A-C) and on (b) task-related eye movement features (Figure 4 – figure supplement 3B, C). This explanation is also inconsistent with the minimal average head motion of 1.159 mm (± 1.077 SD) across the MEG recording (Figure 3 – figure supplement 3D).“

      and to further explain the proposal of offline contextualization during short rest periods as the basis for improvement performance would strengthen the manuscript. 

      We have edited the manuscript to clarify that the degree of representational differentiation (contextualization) parallels skill learning.  We have no evidence at this point to indicate that “offline contextualization during short rest periods is the basis for improvement in performance”.  The following areas of the revised manuscript now clarify this point:  

      Summary (Lines 455-458):

      “In summary, individual sequence action representations contextualize during early learning of a new skill and the degree of differentiation parallels skill gains. Differentiation of the neural representations developed during rest intervals of early learning to a larger extent than during practice in parallel with rapid consolidation of skill.”

      Additional control analyses are also provided supporting a link between offline contextualization and early learning:

      Results (lines 302-318):

      “The Euclidian distance between neural representations of Index<sub>OP1</sub> (i.e. - index finger keypress at ordinal position 1 of the sequence) and Index<sub>OP5</sub> (i.e. - index finger keypress at ordinal position 5 of the sequence) increased progressively during early learning (Figure 5A)—predominantly during rest intervals (offline contextualization) rather than during practice (online) (t = 4.84, p < 0.001, df = 25, Cohen's d = 1.2; Figure 5B; Figure 5 – figure supplement 1A). An alternative online contextualization determination equaling the time interval between online and offline comparisons (Trial-based; 10 seconds between Index<sub>OP1</sub> and Index<sub>OP5</sub> observations in both cases) rendered a similar result (Figure 5 – figure supplement 2B).

      Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches). Conversely, online contextualization (using either measurement approach) did not explain early online learning gains (i.e. – Figure 5 – figure supplement 3).”  

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      This study addresses the issue of rapid skill learning and whether individual sequence elements (here: finger presses) are differentially represented in human MEG data. The authors use a decoding approach to classify individual finger elements and accomplish an accuracy of around 94%. A relevant finding is that the neural representations of individual finger elements dynamically change over the course of learning. This would be highly relevant for any attempts to develop better brain machine interfaces - one now can decode individual elements within a sequence with high precision, but these representations are not static but develop over the course of learning. 

      Strengths: 

      The work follows a large body of work from the same group on the behavioural and neural foundations of sequence learning. The behavioural task is well established a neatly designed to allow for tracking learning and how individual sequence elements contribute. The inclusion of short offline rest periods between learning epochs has been influential because it has revealed that a lot, if not most of the gains in behaviour (ie speed of finger movements) occur in these so-called micro-offline rest periods. 

      The authors use a range of new decoding techniques, and exhaustively interrogate their data in different ways, using different decoding approaches. Regardless of the approach, impressively high decoding accuracies are observed, but when using a hybrid approach that combines the MEG data in different ways, the authors observe decoding accuracies of individual sequence elements from the MEG data of up to 94%. 

      Weaknesses:  

      A formal analysis and quantification of how head movement may have contributed to the results should be included in the paper or supplemental material. The type of correlated head movements coming from vigorous key presses aren't necessarily visible to the naked eye, and even if arms etc are restricted, this will not preclude shoulder, neck or head movement necessarily; if ICA was conducted, for example, the authors are in the position to show the components that relate to such movement; but eye-balling the data would not seem sufficient. The related issue of eye movements is addressed via classifier analysis. A formal analysis which directly accounts for finger/eye movements in the same analysis as the main result (ie any variance related to these factors) should be presented.

      We now present additional data related to head (Figure 3 – figure supplement 3; note that average measured head movement across participants was 1.159 mm ± 1.077 SD) and eye movements (Figure 4 – figure supplement 3) and have implemented the requested control analyses addressing this issue. They are reported in the revised manuscript in the following locations: Results (lines 207-211), Results (lines 261-268), Discussion (Lines 362-368).

      This reviewer recommends inclusion of a formal analysis that the intra-vs inter parcels are indeed completely independent. For example, the authors state that the inter-parcel features reflect "lower spatially resolved whole-brain activity patterns or global brain dynamics". A formal quantitative demonstration that the signals indeed show "complete independence" (as claimed by the authors) and are orthogonal would be helpful.

      Please note that we never claim in the manuscript that the parcel-space and regional voxelspace features show “complete independence”.  More importantly, input feature orthogonality is not a requirement for the machine learning-based decoding methods utilized in the present study while non-redundancy is [7] (a requirement satisfied by our data, see below). Finally, our results show that the hybrid space decoder out-performed all other methods even after input features were fully orthogonalized with LDA (the procedure used in all contextualization analyses) or PCA dimensionality reduction procedures prior to the classification step (Figure 3 – figure supplement 2).

      Relevant to this issue, please note that if spatially overlapping parcel- and voxel-space timeseries only provided redundant information, inclusion of both as input features should increase model over-fitting to the training dataset and decrease overall cross-validated test accuracy [8]. In the present study however, we see the opposite effect on decoder performance. First, Figure 3 – figure supplement 1 & 2 clearly show that decoders constructed from hybrid-space features outperform the other input feature (sensor-, wholebrain parcel- and whole-brain voxel-) spaces in every case (e.g. – wideband, all narrowband frequency ranges, and even after the input space is fully orthogonalized through dimensionality reduction procedures prior to the decoding step). Furthermore, Figure 3 – figure supplement 6 shows that hybrid-space decoder performance supers when parceltime series that spatially overlap with the included regional voxel-spaces are removed from the input feature set. 

      We state in the Discussion (lines 353-356)

      “The observation of increased cross-validated test accuracy (as shown in Figure 3 – Figure Supplement 6) indicates that the spatially overlapping information in parcel- and voxel-space time-series in the hybrid decoder was complementary, rather than redundant [41].”

      To gain insight into the complimentary information contributed by the two spatial scales to the hybrid-space decoder, we first independently computed the matrix rank for whole-brain parcel- and voxel-space input features for each participant (shown in Author response image 1). The results indicate that whole-brain parcel-space input features are full rank (rank = 148) for all participants (i.e. - MEG activity is orthogonal between all parcels). The matrix rank of voxelspace input features (rank = 267± 17 SD), exceeded the parcel-space rank for all participants and approached the number of useable MEG sensor channels (n = 272). Thus, voxel-space features provide both additional and complimentary information to representations at the parcel-space scale.  

      Author response image 1.

      Matrix rank computed for whole-brain parcel- and voxel-space time-series in individual subjects across the training run. The results indicate that whole-brain parcel-space input features are full rank (rank = 148) for all participants (i.e. - MEG activity is orthogonal between all parcels). The matrix rank of voxel-space input features (rank = 267 ± 17 SD), on the other hand, approached the number of useable MEG sensor channels (n = 272). Although not full rank, the voxel-space rank exceeded the parcel-space rank for all participants. Thus, some voxel-space features provide additional orthogonal information to representations at the parcel-space scale.  An expression of this is shown in the correlation distribution between parcel and constituent voxel time-series in Figure 2—figure Supplement 2.

      Figure 2—figure Supplement 2 in the revised manuscript now shows that the degree of dependence between the two spatial scales varies over the regional voxel-space. That is, some voxels within a given parcel correlate strongly with the time-series of the parcel they belong to, while others do not. This finding is consistent with a documented increase in correlational structure of neural activity across spatial scales that does not reflect perfect dependency or orthogonality [9]. Notably, the regional voxel-spaces included in the hybridspace decoder are significantly less correlated with the averaged parcel-space time-series than excluded voxels. We now point readers to this new figure in the results.

      Taken together, these results indicate that the multi-scale information in the hybrid feature set is complimentary rather than orthogonal.  This is consistent with the idea that hybridspace features better represent multi-scale temporospatial dynamics reported to be a fundamental characteristic of how the brain stores and adapts memories, and generates behavior across species [9].  

      Reviewer #2 (Public review): 

      Summary: 

      The current paper consists of two parts. The first part is the rigorous feature optimization of the MEG signal to decode individual finger identity performed in a sequence (4-1-3-2-4; 1~4 corresponds to little~index fingers of the left hand). By optimizing various parameters for the MEG signal, in terms of (i) reconstructed source activity in voxel- and parcel-level resolution and their combination, (ii) frequency bands, and (iii) time window relative to press onset for each finger movement, as well as the choice of decoders, the resultant "hybrid decoder" achieved extremely high decoding accuracy (~95%). This part seems driven almost by pure engineering interest in gaining as high decoding accuracy as possible. 

      In the second part of the paper, armed with the successful 'hybrid decoder,' the authors asked more scientific questions about how neural representation of individual finger movement that is embedded in a sequence, changes during a very early period of skill learning and whether and how such representational change can predict skill learning. They assessed the difference in MEG feature patterns between the first and the last press 4 in sequence 41324 at each training trial and found that the pattern differentiation progressively increased over the course of early learning trials. Additionally, they found that this pattern differentiation specifically occurred during the rest period rather than during the practice trial. With a significant correlation between the trial-by-trial profile of this pattern differentiation and that for accumulation of offline learning, the authors argue that such "contextualization" of finger movement in a sequence (e.g., what-where association) underlies the early improvement of sequential skill. This is an important and timely topic for the field of motor learning and beyond. 

      Strengths: 

      Each part has its own strength. For the first part, the use of temporally rich neural information (MEG signal) has a significant advantage over previous studies testing sequential representations using fMRI. This allowed the authors to examine the earliest period (= the first few minutes of training) of skill learning with finer temporal resolution. Through the optimization of MEG feature extraction, the current study achieved extremely high decoding accuracy (approx. 94%) compared to previous works. For the second part, the finding of the early "contextualization" of the finger movement in a sequence and its correlation to early (offline) skill improvement is interesting and important. The comparison between "online" and "offline" pattern distance is a neat idea. 

      Weaknesses: 

      Despite the strengths raised, the specific goal for each part of the current paper, i.e., achieving high decoding accuracy and answering the scientific question of early skill learning, seems not to harmonize with each other very well. In short, the current approach, which is solely optimized for achieving high decoding accuracy, does not provide enough support and interpretability for the paper's interesting scientific claim. This reminds me of the accuracy-explainability tradeoff in machine learning studies (e.g., Linardatos et al., 2020). More details follow. 

      There are a number of different neural processes occurring before and after a key press, such as planning of upcoming movement and ahead around premotor/parietal cortices, motor command generation in primary motor cortex, sensory feedback related processes in sensory cortices, and performance monitoring/evaluation around the prefrontal area. Some of these may show learning-dependent change and others may not.  

      In this paper, the focus as stated in the Introduction was to evaluate “the millisecond-level differentiation of discrete action representations during learning”, a proposal that first required the development of more accurate computational tools.  Our first step, reported here, was to develop that tool. With that in hand, we then proceeded to test if neural representations differentiated during early skill learning. Our results showed they did.  Addressing the question the Reviewer asks is part of exciting future work, now possible based on the results presented in this paper.  We acknowledge this issue in the revised Discussion:  

      Discussion (Lines 428-434):

      “In this study, classifiers were trained on MEG activity recorded during or immediately after each keypress, emphasizing neural representations related to action execution, memory consolidation and recall over those related to planning. An important direction for future research is determining whether separate decoders can be developed to distinguish the representations or networks separately supporting these processes. Ongoing work in our lab is addressing this question. The present accuracy results across varied decoding window durations and alignment with each keypress action support the feasibility of this approach (Figure 3—figure supplement 5).”

      Given the use of whole-brain MEG features with a wide time window (up to ~200 ms after each key press) under the situation of 3~4 Hz (i.e., 250~330 ms press interval) typing speed, these different processes in different brain regions could have contributed to the expression of the "contextualization," making it difficult to interpret what really contributed to the "contextualization" and whether it is learning related. Critically, the majority of data used for decoder training has the chance of such potential overlap of signal, as the typing speed almost reached a plateau already at the end of the 11th trial and stayed until the 36th trial. Thus, the decoder could have relied on such overlapping features related to the future presses. If that is the case, a gradual increase in "contextualization" (pattern separation) during earlier trials makes sense, simply because the temporal overlap of the MEG feature was insufficient for the earlier trials due to slower typing speed.  Several direct ways to address the above concern, at the cost of decoding accuracy to some degree, would be either using the shorter temporal window for the MEG feature or training the model with the early learning period data only (trials 1 through 11) to see if the main results are unaffected would be some example. 

      We now include additional analyses carried out with decoding time windows ranging from 50 to 250ms in duration, which have been added to the revised manuscript as follows: 

      Results (lines 258-261):

      “The improved decoding accuracy is supported by greater differentiation in neural representations of the index finger keypresses performed at positions 1 and 5 of the sequence (Figure 4A), and by the trial-by-trial increase in 2-class decoding accuracy over early learning (Figure 4C) across different decoder window durations (Figure 4 – figure supplement 2).”

      Results (lines 310-312):

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C).“

      Discussion (lines 382-385):

      “This was further supported by the progressive differentiation of neural representations of the index finger keypress (Figure 4A) and by the robust trial-by trial increase in 2-class decoding accuracy across time windows ranging between 50 and 250ms (Figure 4C; Figure 4 – figure supplement 2).”

      Discussion (lines 408-9):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1).”

      Several new control analyses are also provided addressing the question of overlapping keypresses:

      Reviewer #3 (Public review):

      Summary: 

      One goal of this paper is to introduce a new approach for highly accurate decoding of finger movements from human magnetoencephalography data via dimension reduction of a "multi-scale, hybrid" feature space. Following this decoding approach, the authors aim to show that early skill learning involves "contextualization" of the neural coding of individual movements, relative to their position in a sequence of consecutive movements.

      Furthermore, they aim to show that this "contextualization" develops primarily during short rest periods interspersed with skill training and correlates with a performance metric which the authors interpret as an indicator of offline learning. 

      Strengths: 

      A strength of the paper is the innovative decoding approach, which achieves impressive decoding accuracies via dimension reduction of a "multi-scale, hybrid space". This hybridspace approach follows the neurobiologically plausible idea of concurrent distribution of neural coding across local circuits as well as large-scale networks. A further strength of the study is the large number of tested dimension reduction techniques and classifiers. 

      Weaknesses: 

      A clear weakness of the paper lies in the authors' conclusions regarding "contextualization". Several potential confounds, which partly arise from the experimental design (mainly the use of a single sequence) and which are described below, question the neurobiological implications proposed by the authors and provide a simpler explanation of the results. Furthermore, the paper follows the assumption that short breaks result in offline skill learning, while recent evidence, described below, casts doubt on this assumption.  

      Please, see below for detailed response to each of these points.

      Specifically: The authors interpret the ordinal position information captured by their decoding approach as a reflection of neural coding dedicated to the local context of a movement (Figure 4). One way to dissociate ordinal position information from information about the moving effectors is to train a classifier on one sequence and test the classifier on other sequences that require the same movements, but in different positions (Kornysheva et al., Neuron 2019). In the present study, however, participants trained to repeat a single sequence (4-1-3-2-4).

      A crucial difference between our present study and the elegant study from Kornysheva et al. (2019) in Neuron highlighted by the Reviewer is that while ours is a learning study, the Kornysheva et al. study is not. Kornysheva et al. included an initial separate behavioral training session (i.e. – performed outside of the MEG) during which participants learned associations between fractal image patterns and different keypress sequences. Then in a separate, later MEG session—after the stimulus-response associations had been already learned in the first session—participants were tasked with recalling the learned sequences in response to a presented visual cue (i.e. – the paired fractal pattern). 

      Our rationale for not including multiple sequences in the same Day 1 training session of our study design was that it would lead to prominent interference effects, as widely reported in the literature [10-12].  Thus, while we had to take the issue of interference into consideration for our design, the Kornysheva et al. study did not. While Kornysheva et al. aimed to “dissociate ordinal position information from information about the moving effectors”, we tested various untrained sequences on Day 2 allowing us to determine that the contextualization result was specific to the trained sequence. By using this approach, we avoided interference effects on the learning of the primary skill caused by simultaneous acquisition of a second skill.

      The revised manuscript states our findings related to the Day 2 Control data in the following locations:

      Results (lines 117-122):

      “On the following day, participants were retested on performance of the same sequence (4-1-3-2-4) over 9 trials (Day 2 Retest), as well as on the single-trial performance of 9 different untrained control sequences (Day 2 Controls: 2-1-3-4-2, 4-2-4-3-1, 3-4-2-3-1, 1-4-3-4-2, 3-2-4-3-1, 1-4-2-3-1, 3-2-4-2-1, 3-2-1-4-2, and 4-23-1-4). As expected, an upward shift in performance of the trained sequence (0.68 ± SD 0.56 keypresses/s; t = 7.21, p < 0.001) was observed during Day 2 Retest, indicative of an overnight skill consolidation effect (Figure 1 – figure supplement 1A).”

      Results (lines 212-219):

      “Utilizing the highest performing decoders that included LDA-based manifold extraction, we assessed the robustness of hybrid-space decoding over multiple sessions by applying it to data collected on the following day during the Day 2 Retest (9-trial retest of the trained sequence) and Day 2 Control (single-trial performance of 9 different untrained sequences) blocks. The decoding accuracy for Day 2 MEG data remained high (87.11% ± SD 8.54% for the trained sequence during Retest, and 79.44% ± SD 5.54% for the untrained Control sequences; Figure 3 – figure supplement 4). Thus, index finger classifiers constructed using the hybrid decoding approach robustly generalized from Day 1 to Day 2 across trained and untrained keypress sequences.”

      Results (lines 269-273):

      “On Day 2, incorporating contextual information into the hybrid-space decoder enhanced classification accuracy for the trained sequence only (improving from 87.11% for 4-class to 90.22% for 5-class), while performing at or below-chance levels for the Control sequences (≤ 30.22% ± SD 0.44%). Thus, the accuracy improvements resulting from inclusion of contextual information in the decoding framework was specific for the trained skill sequence.”

      As a result, ordinal position information is potentially confounded by the fixed finger transitions around each of the two critical positions (first and fifth press). Across consecutive correct sequences, the first keypress in a given sequence was always preceded by a movement of the index finger (=last movement of the preceding sequence), and followed by a little finger movement. The last keypress, on the other hand, was always preceded by a ring finger movement, and followed by an index finger movement (=first movement of the next sequence). Figure 4 - supplement 2 shows that finger identity can be decoded with high accuracy (>70%) across a large time window around the time of the keypress, up to at least +/-100 ms (and likely beyond, given that decoding accuracy is still high at the boundaries of the window depicted in that figure). This time window approaches the keypress transition times in this study. Given that distinct finger transitions characterized the first and fifth keypress, the classifier could thus rely on persistent (or "lingering") information from the preceding finger movement, and/or "preparatory" information about the subsequent finger movement, in order to dissociate the first and fifth keypress. 

      Currently, the manuscript provides little evidence that the context information captured by the decoding approach is more than a by-product of temporally extended, and therefore overlapping, but independent neural representations of consecutive keypresses that are executed in close temporal proximity - rather than a neural representation dedicated to context. 

      During the review process, the authors pointed out that a "mixing" of temporally overlapping information from consecutive keypresses, as described above, should result in systematic misclassifications and therefore be detectable in the confusion matrices in Figures 3C and 4B, which indeed do not provide any evidence that consecutive keypresses are systematically confused. However, such absence of evidence (of systematic misclassification) should be interpreted with caution, and, of course, provides no evidence of absence. The authors also pointed out that such "mixing" would hamper the discriminability of the two ordinal positions of the index finger, given that "ordinal position 5" is systematically followed by "ordinal position 1". This is a valid point which, however, cannot rule out that "contextualization" nevertheless reflects the described "mixing".

      The revised manuscript contains several control analyses which rule out this potential confound.

      Results (lines 318-328):

      “Within-subject correlations were consistent with these group-level findings. The average correlation between offline contextualization and micro-offline gains within individuals was significantly greater than zero (Figure 5 – figure supplement 4, left; t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76) and stronger than correlations between online contextualization and either micro-online (Figure 5 – figure supplement 4, middle; t = 3.28, p = 0.0015, df = 25, Cohen's d = 1.2) or micro-offline gains (Figure 5 – figure supplement 4, right; t = 3.7021, p = 5.3013e-04, df = 25, Cohen's d = 0.69). These findings were not explained by behavioral changes of typing rhythm (t = -0.03, p = 0.976; Figure 5 – figure supplement 5), adjacent keypress transition times (R<sup>2</sup> = 0.00507, F[1,3202] = 16.3; Figure 5 – figure supplement 6), or overall typing speed (between-subject; R<sup>2</sup> = 0.028, p \= 0.41; Figure 5 – figure supplement 7).”

      Results (lines 385-390):

      “Further, the 5-class classifier—which directly incorporated information about the sequence location context of each keypress into the decoding pipeline—improved decoding accuracy relative to the 4-class classifier (Figure 4C). Importantly, testing on Day 2 revealed specificity of this representational differentiation for the trained skill but not for the same keypresses performed during various unpracticed control sequences (Figure 5C).”

      Discussion (lines 408-423):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1). This result remained unchanged when measuring offline contextualization between the last and second sequence of consecutive trials, inconsistent with a possible confounding effect of pre-planning [30] (Figure 5 – figure supplement 2A). On the other hand, online contextualization did not predict learning (Figure 5 – figure supplement 3). Consistent with these results the average within-subject correlation between offline contextualization and micro-offline gains was significantly stronger than within subject correlations between online contextualization and either micro-online or micro-offline gains (Figure 5 – figure supplement 4). 

      Offline contextualization was not driven by trial-by-trial behavioral differences, including typing rhythm (Figure 5 – figure supplement 5) and adjacent keypress transition times (Figure 5 – figure supplement 6) nor by between-subject differences in overall typing speed (Figure 5 – figure supplement 7)—ruling out a reliance on differences in the temporal overlap of keypresses. Importantly, offline contextualization documented on Day 1 stabilized once a performance plateau was reached (trials 11-36), and was retained on Day 2, documenting overnight consolidation of the differentiated neural representations.”

      During the review process, the authors responded to my concern that training of a single sequence introduces the potential confound of "mixing" described above, which could have been avoided by training on several sequences, as in Kornysheva et al. (Neuron 2019), by arguing that Day 2 in their study did include control sequences. However, the authors' findings regarding these control sequences are fundamentally different from the findings in Kornysheva et al. (2019), and do not provide any indication of effector-independent ordinal information in the described contextualization - but, actually, the contrary. In Kornysheva et al. (Neuron 2019), ordinal, or positional, information refers purely to the rank of a movement in a sequence. In line with the idea of competitive queuing, Kornysheva et al. (2019) have shown that humans prepare for a motor sequence via a simultaneous representation of several of the upcoming movements, weighted by their rank in the sequence. Importantly, they could show that this gradient carries information that is largely devoid of information about the order of specific effectors involved in a sequence, or their timing, in line with competitive queuing. They showed this by training a classifier to discriminate between the five consecutive movements that constituted one specific sequence of finger movements (five classes: 1st, 2nd, 3rd, 4th, 5th movement in the sequence) and then testing whether that classifier could identify the rank (1st, 2nd, 3rd, etc) of movements in another sequence, in which the fingers moved in a different order, and with different timings. Importantly, this approach demonstrated that the graded representations observed during preparation were largely maintained after this cross decoding, indicating that the sequence was represented via ordinal position information that was largely devoid of information about the specific effectors or timings involved in sequence execution. This result differs completely from the findings in the current manuscript. Dash et al. report a drop in detected ordinal position information (degree of contextualization in figure 5C) when testing for contextualization in their novel, untrained sequences on Day 2, indicating that context and ordinal information as defined in Dash et al. is not at all devoid of information about the specific effectors involved in a sequence. In this regard, a main concern in my public review, as well as the second reviewer's public review, is that Dash et al. cannot tell apart, by design, whether there is truly contextualization in the neural representation of a sequence (which they claim), or whether their results regarding "contextualization" are explained by what they call "mixing" in their author response, i.e., an overlap of representations of consecutive movements, as suggested as an alternative explanation by Reviewer 2 and myself.

      Again, as stated in response to a related comment by the Reviewer above, it is not surprising that our results differ from the study by Kornysheva et al. (2019) . A crucial difference between the studies that the Reviewer fails to recognize is that while ours is a learning study, the Kornysheva et al. study is not. Our rationale for not including multiple sequences in the same Day 1 training session of our study design was that it would lead to prominent interference effects, as widely reported in the literature [10-12].  Thus, while we had to take the issue of interference into consideration for our design, the Kornysheva et al. study did not, since it was not concerned with learning dynamics. The strengths of the elegant Kornysheva study highlighted by the Reviewer—that the pre-planned sequence queuing gradient of sequence actions was independent of the effectors or timings used—is precisely due to the fact that participants were selecting between sequence options that had been previously—and equivalently—learned. The decoders in the Kornynsheva study were trained to classify effector- and timing-independent sequence position information— by design—so it is not surprising that this is the information they reflect.

      The questions asked in our study were different: 1) Do the neural representations of the same sequence action executed in different skill (ordinal sequence) locations differentiate (contextualize) during early learning?  and 2) Is the observed contextualization specific to the learned sequence? Thus, while Kornysheva et al. aimed to “dissociate ordinal position information from information about the moving effectors”, we tested various untrained sequences on Day 2 allowing us to determine that the contextualization result was specific to the trained sequence. By using this approach, we avoided interference effects on the learning of the primary skill caused by simultaneous acquisition of a second skill.

      Such temporal overlap of consecutive, independent finger representations may also account for the dynamics of "ordinal coding"/"contextualization", i.e., the increase in 2class decoding accuracy, across Day 1 (Figure 4C). As learning progresses, both tapping speed and the consistency of keypress transition times increase (Figure 1), i.e., consecutive keypresses are closer in time, and more consistently so. As a result, information related to a given keypress is increasingly overlapping in time with information related to the preceding and subsequent keypresses. The authors seem to argue that their regression analysis in Figure 5 - figure supplement 3 speaks against any influence of tapping speed on "ordinal coding" (even though that argument is not made explicitly in the manuscript). However, Figure 5 - figure supplement 3 shows inter-individual differences in a between-subject analysis (across trials, as in panel A, or separately for each trial, as in panel B), and, therefore, says little about the within-subject dynamics of "ordinal coding" across the experiment. A regression of trial-by-trial "ordinal coding" on trial-by-trial tapping speed (either within-subject, or at a group-level, after averaging across subjects) could address this issue. Given the highly similar dynamics of "ordinal coding" on the one hand (Figure 4C), and tapping speed on the other hand (Figure 1B), I would expect a strong relationship between the two in the suggested within-subject (or group-level) regression. 

      The aim of the between-subject regression analysis presented in the Results (see below) and in Figure 5—figure supplement 7 (previously Figure 5—figure supplement 3) of the revised manuscript, was to rule out a general effect of tapping speed on the magnitude of contextualization observed. If temporal overlap of neural representations was driving their differentiation, then participants typing at higher speeds should also show greater contextualization scores. We made the decision to use a between-subject analysis to address this issue since within-subject skill speed variance was rather small over most of the training session. 

      The Reviewer’s request that we additionally carry-out a “regression of trial-by-trial "ordinal coding" on trial-by-trial tapping speed (either within-subject, or at a group-level, after averaging across subjects)” is essentially the same request of Reviewer 2 above. That request was to perform a modified simple linear regression analysis where the predictor is the sum the 4-4 and 4-1 transition times, since these transitions are where any temporal overlaps of neural representations would occur.  A new Figure 5 – figure supplement 6 in the revised manuscript includes a scatter plot showing the sum of adjacent index finger keypress transition times (i.e. – the 4-4 transition at the conclusion of one sequence iteration and the 4-1 transition at the beginning of the next sequence iteration) versus online contextualization distances measured during practice trials. Both the keypress transition times and online contextualization scores were z-score normalized within individual subjects, and then concatenated into a single data superset. As is clear in the figure data, results of the regression analysis showed a very weak linear relationship between the two (R<sup>2</sup> = 0.00507, F[1,3202] = 16.3). Thus, contextualization score magnitudes do not reflect the amount of overlap between adjacent keypresses when assessed either within- or between-subject.

      The revised manuscript now states:

      Results (lines 318-328):

      “Within-subject correlations were consistent with these group-level findings. The average correlation between offline contextualization and micro-offline gains within individuals was significantly greater than zero (Figure 5 – figure supplement 4, left; t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76) and stronger than correlations between online contextualization and either micro-online (Figure 5 – figure supplement 4, middle; t = 3.28, p = 0.0015, df = 25, Cohen's d = 1.2) or micro-offline gains (Figure 5 – figure supplement 4, right; t = 3.7021, p = 5.3013e-04, df = 25, Cohen's d = 0.69). These findings were not explained by behavioral changes of typing rhythm (t = -0.03, p = 0.976; Figure 5 – figure supplement 5), adjacent keypress transition times (R<sup>2</sup> = 0.00507, F[1,3202] = 16.3; Figure 5 – figure supplement 6), or overall typing speed (between-subject; R<sup>2</sup> = 0.028, p \= 0.41; Figure 5 – figure supplement 7).”

      Furthermore, learning should increase the number of (consecutively) correct sequences, and, thus, the consistency of finger transitions. Therefore, the increase in 2-class decoding accuracy may simply reflect an increasing overlap in time of increasingly consistent information from consecutive keypresses, which allows the classifier to dissociate the first and fifth keypress more reliably as learning progresses, simply based on the characteristic finger transitions associated with each. In other words, given that the physical context of a given keypress changes as learning progresses - keypresses move closer together in time and are more consistently correct - it seems problematic to conclude that the mental representation of that context changes. To draw that conclusion, the physical context should remain stable (or any changes to the physical context should be controlled for). 

      The revised manuscript now addresses specifically the question of mixing of temporally overlapping information:

      Results (Lines 310-328)

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches). Conversely, online contextualization (using either measurement approach) did not explain early online learning gains (i.e. – Figure 5 – figure supplement 3). Within-subject correlations were consistent with these group-level findings. The average correlation between offline contextualization and micro-offline gains within individuals was significantly greater than zero (Figure 5 – figure supplement 4, left; t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76) and stronger than correlations between online contextualization and either micro-online (Figure 5 – figure supplement 4, middle; t = 3.28, p = 0.0015, df = 25, Cohen's d = 1.2) or micro-offline gains (Figure 5 – figure supplement 4, right; t = 3.7021, p = 5.3013e-04, df = 25, Cohen's d = 0.69). These findings were not explained by behavioral changes of typing rhythm (t = -0.03, p = 0.976; Figure 5 – figure supplement 5), adjacent keypress transition times (R<sup>2</sup> = 0.00507, F[1,3202] = 16.3; Figure 5 – figure supplement 6), or overall typing speed (between-subject; R<sup>2</sup> = 0.028, p \= 0.41; Figure 5 – figure supplement 7). “

      Discussion (Lines 417-423)

      “Offline contextualization was not driven by trial-by-trial behavioral differences, including typing rhythm (Figure 5 – figure supplement 5) and adjacent keypress transition times (Figure 5 – figure supplement 6) nor by between-subject differences in overall typing speed (Figure 5 – figure supplement 7)—ruling out a reliance on differences in the temporal overlap of keypresses. Importantly, offline contextualization documented on Day 1 stabilized once a performance plateau was reached (trials 11-36), and was retained on Day 2, documenting overnight consolidation of the differentiated neural representations.”

      A similar difference in physical context may explain why neural representation distances ("differentiation") differ between rest and practice (Figure 5). The authors define "offline differentiation" by comparing the hybrid space features of the last index finger movement of a trial (ordinal position 5) and the first index finger movement of the next trial (ordinal position 1). However, the latter is not only the first movement in the sequence but also the very first movement in that trial (at least in trials that started with a correct sequence), i.e., not preceded by any recent movement. In contrast, the last index finger of the last correct sequence in the preceding trial includes the characteristic finger transition from the fourth to the fifth movement. Thus, there is more overlapping information arising from the consistent, neighbouring keypresses for the last index finger movement, compared to the first index finger movement of the next trial. A strong difference (larger neural representation distance) between these two movements is, therefore, not surprising, given the task design, and this difference is also expected to increase with learning, given the increase in tapping speed, and the consequent stronger overlap in representations for consecutive keypresses. Furthermore, initiating a new sequence involves pre-planning, while ongoing practice relies on online planning (Ariani et al., eNeuro 2021), i.e., two mental operations that are dissociable at the level of neural representation (Ariani et al., bioRxiv 2023).  

      The revised manuscript now addresses specifically the question of pre-planning:

      Results (lines 310-318):

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches). Conversely, online contextualization (using either measurement approach) did not explain early online learning gains (i.e. – Figure 5 – figure supplement 3).”

      Discussion (lines 408-416):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1). This result remained unchanged when measuring offline contextualization between the last and second sequence of consecutive trials, inconsistent with a possible confounding effect of pre-planning [30] (Figure 5 – figure supplement 2A). On the other hand, online contextualization did not predict learning (Figure 5 – figure supplement 3). Consistent with these results the average within-subject correlation between offline contextualization and micro-offline gains was significantly stronger than within-subject correlations between online contextualization and either micro-online or micro-offline gains (Figure 5 – figure supplement 4).”

      A further complication in interpreting the results stems from the visual feedback that participants received during the task. Each keypress generated an asterisk shown above the string on the screen. It is not clear why the authors introduced this complicating visual feedback in their task, besides consistency with their previous studies. The resulting systematic link between the pattern of visual stimulation (the number of asterisks on the screen) and the ordinal position of a keypress makes the interpretation of "contextual information" that differentiates between ordinal positions difficult. During the review process, the authors reported a confusion matrix from a classification of asterisks position based on eye tracking data recorded during the task and concluded that the classifier performed at chance level and gaze was, thus, apparently not biased by the visual stimulation. However, the confusion matrix showed a huge bias that was difficult to interpret (a very strong tendency to predict one of the five asterisk positions, despite chance-level performance). Without including additional information for this analysis (or simply the gaze position as a function of the number of astersisk on the screen) in the manuscript, this important control analysis cannot be properly assessed, and is not available to the public.  

      We now include the gaze position data requested by the Reviewer alongside the confusion matrix results in Figure 4 – figure supplement 3.

      Results (lines 207-211):

      “An alternate decoder trained on ICA components labeled as movement or physiological artefacts (e.g. – head movement, ECG, eye movements and blinks; Figure 3 – figure supplement 3A, D) and removed from the original input feature set during the pre-processing stage approached chance-level performance (Figure 4 – figure supplement 3), indicating that the 4-class hybrid decoder results were not driven by task-related artefacts.” Results (lines 261-268):

      “As expected, the 5-class hybrid-space decoder performance approached chance levels when tested with randomly shuffled keypress labels (18.41%± SD 7.4% for Day 1 data; Figure 4 – figure supplement 3C). Task-related eye movements did not explain these results since an alternate 5-class hybrid decoder constructed from three eye movement features (gaze position at the KeyDown event, gaze position 200ms later, and peak eye movement velocity within this window; Figure 4 – figure supplement 3A) performed at chance levels (cross-validated test accuracy = 0.2181; Figure 4 – figure supplement 3B, C). “

      Discussion (Lines 362-368):

      “Task-related movements—which also express in lower frequency ranges—did not explain these results given the near chance-level performance of alternative decoders trained on (a) artefact-related ICA components removed during MEG preprocessing (Figure 3 – figure supplement 3A-C) and on (b) task-related eye movement features (Figure 4 – figure supplement 3B, C). This explanation is also inconsistent with the minimal average head motion of 1.159 mm (± 1.077 SD) across the MEG recording (Figure 3 – figure supplement 3D).”

      The rationale for the task design including the asterisks is presented below:

      Methods (Lines 500-514)

      “The five-item sequence was displayed on the computer screen for the duration of each practice round and participants were directed to fix their gaze on the sequence. Small asterisks were displayed above a sequence item after each successive keypress, signaling the participants' present position within the sequence. Inclusion of this feedback minimizes working memory loads during task performance [73]. Following the completion of a full sequence iteration, the asterisk returned to the first sequence item. The asterisk did not provide error feedback as it appeared for both correct and incorrect keypresses. At the end of each practice round, the displayed number sequence was replaced by a string of five "X" symbols displayed on the computer screen, which remained for the duration of the rest break. Participants were instructed to focus their gaze on the screen during this time. The behavior in this explicit, motor learning task consists of generative action sequences rather than sequences of stimulus-induced responses as in the serial reaction time task (SRTT). A similar real-world example would be manually inputting a long password into a secure online application in which one intrinsically generates the sequence from memory and receives similar feedback about the password sequence position (also provided as asterisks), which is typically ignored by the user.”

      The authors report a significant correlation between "offline differentiation" and cumulative micro-offline gains. However, this does not address the question whether there is a trial-by-trial relation between the degree of "contextualization" and the amount of micro-offline gains - i.e., the question whether performance changes (micro-offline gains) are less pronounced across rest periods for which the change in "contextualization" is relatively low. The single-subject correlation between contextualization changes "during" rest and micro-offline gains (Figure 5 - figure supplement 4) addresses this question, however, the critical statistical test (are correlation coefficients significantly different from zero) is not included. Given the displayed distribution, it seems unlikely that correlation coefficients are significantly above zero. 

      As recommend by the Reviewer, we now include one-way right-tailed t-test results which provide further support to the previously reported finding. The mean of within-subject correlations between offline contextualization and cumulative micro-offline gains was significantly greater than zero (t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76; see Figure 5 – figure supplement 4, left), while correlations for online contextualization versus cumulative micro-online (t = -1.14, p = 0.8669, df = 25, Cohen's d = -0.22) or micro-offline gains t = -0.097, p = 0.5384, df = 25, Cohen's d = -0.019) were not. We have incorporated the significant one-way t-test for offline contextualization and cumulative micro-offline gains in the Results section of the revised manuscript (lines 313-318) and the Figure 5 – figure supplement 4 legend.

      The authors follow the assumption that micro-offline gains reflect offline learning.

      However, there is no compelling evidence in the literature, and no evidence in the present manuscript, that micro-offline gains (during any training phase) reflect offline learning. Instead, emerging evidence in the literature indicates that they do not (Das et al., bioRxiv 2024), and instead reflect transient performance benefits when participants train with breaks, compared to participants who train without breaks, however, these benefits vanish within seconds after training if both groups of participants perform under comparable conditions (Das et al., bioRxiv 2024). During the review process, the authors argued that differences in the design between Das et al. (2024) on the one hand (Experiments 1 and 2), and the study by Bönstrup et al. (2019) on the other hand, may have prevented Das et al. (2024) from finding the assumed (lasting) learning benefit by micro-offline consolidation. However, the Supplementary Material of Das et al. (2024) includes an experiment (Experiment S1) whose design closely follows the early learning phase of Bönstrup et al. (2019), and which, nevertheless, demonstrates that there is no lasting benefit of taking breaks for the acquired skill level, despite the presence of micro-offline gains. 

      We thank the Reviewer for alerting us to this new data added to the revised supplementary materials of Das et al. (2024) posted to bioRxiv. However, despite the Reviewer’s claim to the contrary, a careful comparison between the Das et al and Bönstrup et al studies reveal more substantive differences than similarities and does not “closely follows a large proportion of the early learning phase of Bönstrup et al. (2019)” as stated. 

      In the Das et al. Experiment S1, sixty-two participants were randomly assigned to “with breaks” or “no breaks” skill training groups. The “with breaks” group alternated 10 seconds of skill sequence practice with 10 seconds of rest over seven trials (2 min and 2 sec total training duration). This amounts to 66.7% of the early learning period defined by Bönstrup et al. (2019) (i.e. - eleven 10-second-long practice periods interleaved with ten 10-second-long rest breaks; 3 min 30 sec total training duration).  

      Also, please note that while no performance feedback nor reward was given in the Bönstrup et al. (2019) study, participants in the Das et al. study received explicit performance-based monetary rewards, a potentially crucial driver of differentiated behavior between the two studies:

      “Participants were incentivized with bonus money based on the total number of correct sequences completed throughout the experiment.”

      The “no breaks” group in the Das et al. study practiced the skill sequence for 70 continuous seconds. Both groups (despite one being labeled “no breaks”) follow training with a long 3-minute break (also note that since the “with breaks” group ends with 10 seconds of rest their break is actually longer), before finishing with a skill “test” over a continuous 50-second-long block. During the 70 seconds of training, the “with breaks” group shows more learning than the “no breaks” group. Interestingly, following the long 3minute break the “with breaks” group display a performance drop (relative to their performance at the end of training) that is stable over the full 50-second test, while the “no breaks” group shows an immediate performance improvement following the long break that continues to increase over the 50-second test.  

      Separately, there are important issues regarding the Das et al. study that should be considered through the lens of recent findings not referred to in the preprint. A major element of their experimental design is that both groups—“with breaks” and “no breaks”— actually receive quite a long 3-minute break just before the skill test. This long break is more than 2.5x the cumulative interleaved rest experienced by the “with breaks” group. Thus, although the design is intended to contrast the presence or absence of rest “breaks”, that difference between groups is no longer maintained at the point of the skill test. 

      The Das et al. results are most consistent with an alternative interpretation of the data— that the “no breaks” group experiences offline learning during their long 3-minute break. This is supported by the recent work of Griffin et al. (2025) where micro-array recordings from primary and premotor cortex were obtained from macaque monkeys while they performed blocks of ten continuous reaching sequences up to 81.4 seconds in duration (see source data for Extended Data Figure 1h) with 90 seconds of interleaved rest. Griffin et al. observed offline improvement in skill immediately following the rest break that was causally related to neural reactivations (i.e. – neural replay) that occurred during the rest break. Importantly, the highest density of reactivations was present in the very first 90second break between Blocks 1 and 2 (see Fig. 2f in Griffin et al., 2025). This supports the interpretation that both the “with breaks” and “no breaks” group express offline learning gains, with these gains being delayed in the “no breaks” group due to the practice schedule.

      On the other hand, if offline learning can occur during this longer break, then why would the “with breaks” group show no benefit? Again, it could be that most of the offline gains for this group were front-loaded during the seven shorter 10-second rest breaks. Another possible, though not mutually exclusive, explanation is that the observed drop in performance in the “with breaks” group is driven by contextual interference. Specifically, similar to Experiments 1 and 2 in Das et al. (2024), the skill test is conducted under very different conditions than those which the “with breaks” group practiced the skill under (short bursts of practiced alternating with equally short breaks). On the other hand, the “no breaks” group is tested (50 seconds of continuous practice) under quite similar conditions to their training schedule (70 seconds of continuous practice). Thus, it is possible that this dissimilarity between training and test could lead to reduced performance in the “with breaks” group.

      We made the following manuscript revisions related to these important issues: 

      Introduction (Lines 26-56)

      “Practicing a new motor skill elicits rapid performance improvements (early learning) [1] that precede skill performance plateaus [5]. Skill gains during early learning accumulate over rest periods (micro-offline) interspersed with practice [1, 6-10], and are up to four times larger than offline performance improvements reported following overnight sleep [1]. During this initial interval of prominent learning, retroactive interference immediately following each practice interval reduces learning rates relative to interference after passage of time, consistent with stabilization of the motor memory [11]. Micro-offline gains observed during early learning are reproducible [7, 10-13] and are similar in magnitude even when practice periods are reduced by half to 5 seconds in length, thereby confirming that they are not merely a result of recovery from performance fatigue [11]. Additionally, they are unaffected by the random termination of practice periods, which eliminates the possibility of predictive motor slowing as a contributing factor [11]. Collectively, these behavioral findings point towards the interpretation that micro offline gains during early learning represent a form of memory consolidation [1]. 

      This interpretation has been further supported by brain imaging and electrophysiological studies linking known memory-related networks and consolidation mechanisms to rapid offline performance improvements. In humans, the rate of hippocampo-neocortical neural replay predicts micro-offline gains [6]. Consistent with these findings, Chen et al. [12] and Sjøgård et al. [13] furnished direct evidence from intracranial human EEG studies, demonstrating a connection between the density of hippocampal sharp-wave ripples (80-120 Hz)—recognized markers of neural replay—and micro-offline gains during early learning. Further, Griffin et al. reported that neural replay of task-related ensembles in the motor cortex of macaques during brief rest periods— akin to those observed in humans [1, 6-8, 14]—are not merely correlated with, but are causal drivers of micro-offline learning [15]. Specifically, the same reach directions that were replayed the most during rest breaks showed the greatest reduction in path length (i.e. – more efficient movement path between two locations in the reach sequence) during subsequent trials, while stimulation applied during rest intervals preceding performance plateau reduced reactivation rates and virtually abolished micro-offline gains [15]. Thus, converging evidence in humans and non-human primates across indirect non-invasive and direct invasive recording techniques link hippocampal activity, neural replay dynamics and offline skill gains in early motor learning that precede performance plateau.”

      Next, in the Methods, we articulate important constrains formulated by Pan and Rickard and Bonstrup et al for meaningful measurements:

      Methods (Lines 493-499)

      “The study design followed specific recommendations by Pan and Rickard (2015): 1) utilizing 10-second practice trials and 2) constraining analysis of micro-offline gains to early learning trials (where performance monotonically increases and 95% of overall performance gains occur) that precede the emergence of “scalloped” performance dynamics strongly linked to reactive inhibition effects ( [29, 72]). This is precisely the portion of the learning curve Pan and Rickard referred to when they stated “…rapid learning during that period masks any reactive inhibition effect” [29].”

      We finally discuss the implications of neglecting some or all of these recommendations:

      Discussion (Lines 444-452):

      “Finally, caution should be exercised when extrapolating findings during early skill learning, a period of steep performance improvements, to findings reported after insufficient practice [67], post-plateau performance periods [68], or non-learning situations (e.g. performance of non-repeating keypress sequences in  [67]) when reactive inhibition or contextual interference effects are prominent. Ultimately, it will be important to develop new paradigms allowing one to independently estimate the different coincident or antagonistic features (e.g. - memory consolidation, planning, working memory and reactive inhibition) contributing to micro-online and micro-offline gains during and after early skill learning within a unifying framework.”

      Along these lines, the authors' claim, based on Bönstrup et al. 2020, that "retroactive interference immediately following practice periods reduces micro-offline learning", is not supported by that very reference. Citing Bönstrup et al. (2020), "Regarding early learning dynamics (trials 1-5), we found no differences in microscale learning parameters (micro online/offline) or total early learning between both interference groups." That is, contrary to Dash et al.'s current claim, Bönstrup et al. (2020) did not find any retroactive interference effect on the specific behavioral readout (micro-offline gains) that the authors assume to reflect consolidation. 

      Please, note that the Bönstrup et al. 2020 paper abstract states: 

      “Third, retroactive interference immediately after each practice period reduced the learning rate relative to interference after passage of time (N = 373), indicating stabilization of the motor memory at a microscale of several seconds.”

      which is further supported by this statement in the Results: 

      “The model comprised three parameters representing the initial performance, maximum performance and learning rate (see Eq. 1, “Methods”, “Data Analysis” section). We then statistically compared the model parameters between the interference groups (Fig. 2d). The late interference group showed a higher learning rate compared with the early interference group (late: 0.26 ± 0.23, early: 2.15 ± 0.20, P=0.04). The effect size of the group difference was small to medium (Cohen’s d 0.15)[29]. Similar differences with a stronger rise in the learning curve of a late interference groups vs. an early interference group were found in a smaller sample collected in the lab environment (Supplementary Fig. 3).”

      We have modified the statement in the revised manuscript to specify that the difference observed was between learning rates: Introduction (Lines 30-32)

      “During this initial interval of prominent learning, retroactive interference immediately following each practice interval reduces learning rates relative to interference after passage of time, consistent with stabilization of the motor memory [11].”

      The authors conclude that performance improves, and representation manifolds differentiate, "during" rest periods (see, e.g., abstract). However, micro-offline gains (as well as offline contextualization) are computed from data obtained during practice, not rest, and may, thus, just as well reflect a change that occurs "online", e.g., at the very onset of practice (like pre-planning) or throughout practice (like fatigue, or reactive inhibition).  

      The Reviewer raises again the issue of a potential confound of “pre-planning” on our contextualization measures as in the comment above: 

      “Furthermore, initiating a new sequence involves pre-planning, while ongoing practice relies on online planning (Ariani et al., eNeuro 2021), i.e., two mental operations that are dissociable at the level of neural representation (Ariani et al., bioRxiv 2023).”

      The cited studies by Ariani et al. indicate that effects of pre-planning are likely to impact the first 3 keypresses of the initial sequence iteration in each trial. As stated in the response to this comment above, we conducted a control analysis of contextualization that ignores the first sequence iteration in each trial to partial out any potential preplanning effect. This control analyses yielded comparable results, indicating that preplanning is not a major driver of our reported contextualization effects. We now report this in the revised manuscript:

      We also state in the Figure 1 legend (Lines 99-103) in the revised manuscript that preplanning has no effect on the behavioral measures of micro-offline and micro-online gains in our dataset:

      The Reviewer also raises the issue of possible effects stemming from “fatigue” and “reactive inhibition” which inhibit performance and are indeed relevant to skill learning studies. We designed our task to specifically mitigate these effects. We now more clearly articulate this rationale in the description of the task design as well as the measurement constraints essential for minimizing their impact.

      We also discuss the implications of fatigue and reactive inhibition effects in experimental designs that neglect to follow these recommendations formulated by Pan and Rickard in the Discussion section and propose how this issue can be better addressed in future investigations.

      To summarize, the results of our study indicate that: (a) offline contextualization effects are not explained by pre-planning of the first action sequence iteration in each practice trial; and (b) the task design implemented in this study purposefully minimize any possible effects of reactive inhibition or fatigue.  Circling back to the Reviewer’s proposal that “contextualization…may just as well reflect a change that occurs "online"”, we show in this paper direct empirical evidence that contextualization develops to a greater extent across rest periods rather than across practice trials, contrary to the Reviewer’s proposal.  

      That is, the definition of micro-offline gains (as well as offline contextualization) conflates online and "offline" processes. This becomes strikingly clear in the recent Nature paper by Griffin et al. (2025), who computed micro-offline gains as the difference in average performance across the first five sequences in a practice period (a block, in their terminology) and the last five sequences in the previous practice period. Averaging across sequences in this way minimises the chance to detect online performance changes and inflates changes in performance "offline". The problem that "online" gains (or contextualization) is actually computed from data entirely generated online, and therefore subject to processes that occur online, is inherent in the very definition of micro-online gains, whether, or not, they computed from averaged performance.

      We would like to make it clear that the issue raised by the Reviewer with respect to averaging across sequences done in the Griffin et al. (2025) study does not impact our study in any way. The primary skill measure used in all analyses reported in our paper is not temporally averaged. We estimated instantaneous correct sequence speed over the entire trial. Once the first sequence iteration within a trial is completed, the speed estimate is then updated at the resolution of individual keypresses. All micro-online and -offline behavioral changes are measured as the difference in instantaneous speed at the beginning and end of individual practice trials.

      Methods (lines 528-530):

      “The instantaneous correct sequence speed was calculated as the inverse of the average KTT across a single correct sequence iteration and was updated for each correct keypress.”

      The instantaneous speed measure used in our analyses, in fact, maximizes the likelihood of detecting changes in online performance, as the Reviewer indicates.  Despite this optimally sensitive measurement of online changes, our findings remained robust, consistently converging on the same outcome across our original analyses and the multiple controls recommended by the reviewers. Notably, online contextualization changes are significantly weaker than offline contextualization in all comparisons with different measurement approaches.

      Results (lines 302-309)

      “The Euclidian distance between neural representations of Index<sub>OP1</sub> (i.e. - index finger keypress at ordinal position 1 of the sequence) and Index<sub>OP5</sub> (i.e. - index finger keypress at ordinal position 5 of the sequence) increased progressively during early learning (Figure 5A)—predominantly during rest intervals (offline contextualization) rather than during practice (online) (t = 4.84, p < 0.001, df = 25, Cohen's d = 1.2; Figure 5B; Figure 5 – figure supplement 1A). An alternative online contextualization determination equalling the time interval between online and offline comparisons (Trial-based; 10 seconds between Index<sub>OP1</sub> and Index<sub>OP5</sub> observations in both cases) rendered a similar result (Figure 5 – figure supplement 2B).

      Results (lines 316-318)

      “Conversely, online contextualization (using either measurement approach) did not explain early online learning gains (i.e. – Figure 5 – figure supplement 3).”

      Results (lines 318-328)

      “Within-subject correlations were consistent with these group-level findings. The average correlation between offline contextualization and micro-offline gains within individuals was significantly greater than zero (Figure 5 – figure supplement 4, left; t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76) and stronger than correlations between online contextualization and either micro-online (Figure 5 – figure supplement 4, middle; t = 3.28, p = 0.0015, df = 25, Cohen's d = 1.2) or microoffline gains (Figure 5 – figure supplement 4, right; t = 3.7021, p = 5.3013e-04, df = 25, Cohen's d = 0.69). These findings were not explained by behavioral changes of typing rhythm (t = -0.03, p = 0.976; Figure 5 – figure supplement 5), adjacent keypress transition times (R<sup>2</sup> = 0.00507, F[1,3202] = 16.3; Figure 5 – figure supplement 6), or overall typing speed (between-subject; R<sup>2</sup> = 0.028, p \= 0.41; Figure 5 – figure supplement 7).”

      We disagree with the Reviewer’s statement that “the definition of micro-offline gains (as well as offline contextualization) conflates online and "offline" processes”.  From a strictly behavioral point of view, it is obviously true that one can only measure skill (rather than the absence of it during rest) to determine how it changes over time.  While skill changes surrounding rest are used to infer offline learning processes, recovery of skill decay following intense practice is used to infer “unmeasurable” recovery from fatigue or reactive inhibition. In other words, the alternative processes proposed by the Reviewer also rely on the same inferential reasoning. 

      Importantly, inferences can be validated through the identification of mechanisms. Our experiment constrained the study to evaluation of changes in neural representations of the same action in different contexts, while minimized the impact of mechanisms related to fatigue/reactive inhibition [13, 14]. In this way, we observed that behavioral gains and neural contextualization occurs to a greater extent over rest breaks rather than during practice trials and that offline contextualization changes strongly correlate with the offline behavioral gains, while online contextualization does not. This result was supported by the results of all control analyses recommended by the Reviewers. Specifically:

      Methods (Lines 493-499)

      “The study design followed specific recommendations by Pan and Rickard (2015): 1) utilizing 10-second practice trials and 2) constraining analysis of micro-offline gains to early learning trials (where performance monotonically increases and 95% of overall performance gains occur) that precede the emergence of “scalloped” performance dynamics strongly linked to reactive inhibition effects ( [29, 72]). This is precisely the portion of the learning curve Pan and Rickard referred to when they stated “…rapid learning during that period masks any reactive inhibition effect” [29].”

      And Discussion (Lines 444-448):

      “Finally, caution should be exercised when extrapolating findings during early skill learning, a period of steep performance improvements, to findings reported after insufficient practice [67], post-plateau performance periods [68], or non-learning situations (e.g. performance of non-repeating keypress sequences in  [67]) when reactive inhibition or contextual interference effects are prominent.”

      Next, we show that offline contextualization is greater than online contextualization and predicts offline behavioral gains across all measurement approaches, including all controls suggested by the Reviewer’s comments and recommendations. 

      Results (lines 302-318):

      “The Euclidian distance between neural representations of Index<sub>OP1</sub> (i.e. - index finger keypress at ordinal position 1 of the sequence) and Index<sub>OP5</sub> (i.e. - index finger keypress at ordinal position 5 of the sequence) increased progressively during early learning (Figure 5A)—predominantly during rest intervals (offline contextualization) rather than during practice (online) (t = 4.84, p < 0.001, df = 25, Cohen's d = 1.2; Figure 5B; Figure 5 – figure supplement 1A). An alternative online contextualization determination equalling the time interval between online and offline comparisons (Trial-based; 10 seconds between Index<sub>OP1</sub> and Index<sub>OP5</sub> observations in both cases) rendered a similar result (Figure 5 – figure supplement 2B).

      Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches). Conversely, online contextualization (using either measurement approach) did not explain early online learning gains (i.e. – Figure 5 – figure supplement 3).”

      Results (lines 318-324)

      “Within-subject correlations were consistent with these group-level findings. The average correlation between offline contextualization and micro-offline gains within individuals was significantly greater than zero (Figure 5 – figure supplement 4, left; t = 3.87, p = 0.00035, df = 25, Cohen's d = 0.76) and stronger than correlations between online contextualization and either micro-online (Figure 5 – figure supplement 4, middle; t = 3.28, p = 0.0015, df = 25, Cohen's d = 1.2) or microoffline gains (Figure 5 – figure supplement 4, right; t = 3.7021, p = 5.3013e-04, df = 25, Cohen's d = 0.69).”

      Discussion (lines 408-416):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1). This result remained unchanged when measuring offline contextualization between the last and second sequence of consecutive trials, inconsistent with a possible confounding effect of pre-planning [30] (Figure 5 – figure supplement 2A). On the other hand, online contextualization did not predict learning (Figure 5 – figure supplement 3). Consistent with these results the average within-subject correlation between offline contextualization and micro-offline gains was significantly stronger than within subject correlations between online contextualization and either micro-online or micro-offline gains (Figure 5 – figure supplement 4).”

      We then show that offline contextualization is not explained by pre-planning of the first action sequence:

      Results (lines 310-316):

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R<sup>2</sup> = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches).”

      Discussion (lines 409-412):

      “This result remained unchanged when measuring offline contextualization between the last and second sequence of consecutive trials, inconsistent with a possible confounding effect of pre-planning [30] (Figure 5 – figure supplement 2A).”

      In summary, none of the presented evidence in this paper—including results of the multiple control analyses carried out in response to the Reviewers’ recommendations— supports the Reviewer’s position. 

      Please note that the micro-offline learning "inference" has extensive mechanistic support across species and neural recording techniques (see Introduction, lines 26-56). In contrast, the reactive inhibition "inference," which is the Reviewer's alternative interpretation, has no such support yet [15].

      Introduction (Lines 26-56)

      “Practicing a new motor skill elicits rapid performance improvements (early learning) [1] that precede skill performance plateaus [5]. Skill gains during early learning accumulate over rest periods (micro-offline) interspersed with practice [1, 6-10], and are up to four times larger than offline performance improvements reported following overnight sleep [1]. During this initial interval of prominent learning, retroactive interference immediately following each practice interval reduces learning rates relative to interference after passage of time, consistent with stabilization of the motor memory [11]. Micro-offline gains observed during early learning are reproducible [7, 10-13] and are similar in magnitude even when practice periods are reduced by half to 5 seconds in length, thereby confirming that they are not merely a result of recovery from performance fatigue [11]. Additionally, they are unaffected by the random termination of practice periods, which eliminates the possibility of predictive motor slowing as a contributing factor [11]. Collectively, these behavioral findings point towards the interpretation that microoffline gains during early learning represent a form of memory consolidation [1]. 

      This interpretation has been further supported by brain imaging and electrophysiological studies linking known memory-related networks and consolidation mechanisms to rapid offline performance improvements. In humans, the rate of hippocampo-neocortical neural replay predicts micro-offline gains [6].

      Consistent with these findings, Chen et al. [12] and Sjøgård et al. [13] furnished direct evidence from intracranial human EEG studies, demonstrating a connection between the density of hippocampal sharp-wave ripples (80-120 Hz)—recognized markers of neural replay—and micro-offline gains during early learning. Further, Griffin et al. reported that neural replay of task-related ensembles in the motor cortex of macaques during brief rest periods— akin to those observed in humans [1, 6-8, 14]—are not merely correlated with, but are causal drivers of micro-offline learning [15]. Specifically, the same reach directions that were replayed the most during rest breaks showed the greatest reduction in path length (i.e. – more efficient movement path between two locations in the reach sequence) during subsequent trials, while stimulation applied during rest intervals preceding performance plateau reduced reactivation rates and virtually abolished micro-offline gains [15]. Thus, converging evidence in humans and non-human primates across indirect non-invasive and direct invasive recording techniques link hippocampal activity, neural replay dynamics and offline skill gains in early motor learning that precede performance plateau.”

      That said, absence of evidence, is not evidence of absence and for that reason we also state in the Discussion (lines 448-452):

      A simple control analysis based on shuffled class labels could lend further support to the authors' complex decoding approach. As a control analysis that completely rules out any source of overfitting, the authors could test the decoder after shuffling class labels. Following such shuffling, decoding accuracies should drop to chance-level for all decoding approaches, including the optimized decoder. This would also provide an estimate of actual chance-level performance (which is informative over and beyond the theoretical chance level). During the review process, the authors reported this analysis to the reviewers. Given that readers may consider following the presented decoding approach in their own work, it would have been important to include that control analysis in the manuscript to convince readers of its validity. 

      As requested, the label-shuffling analysis was carried out for both 4- and 5-class decoders and is now reported in the revised manuscript.

      Results (lines 204-207):

      “Testing the keypress state (4-class) hybrid decoder performance on Day 1 after randomly shuffling keypress labels for held-out test data resulted in a performance drop approaching expected chance levels (22.12%± SD 9.1%; Figure 3 – figure supplement 3C).”

      Results (lines 261-264):

      “As expected, the 5-class hybrid-space decoder performance approached chance levels when tested with randomly shuffled keypress labels (18.41%± SD 7.4% for Day 1 data; Figure 4 – figure supplement 3C).”

      Furthermore, the authors' approach to cortical parcellation raises questions regarding the information carried by varying dipole orientations within a parcel (which currently seems to be ignored?) and the implementation of the mean-flipping method (given that there are two dimensions - space and time - it is unclear what the authors refer to when they talk about the sign of the "average source", line 477). 

      The revised manuscript now provides a more detailed explanation of the parcellation, and sign-flipping procedures implemented:

      Methods (lines 604-611):

      “Source-space parcellation was carried out by averaging all voxel time-series located within distinct anatomical regions defined in the Desikan-Killiany Atlas [31]. Since source time-series estimated with beamforming approaches are inherently sign-ambiguous, a custom Matlab-based implementation of the mne.extract_label_time_course with “mean_flip” sign-flipping procedure in MNEPython [78] was applied prior to averaging to prevent within-parcel signal cancellation. All voxel time-series within each parcel were extracted and the timeseries sign was flipped at locations where the orientation difference was greater than 90° from the parcel mode. A mean time-series was then computed across all voxels within the parcel after sign-flipping.”

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors): 

      Comments on the revision: 

      The authors have made large efforts to address all concerns raised. A couple of suggestions remain: 

      - formally show if and how movement artefacts may contribute to the signal and analysis; it seems that the authors have data to allow for such an analysis  

      We have implemented the requested control analyses addressing this issue. They are reported in: Results (lines 207-211 and 261-268), Discussion (Lines 362-368):

      - formally show that the signals from the intra- and inter parcel spaces are orthogonal. 

      Please note that, despite the Reviewer’s statement above, we never claim in the manuscript that the parcel-space and regional voxel-space features show “complete independence”. 

      Furthermore, the machine learning-based decoding methods used in the present study do not require input feature orthogonality, but instead non-redundancy [7], which is a requirement satisfied by our data (see below and the new Figure 2 – figure supplement 2 in the revised manuscript). Finally, our results already show that the hybrid space decoder outperformed all other methods even after input features were fully orthogonalized with LDA or PCA dimensionality reduction procedures prior to the classification step (Figure 3 – figure supplement 2).

      We also highlight several additional results that are informative regarding this issue. For example, if spatially overlapping parcel- and voxel-space time-series only provided redundant information, inclusion of both as input features should increase model overfitting to the training dataset and decrease overall cross-validated test accuracy [8]. In the present study however, we see the opposite effect on decoder performance. First, Figure 3 – figure supplements 1 & 2 clearly show that decoders constructed from hybrid-space features outperform the other input feature (sensor-, whole-brain parcel- and whole-brain voxel-) spaces in every case (e.g. – wideband, all narrowband frequency ranges, and even after the input space is fully orthogonalized through dimensionality reduction procedures prior to the decoding step). Furthermore, Figure 3 – figure supplement 6 shows that hybridspace decoder performance supers when parcel-time series that spatially overlap with the included regional voxel-spaces are removed from the input feature set.  We state in the Discussion (lines 353-356)

      “The observation of increased cross-validated test accuracy (as shown in Figure 3 – Figure Supplement 6) indicates that the spatially overlapping information in parcel- and voxel-space time-series in the hybrid decoder was complementary, rather than redundant [41].”

      To gain insight into the complimentary information contributed by the two spatial scales to the hybrid-space decoder, we first independently computed the matrix rank for whole-brain parcel- and voxel-space input features for each participant (shown in Author response image 1). The results indicate that whole-brain parcel-space input features are full rank (rank = 148) for all participants (i.e. - MEG activity is orthogonal between all parcels). The matrix rank of voxelspace input features (rank = 267± 17 SD), exceeded the parcel-space rank for all participants and approached the number of useable MEG sensor channels (n = 272). Thus, voxel-space features provide both additional and complimentary information to representations at the parcel-space scale.  

      Figure 2—figure Supplement 2 in the revised manuscript now shows that the degree of dependence between the two spatial scales varies over the regional voxel-space. That is, some voxels within a given parcel correlate strongly with the time-series of the parcel they belong to, while others do not. This finding is consistent with a documented increase in correlational structure of neural activity across spatial scales that does not reflect perfect dependency or orthogonality [9]. Notably, the regional voxel-spaces included in the hybridspace decoder are significantly less correlated with the averaged parcel-space time-series than excluded voxels. We now point readers to this new figure in the results.

      Taken together, these results indicate that the multi-scale information in the hybrid feature set is complimentary rather than orthogonal.  This is consistent with the idea that hybridspace features better represent multi-scale temporospatial dynamics reported to be a fundamental characteristic of how the brain stores and adapts memories, and generates behavior across species [9].

      Reviewer #2 (Recommendations for the authors):  

      I appreciate the authors' efforts in addressing the concerns I raised. The responses generally made sense to me. However, I had some trouble finding several corrections/additions that the authors claim they made in the revised manuscript: 

      "We addressed this question by conducting a new multivariate regression analysis to directly assess whether the neural representation distance score could be predicted by the 4-1, 2-4, and 4-4 keypress transition times observed for each complete correct sequence (both predictor and response variables were z-score normalized within-subject). The results of this analysis also affirmed that the possible alternative explanation that contextualization effects are simple reflections of increased mixing is not supported by the data (Adjusted R<sup>2</sup> = 0.00431; F = 5.62).  We now include this new negative control analysis in the revised manuscript."  

      This approach is now reported in the manuscript in the Results (Lines 324-328 and Figure 5-Figure Supplement 6 legend.

      "We strongly agree with the Reviewer that the issue of generalizability is extremely important and have added a new paragraph to the Discussion in the revised manuscript highlighting the strengths and weaknesses of our study with respect to this issue." 

      Discussion (Lines 436-441)

      “One limitation of this study is that contextualization was investigated for only one finger movement (index finger or digit 4) embedded within a relatively short 5-item skill sequence. Determining if representational contextualization is exhibited across multiple finger movements embedded within for example longer sequences (e.g. – two index finger and two little finger keypresses performed within a short piece of piano music) will be an important extension to the present results.”

      "We strongly agree with the Reviewer that any intended clinical application must carefully consider the specific input feature constraints dictated by the clinical cohort, and in turn impose appropriate and complimentary constraints on classifier parameters that may differ from the ones used in the present study. We now highlight this issue in the Discussion of the revised manuscript and relate our present findings to published clinical BCI work within this context."  

      Discussion (Lines 441-444)

      “While a supervised manifold learning approach (LDA) was used here because it optimized hybrid-space decoder performance, unsupervised strategies (e.g. - PCA and MDS, which also substantially improved decoding accuracy in the present study; Figure 3 – figure supplement 2) are likely more suitable for real-time BCI applications.”

      and 

      "The Reviewer makes a good point. We have now implemented the suggested normalization procedure in the analysis provided in the revised manuscript." 

      Results (lines 275-282)

      “We used a Euclidian distance measure to evaluate the differentiation of the neural representation manifold of the same action (i.e. - an index-finger keypress) executed within different local sequence contexts (i.e. - ordinal position 1 vs. ordinal position 5; Figure 5). To make these distance measures comparable across participants, a new set of classifiers was then trained with group-optimal parameters (i.e. – broadband hybrid-space MEG data with subsequent manifold extraction (Figure 3 – figure supplements 2) and LDA classifiers (Figure 3 – figure supplements 7) trained on 200ms duration windows aligned to the KeyDown event (see Methods, Figure 3 – figure supplements 5). “

      Where are they in the manuscript? Did I read the wrong version? It would be more helpful to specify with page/line numbers. Please also add the detailed procedure of the control/additional analyses in the Method. 

      As requested, we now refer to all manuscript revisions with specific line numbers. We have also included all detailed procedures related to any additional analyses requested by reviewers.

      I also have a few other comments back to the authors' following responses: 

      "Thus, increased overlap between the "4" and "1" keypresses (at the start of the sequence) and "2" and "4" keypresses (at the end of the sequence) could artefactually increase contextualization distances even if the underlying neural representations for the individual keypresses remain unchanged. One must also keep in mind that since participants repeat the sequence multiple times within the same trial, a majority of the index finger keypresses are performed adjacent to one another (i.e. - the "4-4" transition marking the end of one sequence and the beginning of the next). Thus, increased overlap between consecutive index finger keypresses as typing speed increased should increase their similarity and mask contextualization- related changes to the underlying neural representations."  "We also re-examined our previously reported classification results with respect to this issue. 

      We reasoned that if mixing effects reflecting the ordinal sequence structure is an important driver of the contextualization finding, these effects should be observable in the distribution of decoder misclassifications. For example, "4" keypresses would be more likely to be misclassified as "1" or "2" keypresses (or vice versa) than as "3" keypresses. The confusion matrices presented in Figures 3C and 4B and Figure 3-figure supplement 3A display a distribution of misclassifications that is inconsistent with an alternative mixing effect explanation of contextualization." 

      "Based upon the increased overlap between adjacent index finger keypresses (i.e. - "4-4" transition), we also reasoned that the decoder tasked with separating individual index finger keypresses into two distinct classes based upon sequence position, should show decreased performance as typing speed increases. However, Figure 4C in our manuscript shows that this is not the case. The 2-class hybrid classifier actually displays improved classification performance over early practice trials despite greater temporal overlap. Again, this is inconsistent with the idea that the contextualization effect simply reflects increased mixing of individual keypress features."  

      As the time window for MEG feature is defined after the onset of each press, it is more likely that the feature overlap is the current and the future presses, rather than the current and the past presses (of course the three will overlap at very fast typing speed). Therefore, for sequence 41324, if we note the planning-related processes by a Roman numeral, the overlapping features would be '4i', '1iii', '3ii', '2iv', and '4iv'. Assuming execution-related process (e.g., 1) and planning-related process (e.g., i) are not necessarily similar, especially in finer temporal resolution, the patterns for '4i' and '4iv' are well separated in terms of process 'i' and 'iv,' and this advantage will be larger in faster typing speed. This also applies to the other presses. Thus, the author's arguments about the masking of contextualization and misclassification due to pattern overlap seem odd. The most direct and probably easiest way to resolve this would be to use a shorter time window for the MEG feature. Some decrease in decoding accuracy in this case is totally acceptable for the science purpose.  

      The revised manuscript now includes analyses carried out with decoding time windows ranging from 50 to 250ms in duration. These additional results are now reported in:

      Results (lines 258-268):

      “The improved decoding accuracy is supported by greater differentiation in neural representations of the index finger keypresses performed at positions 1 and 5 of the sequence (Figure 4A), and by the trial-by-trial increase in 2-class decoding accuracy over early learning (Figure 4C) across different decoder window durations (Figure 4 – figure supplement 2). As expected, the 5-class hybrid-space decoder performance approached chance levels when tested with randomly shuffled keypress labels (18.41%± SD 7.4% for Day 1 data; Figure 4 – figure supplement 3C). Task-related eye movements did not explain these results since an alternate 5-class hybrid decoder constructed from three eye movement features (gaze position at the KeyDown event, gaze position 200ms later, and peak eye movement velocity within this window; Figure 4 – figure supplement 3A) performed at chance levels (crossvalidated test accuracy = 0.2181; Figure 4 – figure supplement 3B, C).”

      Results (lines 310-316):

      “Offline contextualization strongly correlated with cumulative micro-offline gains (r = 0.903, R² = 0.816, p < 0.001; Figure 5 – figure supplement 1A, inset) across decoder window durations ranging from 50 to 250ms (Figure 5 – figure supplement 1B, C). The offline contextualization between the final sequence of each trial and the second sequence of the subsequent trial (excluding the first sequence) yielded comparable results. This indicates that pre-planning at the start of each practice trial did not directly influence the offline contextualization measure [30] (Figure 5 – figure supplement 2A, 1st vs. 2nd Sequence approaches). “

      Discussion (lines 380-385):

      “The first hint of representational differentiation was the highest false-negative and lowest false-positive misclassification rates for index finger keypresses performed at different locations in the sequence compared with all other digits (Figure 3C). This was further supported by the progressive differentiation of neural representations of the index finger keypress (Figure 4A) and by the robust trial-by-trial increase in 2class decoding accuracy across time windows ranging between 50 and 250ms (Figure 4C; Figure 4 – figure supplement 2).”

      Discussion (lines 408-9):

      “Offline contextualization consistently correlated with early learning gains across a range of decoding windows (50–250ms; Figure 5 – figure supplement 1).”

      "We addressed this question by conducting a new multivariate regression analysis to directly assess whether the neural representation distance score could be predicted by the 4-1, 2-4 and 4-4 keypress transition times observed for each complete correct sequence" 

      For regression analysis, I recommend to use total keypress time per a sequence (or sum of 4-1 and 4-4) instead of specific transition intervals, because there likely exist specific correlational structure across the transition intervals. Using correlated regressors may distort the result.  

      This approach is now reported in the manuscript:

      Results (Lines 324-328) and Figure  5-Figure Supplement 6 legend.

      "We do agree with the Reviewer that the naturalistic, generative, self-paced task employed in the present study results in overlapping brain processes related to planning, execution, evaluation and memory of the action sequence. We also agree that there are several tradeoffs to consider in the construction of the classifiers depending on the study aim. Given our aim of optimizing keypress decoder accuracy in the present study, the set of tradeoffs resulted in representations reflecting more the latter three processes, and less so the planning component. Whether separate decoders can be constructed to tease apart the representations or networks supporting these overlapping processes is an important future direction of research in this area. For example, work presently underway in our lab constrains the selection of windowing parameters in a manner that allows individual classifiers to be temporally linked to specific planning, execution, evaluation or memoryrelated processes to discern which brain networks are involved and how they adaptively reorganize with learning. Results from the present study (Figure 4-figure supplement 2) showing hybrid-space decoder prediction accuracies exceeding 74% for temporal windows spanning as little as 25ms and located up to 100ms prior to the KeyDown event strongly support the feasibility of such an approach." 

      I recommend that the authors add this paragraph or a paragraph like this to the Discussion. This perspective is very important and still missing in the revised manuscript. 

      We now included in the manuscript the following sections addressing this point:

      Discussion (lines 334-338)

      “The main findings of this study during which subjects engaged in a naturalistic, self-paced task were that individual sequence action representations differentiate during early skill learning in a manner reflecting the local sequence context in which they were performed, and that the degree of representational differentiation— particularly prominent over rest intervals—correlated with skill gains. “

      Discussion (lines 428-434)

      “In this study, classifiers were trained on MEG activity recorded during or immediately after each keypress, emphasizing neural representations related to action execution, memory consolidation and recall over those related to planning. An important direction for future research is determining whether separate decoders can be developed to distinguish the representations or networks separately supporting these processes. Ongoing work in our lab is addressing this question. The present accuracy results across varied decoding window durations and alignment with each keypress action support the feasibility of this approach (Figure 3—figure supplement 5).”

      "The rapid initial skill gains that characterize early learning are followed by micro-scale fluctuations around skill plateau levels (i.e. following trial 11 in Figure 1B)"  Is this a mention of Figure 1 Supplement 1 A?  

      The sentence was replaced with the following: Results (lines 108-110)

      “Participants reached 95% of maximal skill (i.e. - Early Learning) within the initial 11 practice trials (Figure 1B), with improvements developing over inter-practice rest periods (micro-offline gains) accounting for almost all total learning across participants (Figure 1B, inset) [1].”

      The citation below seems to have been selected by mistake; 

      "9. Chen, S. & Epps, J. Using task-induced pupil diameter and blink rate to infer cognitive load. Hum Comput Interact 29, 390-413 (2014)." 

      We thank the Reviewer for bringing this mistake to our attention. This citation has now been corrected.

      Reviewer #3 (Recommendations for the authors):  

      The authors write in their response that "We now provide additional details in the Methods of the revised manuscript pertaining to the parcellation procedure and how the sign ambiguity problem was addressed in our analysis." I could not find anything along these lines in the (redlined) version of the manuscript and therefore did not change the corresponding comment in the public review.  

      The revised manuscript now provides a more detailed explanation of the parcellation, and sign-flipping procedure implemented:

      Methods (lines 604-611):

      “Source-space parcellation was carried out by averaging all voxel time-series located within distinct anatomical regions defined in the Desikan-Killiany Atlas [31]. Since source time-series estimated with beamforming approaches are inherently sign-ambiguous, a custom Matlab-based implementation of the mne.extract_label_time_course with “mean_flip” sign-flipping procedure in MNEPython [78] was applied prior to averaging to prevent within-parcel signal cancellation. All voxel time-series within each parcel were extracted and the timeseries sign was flipped at locations where the orientation difference was greater than 90° from the parcel mode. A mean time-series was then computed across all voxels within the parcel after sign-flipping.”

      The control analysis based on a multivariate regression that assessed whether the neural representation distance score could be predicted by the 4-1, 2-4 and 4-4 keypress transition times, as briefly mentioned in the authors' responses to Reviewer 2 and myself, was not included in the manuscript and could not be sufficiently evaluated. 

      This approach is now reported in the manuscript: Results (Lines 324-328) and Figure  5-Figure Supplement 6 legend.

      The authors argue that differences in the design between Das et al. (2024) on the one hand (Experiments 1 and 2), and the study by Bönstrup et al. (2019) on the other hand, may have prevented Das et al. (2024) from finding the assumed learning benefit by micro-offline consolidation. However, the Supplementary Material of Das et al. (2024) includes an experiment (Experiment S1) whose design closely follows a large proportion of the early learning phase of Bönstrup et al. (2019), and which, nevertheless, demonstrates that there is no lasting benefit of taking breaks with respect to the acquired skill level, despite the presence of micro-offline gains.  

      We thank the Reviewer for alerting us to this new data added to the revised supplementary materials of Das et al. (2024) posted to bioRxiv. However, despite the Reviewer’s claim to the contrary, a careful comparison between the Das et al and Bönstrup et al studies reveal more substantive differences than similarities and does not “closely follows a large proportion of the early learning phase of Bönstrup et al. (2019)” as stated. 

      In the Das et al. Experiment S1, sixty-two participants were randomly assigned to “with breaks” or “no breaks” skill training groups. The “with breaks” group alternated 10 seconds of skill sequence practice with 10 seconds of rest over seven trials (2 min and 2 sec total training duration). This amounts to 66.7% of the early learning period defined by Bönstrup et al. (2019) (i.e. - eleven 10-second long practice periods interleaved with ten 10-second long rest breaks; 3 min 30 sec total training duration). Also, please note that while no performance feedback nor reward was given in the Bönstrup et al. (2019) study, participants in the Das et al. study received explicit performance-based monetary rewards, a potentially crucial driver of differentiated behavior between the two studies:

      “Participants were incentivized with bonus money based on the total number of correct sequences completed throughout the experiment.”

      The “no breaks” group in the Das et al. study practiced the skill sequence for 70 continuous seconds. Both groups (despite one being labeled “no breaks”) follow training with a long 3-minute break (also note that since the “with breaks” group ends with 10 seconds of rest their break is actually longer), before finishing with a skill “test” over a continuous 50-second-long block. During the 70 seconds of training, the “with breaks” group shows more learning than the “no breaks” group. Interestingly, following the long 3minute break the “with breaks” group display a performance drop (relative to their performance at the end of training) that is stable over the full 50-second test, while the “no breaks” group shows an immediate performance improvement following the long break that continues to increase over the 50-second test.  

      Separately, there are important issues regarding the Das et al study that should be considered through the lens of recent findings not referred to in the preprint. A major element of their experimental design is that both groups—“with breaks” and “no breaks”— actually receive quite a long 3-minute break just before the skill test. This long break is more than 2.5x the cumulative interleaved rest experienced by the “with breaks” group. Thus, although the design is intended to contrast the presence or absence of rest “breaks”, that difference between groups is no longer maintained at the point of the skill test. 

      The Das et al results are most consistent with an alternative interpretation of the data— that the “no breaks” group experiences offline learning during their long 3-minute break. This is supported by the recent work of Griffin et al. (2025) where micro-array recordings from primary and premotor cortex were obtained from macaque monkeys while they performed blocks of ten continuous reaching sequences up to 81.4 seconds in duration (see source data for Extended Data Figure 1h) with 90 seconds of interleaved rest. Griffin et al. observed offline improvement in skill immediately following the rest break that was causally related to neural reactivations (i.e. – neural replay) that occurred during the rest break. Importantly, the highest density of reactivations was present in the very first 90second break between Blocks 1 and 2 (see Fig. 2f in Griffin et al., 2025). This supports the interpretation that both the “with breaks” and “no breaks” group express offline learning gains, with these gains being delayed in the “no breaks” group due to the practice schedule.

      On the other hand, if offline learning can occur during this longer break, then why would the “with breaks” group show no benefit? Again, it could be that most of the offline gains for this group were front-loaded during the seven shorter 10-second rest breaks. Another possible, though not mutually exclusive, explanation is that the observed drop in performance in the “with breaks” group is driven by contextual interference. Specifically, similar to Experiments 1 and 2 in Das et al. (2024), the skill test is conducted under very different conditions than those which the “with breaks” group practiced the skill under (short bursts of practiced alternating with equally short breaks). On the other hand, the “no breaks” group is tested (50 seconds of continuous practice) under quite similar conditions to their training schedule (70 seconds of continuous practice). Thus, it is possible that this dissimilarity between training and test could lead to reduced performance in the “with breaks” group.

      We made the following manuscript revisions related to these important issues: 

      Introduction (Lines 26-56)

      “Practicing a new motor skill elicits rapid performance improvements (early learning) [1] that precede skill performance plateaus [5]. Skill gains during early learning accumulate over rest periods (micro-offline) interspersed with practice [1, 6-10], and are up to four times larger than offline performance improvements reported following overnight sleep [1]. During this initial interval of prominent learning, retroactive interference immediately following each practice interval reduces learning rates relative to interference after passage of time, consistent with stabilization of the motor memory [11]. Micro-offline gains observed during early learning are reproducible [7, 10-13] and are similar in magnitude even when practice periods are reduced by half to 5 seconds in length, thereby confirming that they are not merely a result of recovery from performance fatigue [11]. Additionally, they are unaffected by the random termination of practice periods, which eliminates the possibility of predictive motor slowing as a contributing factor [11]. Collectively, these behavioral findings point towards the interpretation that microoffline gains during early learning represent a form of memory consolidation [1]. 

      This interpretation has been further supported by brain imaging and electrophysiological studies linking known memory-related networks and consolidation mechanisms to rapid offline performance improvements. In humans, the rate of hippocampo-neocortical neural replay predicts micro-offline gains [6]. Consistent with these findings, Chen et al. [12] and Sjøgård et al. [13] furnished direct evidence from intracranial human EEG studies, demonstrating a connection between the density of hippocampal sharp-wave ripples (80-120 Hz)—recognized markers of neural replay—and micro-offline gains during early learning. Further, Griffin et al. reported that neural replay of task-related ensembles in the motor cortex of macaques during brief rest periods— akin to those observed in humans [1, 6-8, 14]—are not merely correlated with, but are causal drivers of micro-offline learning [15]. Specifically, the same reach directions that were replayed the most during rest breaks showed the greatest reduction in path length (i.e. – more efficient movement path between two locations in the reach sequence) during subsequent trials, while stimulation applied during rest intervals preceding performance plateau reduced reactivation rates and virtually abolished micro-offline gains [15]. Thus, converging evidence in humans and non-human primates across indirect non-invasive and direct invasive recording techniques link hippocampal activity, neural replay dynamics and offline skill gains in early motor learning that precede performance plateau.”

      Next, in the Methods, we articulate important constraints formulated by Pan and Rickard (2015) and Bönstrup et al. (2019) for meaningful measurements:

      Methods (Lines 493-499)

      “The study design followed specific recommendations by Pan and Rickard (2015): 1) utilizing 10-second practice trials and 2) constraining analysis of micro-offline gains to early learning trials (where performance monotonically increases and 95% of overall performance gains occur) that precede the emergence of “scalloped” performance dynamics strongly linked to reactive inhibition effects ([29, 72]). This is precisely the portion of the learning curve Pan and Rickard referred to when they stated “…rapid learning during that period masks any reactive inhibition effect” [29].”

      We finally discuss the implications of neglecting some or all of these recommendations:

      Discussion (Lines 444-452):

      “Finally, caution should be exercised when extrapolating findings during early skill learning, a period of steep performance improvements, to findings reported after insufficient practice [67], post-plateau performance periods [68], or non-learning situations (e.g. performance of non-repeating keypress sequences in  [67]) when reactive inhibition or contextual interference effects are prominent. Ultimately, it will be important to develop new paradigms allowing one to independently estimate the different coincident or antagonistic features (e.g. - memory consolidation, planning, working memory and reactive inhibition) contributing to micro-online and micro-offline gains during and after early skill learning within a unifying framework.”

      Personally, given that the idea of (micro-offline) consolidation seems to attract a lot of interest (and therefore cause a lot of future effort/cost public money) in the scientific community, I would find it extremely important to be cautious in interpreting results in this field. For me, this would include abstaining from the claim that processes occur "during" a rest period (see abstract, for example), given that micro-offline gains (as well as offline contextualization) are computed from data obtained during practice, not rest, and may, thus, just as well reflect a change that occurs "online", e.g., at the very onset of practice (like pre-planning) or throughout practice (like fatigue, or reactive inhibition). In addition, I would suggest to discuss in more depth the actual evidence not only in favour, but also against, the assumption of micro-offline gains as a phenomenon of learning.  

      We agree with the reviewer that caution is warranted. Based upon these suggestions, we have now expanded the manuscript to very clearly define the experimental constraints under which different groups have successfully studied micro-offline learning and its mechanisms, the impact of fatigue/reactive inhibition on micro-offline performance changes unrelated to learning, as well as the interpretation problems that emerge when those recommendations are not followed. 

      We clearly articulate the crucial constrains recommended by Pan and Rickard (2015) and Bönstrup et al. (2019) for meaningful measurements and interpretation of offline gains in the revised manuscript. 

      Methods (Lines 493-499)

      “The study design followed specific recommendations by Pan and Rickard (2015): 1) utilizing 10-second practice trials and 2) constraining analysis of micro-offline gains to early learning trials (where performance monotonically increases and 95% of overall performance gains occur) that precede the emergence of “scalloped” performance dynamics strongly linked to reactive inhibition effects ( [29, 72]). This is precisely the portion of the learning curve Pan and Rickard referred to when they stated “…rapid learning during that period masks any reactive inhibition effect” [29].”

      In the Introduction, we review the extensive evidence emerging from LFP and microelectrode recordings in humans and monkeys (including causality of neural replay with respect to micro-offline gains and early learning in the Griffin et al. Nature 2025 publication):

      Introduction (Lines 26-56)

      “Practicing a new motor skill elicits rapid performance improvements (early learning) [1] that precede skill performance plateaus [5]. Skill gains during early learning accumulate over rest periods (micro-offline) interspersed with practice [1, 6-10], and are up to four times larger than offline performance improvements reported following overnight sleep [1]. During this initial interval of prominent learning, retroactive interference immediately following each practice interval reduces learning rates relative to interference after passage of time, consistent with stabilization of the motor memory [11]. Micro-offline gains observed during early learning are reproducible [7, 10-13] and are similar in magnitude even when practice periods are reduced by half to 5 seconds in length, thereby confirming that they are not merely a result of recovery from performance fatigue [11]. Additionally, they are unaffected by the random termination of practice periods, which eliminates the possibility of predictive motor slowing as a contributing factor [11]. Collectively, these behavioral findings point towards the interpretation that microoffline gains during early learning represent a form of memory consolidation [1]. 

      This interpretation has been further supported by brain imaging and electrophysiological studies linking known memory-related networks and consolidation mechanisms to rapid offline performance improvements. In humans, the rate of hippocampo-neocortical neural replay predicts micro-offline gains [6]. Consistent with these findings, Chen et al. [12] and Sjøgård et al. [13] furnished direct evidence from intracranial human EEG studies, demonstrating a connection between the density of hippocampal sharp-wave ripples (80-120 Hz)—recognized markers of neural replay—and micro-offline gains during early learning. Further, Griffin et al. reported that neural replay of task-related ensembles in the motor cortex of macaques during brief rest periods— akin to those observed in humans [1, 6-8, 14]—are not merely correlated with, but are causal drivers of micro-offline learning [15]. Specifically, the same reach directions that were replayed the most during rest breaks showed the greatest reduction in path length (i.e. – more efficient movement path between two locations in the reach sequence) during subsequent trials, while stimulation applied during rest intervals preceding performance plateau reduced reactivation rates and virtually abolished micro-offline gains [15]. Thus, converging evidence in humans and non-human primates across indirect non-invasive and direct invasive recording techniques link hippocampal activity, neural replay dynamics and offline skill gains in early motor learning that precede performance plateau.”

      Following the reviewer’s advice, we have expanded our discussion in the revised manuscript of alternative hypotheses put forward in the literature and call for caution when extrapolating results across studies with fundamental differences in design (e.g. – different practice and rest durations, or presence/absence of extrinsic reward, etc). 

      Discussion (Lines 444-452):

      “Finally, caution should be exercised when extrapolating findings during early skill learning, a period of steep performance improvements, to findings reported after insufficient practice [67], post-plateau performance periods [68], or non-learning situations (e.g. performance of non-repeating keypress sequences in  [67]) when reactive inhibition or contextual interference effects are prominent. Ultimately, it will be important to develop new paradigms allowing one to independently estimate the different coincident or antagonistic features (e.g. - memory consolidation, planning, working memory and reactive inhibition) contributing to micro-online and micro-offline gains during and after early skill learning within a unifying framework.”

      References

      (1) Zimerman, M., et al., Disrupting the Ipsilateral Motor Cortex Interferes with Training of a Complex Motor Task in Older Adults. Cereb Cortex, 2012.

      (2) Waters, S., T. Wiestler, and J. Diedrichsen, Cooperation Not Competition: Bihemispheric tDCS and fMRI Show Role for Ipsilateral Hemisphere in Motor Learning. J Neurosci, 2017. 37(31): p. 7500-7512.

      (3) Sawamura, D., et al., Acquisition of chopstick-operation skills with the nondominant hand and concomitant changes in brain activity. Sci Rep, 2019. 9(1): p. 20397.

      (4) Lee, S.H., S.H. Jin, and J. An, The dieerence in cortical activation pattern for complex motor skills: A functional near- infrared spectroscopy study. Sci Rep, 2019. 9(1): p. 14066.

      (5) Grafton, S.T., E. Hazeltine, and R.B. Ivry, Motor sequence learning with the nondominant left hand. A PET functional imaging study. Exp Brain Res, 2002. 146(3): p. 369-78.

      (6) Buch, E.R., et al., Consolidation of human skill linked to waking hippocamponeocortical replay. Cell Rep, 2021. 35(10): p. 109193.

      (7) Wang, L. and S. Jiang, A feature selection method via analysis of relevance, redundancy, and interaction, in Expert Systems with Applications, Elsevier, Editor. 2021.

      (8) Yu, L. and H. Liu, Eeicient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 2004. 5: p. 1205-1224.

      (9) Munn, B.R., et al., Multiscale organization of neuronal activity unifies scaledependent theories of brain function. Cell, 2024.

      (10) Borragan, G., et al., Sleep and memory consolidation: motor performance and proactive interference eeects in sequence learning. Brain Cogn, 2015. 95: p. 54-61.

      (11) Landry, S., C. Anderson, and R. Conduit, The eeects of sleep, wake activity and timeon-task on oeline motor sequence learning. Neurobiol Learn Mem, 2016. 127: p. 5663.

      (12) Gabitov, E., et al., Susceptibility of consolidated procedural memory to interference is independent of its active task-based retrieval. PLoS One, 2019. 14(1): p. e0210876.

      (13) Pan, S.C. and T.C. Rickard, Sleep and motor learning: Is there room for consolidation? Psychol Bull, 2015. 141(4): p. 812-34.

      (14) , M., et al., A Rapid Form of Oeline Consolidation in Skill Learning. Curr Biol, 2019. 29(8): p. 1346-1351 e4.

      (15) Gupta, M.W. and T.C. Rickard, Comparison of online, oeline, and hybrid hypotheses of motor sequence learning using a quantitative model that incorporate reactive inhibition. Sci Rep, 2024. 14(1): p. 4661.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public Review):

      Summary:

      This paper explores how diverse forms of inhibition impact firing rates in models for cortical circuits. In particular, the paper studies how the network operating point affects the balance of direct inhibition from SOM inhibitory neurons to pyramidal cells, and disinhibition from SOM inhibitory input to PV inhibitory neurons. This is an important issue as these two inhibitory pathways have largely been studies in isolation. Support for the main conclusions is generally solid, but could be strengthened by additional analyses.

      Strengths

      The paper has improved in revision, and the new intuitive summary statements added to the end of each results section are quite helpful. Weaknesses

      The concern about whether the results hold outside of the range in which neural responses are linear remains. This is particularly true given the discontinuity observed in the stability measure. I appreciate the concern (provided in the response to the first round of reviews) that studying nonlinear networks requires a lot of work. A more limited undertaking would be to test the behavior of a spiking network at a few key points identified by your linearization approach. Such tests could use relatively simple (and perhaps imperfect) measures of gain and stability. This could substantially enhance the paper, regardless of the outcome.

      We appreciate the reviewer’s concern and in our resubmission we explore if networks dynamics that operate outside of the case where linearization is possible would continue to show our main result on the (dis)entanglement of stability and gain; the short answer is yes. To this end we have added a new section and Figure to our main text.

      “Gain and stability in stochastically forced E – PV – SOM circuits

      To confirm that our results do not depend on our approach of a linearization around a fixed point, we numerically simulate similar networks as shown above (Figure 2) in which the E and PV population receive slow varying, large amplitude noise (Figure 6A). This leads to noisy rate dynamics sampling a large subspace of the full firing rate grid (r<sub>E</sub>,r<sub>P</sub>) and thus any linearization would fail to describe the network response. In this stochastically forced network we explore how adding an SOM modulation or a stimulus affects this subspace (Figure 6B). To quantify stability without linearization, we assume that a network is more stable the lower the mean and variance of E rates. This is because very stable networks can better quench input fluctuations [Kanashiro et al., 2017; Hennequin et al., 2018]. To quantify gain, we calculate the change in E rates when adding the stimulus, yet having identical noise realizations for stimulated and non-stimulated networks (Methods).

      For the disinhibitory network without feedback a positive SOM modulation decreases stability due to increases of the mean and variance of E rates (Figure 6Ci) while the network gain increases (Figure 6Cii). As seen before (Figure 2A,B), stability and gain change in opposite directions in a disinhibitory circuit without feedback. Adding feedback PV → SOM and applying a negative SOM modulation increases both, stability and gain and therefore disentangles the inverse relation also in a noisy circuit (Figure 6D-F). This gives numerical support that our results do not depend on the assumption of linearization.

      “Methods: Noisy input and numerical measurement of stability and gain

      We consider a temporally smoothed input process ξ<sub>X</sub> with white noise ζ (zero mean, standard deviation one): for populations X ∈{E,P} with timescale τ<sub>ξ</sub> = 50ms, σ<sub>X</sub> \= 6 and fixed mean input IX. To quantify the stability of the network without linearization, we assume that a network is more stable if the mean and variance of excitatory rates are low. To quantify network gain, we freeze the white noise process ζ for the case of with and without stimulus presentation and calculate the difference of E rates at each time point, leading to a distribution of network gains (Figure 6Cii,Fii). Total simulation time is 1000 seconds.”

      We decided against using a spiking network because sufficiently asynchronous spiking network dynamics can still obey a linearized mean field theory (if the fluctuations in population firing rates are small). In our new analysis the firing rate deviations from the time averaged firing rate are sizable, making a linearization ineffective.

      In summary, based on our additional analysis of recurrent circuits with noisy inputs we conclude that our results also hold in fluctuating networks, without the need of assuming realization aroud a stable fixed point.

      Reviewer #2 (Public Review):

      Summary:

      Bos and colleagues address the important question of how two major inhibitory interneuron classes in the neocortex differentially affect cortical dynamics. They address this question by studying Wilson-Cowan-type mathematical models. Using a linearized fixed point approach, they provide convincing evidence that the existence of multiple interneuron classes can explain the counterintuitive finding that inhibitory modulation can increase the gain of the excitatory cell population while also increasing the stability of the circuit’s state to minor perturbations. This effect depends on the connection strengths within their circuit model, providing valuable guidance as to when and why it arises.

      Overall, I find this study to have substantial merit. I have some suggestions on how to improve the clarity and completeness of the paper.

      Strengths:

      (1) The thorough investigation of how changes in the connectivity structure affect the gain-stability relationship is a major strength of this work. It provides an opportunity to understand when and why gain and stability will or will not both increase together. It also provides a nice bridge to the experimental literature, where different gain-stability relationships are reported from different studies.

      (2) The simplified and abstracted mathematical model has the benefit of facilitating our understanding of this puzzling phenomenon. (I have some suggestions for how the authors could push this understanding further.) It is not easy to find the right balance between biologically-detailed models vs simple but mathematically tractable ones, and I think the authors struck an excellent balance in this study.

      We thank the reviewer for their support of our work.

      Weaknesses:

      (1) The fixed-point analysis has potentially substantial limitations for understanding cortical computations away from the steady-state. I think the authors should have emphasized this limitation more strongly and possibly included some additional analyses to show that their conclusions extend to the chaotic dynamical regimes in which cortical circuits often live.

      In the response to reviewer 1 we have included model analyses that addresses the limitations of linearization. Rather than use a chaotic model, which would require significant effort, we opted for a stochastically forced network, where the sizable fluctuations in rate dynamics preclude linearization.

      (2) The authors could have discussed – even somewhat speculatively – how VIP interneurons fit into this picture. Their absence from this modelling framework stands out as a missed opportunity.

      We agree that including VIP neurons into the framework would be an obvious and potentially interesting next step. At this point we only include them as potential modulators of SOM neurons. Modeling their dynamics without them receiving inputs from E, PV, or SOM neurons would be uninteresting. However, including them properly into the circuit would be outside the scope of the paper.

      (3) The analysis is limited to paths within this simple E, PV, SOM circuit. This misses more extended paths (like thalamocortical loops) that involve interactions between multiple brain areas. Including those paths in the expansion in Eqs. 11-14 (Fig. 1C) may be an important consideration.

      We agree that our pathway expansion can be used to study more than just the E – PV – SOM circuit. However, properly investigating full thalamocortcial loops should be done in a subsequent study.

      Comments on revisions:

      I think the authors have done a reasonable job of responding to my critiques, and the paper is in pretty good shape. (Also, thanks for correctly inferring that I meant VIP interneurons when I had written SST in my review! I have updated the public review accordingly.)

      I still think this line of research would benefit substantially from considering dynamic regimes including chaotic ones. I strongly encourage the authors to consider such an extension in future work.

      Please see our response above to Reviewer 1.

      Reviewer #3 (Public Review):

      Summary:

      Bos et al study a computational model of cortical circuits with excitatory (E) and two subtypes of inhibition parvalbumin (PV) and somatostatin (SOM) expressing interneurons. They perform stability and gain analysis of simplified models with nonlinear transfer functions when SOM neurons are perturbed. Their analysis suggests that in a specific setup of connectivity, instability and gain can be untangled, such that SOM modulation leads to both increases in stability and gain, in contrast to the typical direction in neuronal networks where increased gain results in decreased stability.

      Strengths:

      - Analysis of the canonical circuit in response to SOM perturbations. Through numerical simulations and mathematical analysis, the authors have provided a rather comprehensive picture of how SOM modulation may affect response changes.

      - Shedding light on two opposing circuit motifs involved in the canonical E-PV-SOM circuitry - namely, direct inhibition (SOM -¿ E) vs disinhibition (SOM -¿ PV -¿ E). These two pathways can lead to opposing effects, and it is often difficult to predict which one results from modulating SOM neurons. In simplified circuits, the authors show how these two motifs can emerge and depend on parameters like connection weights.

      - Suggesting potentially interesting consequences for cortical computation. The authors suggest that certain regimes of connectivity may lead to untangling of stability and gain, such that increases in network gain are not compromised by decreasing stability. They also link SOM modulation in different connectivity regimes to versatile computations in visual processing in simple models.

      We thank the reviewer for their support of our work.

      Weaknesses

      Computationally, the analysis is solid, but it’s very similar to previous studies (del Molino et al, 2017). Many studies in the past few years have done the perturbation analysis of a similar circuitry with or without nonlinear transfer functions (some of them listed in the references). This study applies the same framework to SOM perturbations, which is a useful computational analysis, in view of the complexity of the high-dimensional parameter space.

      Link to biology: the most interesting result of the paper with regard to biology is the suggestion of a regime in which gain and stability can be modulated in an unconventional way - however, it is difficult to link the results to biological networks:

      - A general weakness of the paper is a lack of direct comparison to biological parameters or experiments. How different experiments can be reconciled by the results obtained here, and what new circuit mechanisms can be revealed? In its current form, the paper reads as a general suggestion that different combinations of gain modulation and stability can be achieved in a circuit model equipped with many parameters (12 parameters). This is potentially interesting but not surprising, given the high dimensional space of possible dynamical properties. A more interesting result would have been to relate this to biology, by providing reasoning why it might be relevant to certain circuits (and not others), or to provide some predictions or postdictions, which are currently missing in the manuscript.

      - For instance, a nice motivation for the paper at the beginning of the Results section is the different results of SOM modulation in different experiments - especially between L23 (inhibition) and L4 (disinhibition). But no further explanation is provided for why such a difference should exist, in view of their results and the insights obtained from their suggested circuit mechanisms. How the parameters identified for the two regimes correspond to different properties of different layers?

      Please see our answer to the previous round of revision.

      - One of the key assumptions of the model is nonlinear transfer functions for all neuron types. In terms of modelling and computational analysis, a thorough analysis of how and when this is necessary is missing (an analysis similar to what has been attempted in Figure 6 for synaptic weights, but for cellular gains). A discussion of this, along with the former analysis to know which nonlinearities would be necessary for the results, is needed, but currently missing from the study. The nonlinearity is assumed for all subtypes because it seems to be needed to obtain the results, but it’s not clear how the model would behave in the presence or absence of them, and whether they are relevant to biological networks with inhibitory transfer functions.

      Please see our answer to the previous round of revision.

      - Tuning curves are simulated for an individual orientation (same for all), not considering the heterogeneity of neuronal networks with multiple orientation selectivity (and other visual features) - making the model too simplistic.

      Please see our answer to the previous round of revision.

      Reviewer #1 (Recommendations For The Authors):

      Introduction, first paragraph, last sentence: suggest ”sense,” -¿ ”sense” (no comma)

      Introduction, second paragraph, first sentence: suggest ”is been” -¿ ”has been”

      Introduction, very end of next to last paragraph: clarify ”modulate the circuit”

      Figure 1 legend: can you make the ”Change ...” in the legend for 1D clearer - e.g. ”strenghen SOM → E connections and eliminate SOM → P connections”.

      Paragraph immediately below Figure 1: In sentence starting ”Specifically ...” can you relate the cases described here back to the equation in Figure 1C?

      Sentence right below equation 2: This sentence does not separate the network gain from the cellular gain as clearly as it could.

      Page 7, second full paragraph: sentence starting ”Therefore, with ...” could be split into two or otherwise made clearer.

      Sentence starting ”Furthermore” right below Figure 5 has an extra comma

      We thank the reviewer for their additional comments, we made the respective changes in the manuscript.

      Reviewer #3 (Recommendations For The Authors):

      There is a long part in the reply letter discussing the link to biology - but the revised manuscript doesn’t seem to reflect that.

      The information in the reply letter discussing the link to biology has been added at multiple points in the discussion. In the section ‘decision of labor between PV and SOM neurons’ we mention Ferguson and Carding 2020, in the section ‘impact of SOM neuron modulation on tuning curves’ we discuss Phillups and Hasenstaub 2016, and in the section ‘limitations and future directions’ we mention Tobin et al., 2023.

      The writing can be improved - for example, see below instances:

      P. 7: Intuitively, the inverse relationship follows for inhibitory and disinhibitory pathways (and their mixture) because the firing rate grid (heatmap) does not depend on how the SOM neurons inhibit the E - PV circuit.

      P.8: We first remark that by adding feedback E connections onto SOM neurons, changes in SOM rates can now affect the underlying heatmaps in the (rE, rP) grid.

      Not clear how ”rates can affect the heatmaps”. It’s too colloquial and not scientifically rigorous or sound.

      We added further explanations at the respective places in the manuscript to improve the writing.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      We appreciate the constructive comments made by the editor and the reviewers. We have corrected errors and provided additional experimental data and analysis to address the latest criticisms raised by the reviewers and provided point-by-point response to the reviewers as below.

      Reviewer #1 (Recommendations For The Authors):

      I do acknowledge the work the authors put into this manuscript and I can accept the fact that the authors decided on a minimum of additional experiments. However, I would recommend the authors to be more concise by adding more information in the method and result sections about how they performed their experiments such as which Nav and AMPAR DNA constructs they used, the age of the mice, how long time they exposed the patches to quinidine, information on how many times they repeated their pull downs etc.

      Answer: We thank the reviewer’s comments. we have incorporated the suggested modifications into our revised manuscript. Specifically, we have included detailed information on the NaV and AMPAR constructs in the Methods section. The age of the homozygous NaV1.6 knockout mice and the wild-type littermate controls is postnatal (P0-P1) (see in Results and Methods section). Prior to the application of step pulses, cells were subjected to the bath solution containing quinidine for approximately one minute (see in Methods section). Additionally, the co-immunoprecipitation assays for Slack and NaV1.6 were repeated three times (see in Methods section).

      Minor detail in line 263: "...KCNT1 (Slack) have been identified to related to seizure..." I guess this should have been "...KCNT1 (Slack) have been identified and related to seizure..."?

      Answer: We thank the reviewer for raising this point. We have corrected it in the revised manuscript.

      Also, and again minor detail, I had a comment about the color coding in Fig 4 and by mistake, I added 4B, but I meant the use of colors in the entire figure, and mainly the use of colors in 4C, G and I.

      Answer: We apologize for the confusion. We have changed the color coding of Figure 4 in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      While the paper is improved, several concerns do not seem to have been addressed. Some may have been missed because there is no response at all, but others may have been unclear because the response does not address the concern, but a related issue. Details are below.

      Answer: We thank the reviewer for the criticisms. We have made changes of our manuscript to address the concerns.

      Original issue:

      3) Remove the term in vivo.

      Answer: We thank the reviewer for raising this point. In our experiments, although we did not conduct experiments directly in living organisms, our results demonstrated the coimmunoprecipitation of NaV1.6 with Slack in homogenates from mouse cortical and hippocampal tissues (Fig. 3C). This result may support that the interaction between Slack and NaV1.6 occurs in vivo.

      New comment from reviewer:

      The argument to use the term in vivo is not well supported by what the authors have said. Just because tissues are used from an animal does not mean experiments were conducted in vivo. As the authors say, they did not conduct experiments in living organisms. Therefore the term in vivo should be avoided. This is a minor point.

      Answer: We thank the reviewer for pointing this out. We have removed the term “in vivo” in the revised manuscript.

      Original:

      4) Figure 1C Why does Nav1.2 have a small inward current before the large inward current in the inset?

      Answer: We apologize for the confusion. We would like to clarify that the small inward current can be attributed to the current of membrane capacitance (slow capacitance or C-slow). The larger inward current is mediated by NaV1.2.

      New comment:

      This is not well argued. Please note why the authors know the current is due to capacitance. Also, how do they know the larger current is due to NaV1.2? Please add that to the paper so readers know too.

      Answer: We thank the reviewer’s comment. To provide a clearer representation of NaV1.2mediated currents in Fig. 1C, we have replaced the original example trace with a new one in which only one inward current is observed.

      Original:

      The slope of the rising phase of the larger sodium current seems greater than Nav1.6 or Nav1.5. Was this examined?

      Answer: Additionally, we did not compare the slope of the rising phase of NaV subtypes sodium currents but primarily focused on the current amplitudes.

      New comment:

      This is not a strong answer. There seems to be an effect that the authors do not mention and evidently did not quantify that argues against their conclusion, which weakens the presentation.

      Answer: We thank the reviewer’s comment. To assess the slope of the rising phase of NaV subtype currents, we compared the activation time constants of NaV1.2, NaV1.5, and NaV1.6 peak currents in HEK293 cells co-expressing NaV channel subtypes with Slack. The results have shown no significant differences (Author response image 1). We have included this analysis (see Fig. S9A) and the corresponding fitting equation (see in Methods section) in the revised manuscript.

      Author response image 1.

      The activation time constants of peak sodium currents in HEK293 cells co-expressing NaV1.2 (n=6), NaV1.5 (n=5), and NaV1.6 (n=5) with Slack, respectively. ns, p > 0.05, one-way ANOVA followed by Bonferroni’s post hoc test.

      Original:

      2D-E For Nav1.5 the sodium current is very large compared to Nav1.6. Is it possible the greater effect of quinidine for Nav1.6 is due to the lesser sodium current of Nav1.6?

      Answer: We thank the reviewer for raising this point. We would like to clarify that our results indicate that transient sodium currents contribute to the sensitization of Slack to quinidine blockade (Fig. 2C,E). Therefore, it is unlikely that the greater effect observed for NaV1.6 in sensitizing Slack is due to its lower sodium currents.

      New comment:

      I am not sure the question I was asking was clear. How can the authors discount the possibility that quinidine is more effective on NaV1.6 because the NaV1.6 current is relatively weak?

      Answer: We thank the reviewer for raising this point. We have examined the sodium current amplitudes of NaV1.5, NaV1.5/1.6 chimeras, and NaV1.6 upon co-expression of NaV with Slack. Our analysis revealed that there are no significant differences between NaV1.5 and NaV1.5/6N, with both exhibiting much larger current amplitudes compared to NaV1.6 (Author response image 2), but only NaV1.5/6N replicates the effect of NaV1.6 in sensitizing Slack to quinidine blockade (Fig. 4H-I), suggesting the observed differences between NaV1.5 and NaV1.6 in sensitizing Slack are unlikely to be attributed to NaV1.6's lower sodium currents but may instead involve NaV1.6's Nterminus-induced physical interaction. We have included this analysis in the revised manuscript (see Fig. S9B).

      Author response image 2.

      Comparison of peak sodium current amplitudes of NaV1.5 (n=9), NaV1.5/6NC (n=13), NaV1.5/6N (n=10), and NaV1.6 (n=8) upon co-expressed with Slack in HEK293 cells. ns, p > 0.05, * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001; one-way ANOVA followed by Bonferroni’s post hoc test.

      Original:

      The differences between WT and KO in G -H are hard to appreciate. Could quantification be shown? The text uses words like "block" but this is not clear from the figure. It seems that the replacement of Na+ with Li+ did not block the outward current or effect of quinidine.

      Answer: We apologize for the confusion. We would like to clarify the methods used in this experiment. The lithium ion (Li+) is a much weaker activator of sodium-activated potassium channel Slack than sodium ion (Na+)1,2.

      1. Zhang Z, Rosenhouse-Dantsker A, Tang QY, Noskov S, Logothetis DE. The RCK2 domain uses a coordination site present in Kir channels to confer sodium sensitivity to Slo2.2 channels. J Neurosci. Jun 2 2010;30(22):7554-62. doi:10.1523/JNEUROSCI.0525-10.2010

      2. Kaczmarek LK. Slack, Slick and Sodium-Activated Potassium Channels. ISRN Neurosci. Apr 18 2013;2013(2013)doi:10.1155/2013/354262 Therefore, we replaced Na+ with Li+ in the bath solution to measure the current amplitudes of sodium-activated potassium currents (IKNa)3.

      3. Budelli G, Hage TA, Wei A, et al. Na+-activated K+ channels express a large delayed outward current in neurons during normal physiology. Nat Neurosci. Jun 2009;12(6):745-50. doi:10.1038/nn.2313

      The following equation was used for quantification:

      Furthermore, the remaining IKNa after application of 3 μM quinidine in the bath solution was measured as the following:

      The quantification results were presented in Fig. 1K. The term "block" used in the text referred to the inhibitory effect of quinidine on IKNa.

      New comment:

      The fact remains that the term "block" is too strong for an effect that is incomplete. Also, the authors should add to the paper that Li+ is a weaker activator, so the reader knows some of the caveats to the approach.

      Answer: We thank the reviewer for raising this point. We have added related citations and replaced the term “block” with “inhibit” in the revised manuscript.

      Original:

      1. In K, for the WT, why is the effect of quinidine only striking for the largest currents?

      Answer: We thank the reviewer for raising this point. After conducting an analysis, we found no correlation between the inhibitory effect of quinidine and the amplitudes of baseline IKNa in WT neurons (p = 0.6294) (Author response image 3). Therefore, the effect of quinidine is not solely limited to targeting the larger currents.

      Author response image 3.

      The correlation between the inhibitory effect of quinidine and the amplitudes of baseline IKNa in WT neurons (data from manuscript Fig. 1K). r = 0.1555, p=0.6294, Pearson correlation analysis.

      New comment:

      Please add this to the paper and the figure as Supplemental.

      Answer: We thank the reviewer for raising this point. We have added this figure as Fig.S3B in the revised manuscript.

      Original:

      5) Figure 2 A. The argument could be better made if the same concentration of quinidine were used for Slack and Slack + Nav1.6. It is recognized a greater sensitivity to quinidine is to be shown but as presented the figure is a bit confusing."

      Answer: We apologize for the confusion. We would like to clarify that the presented concentrations of quinidine were chosen to be near the IC50 values for Slack and Slack+NaV1.6.

      New comment:

      Please add this to the paper.

      Answer: We thank the reviewer for raising this point. We have added the clarification about the presented concentrations in the revised manuscript.

      Original:

      2C. Can the authors add the effect of quinidine to the condition where the prepulse potential was 90?"

      Answer: We apologize for the confusion. We would like to clarify that the condition of prepulse potential at -90 mV is the same as the condition in Fig. 1. We only changed one experiment condition where the prepulse potential was changed to -40 mV from -90 mV.

      New comment:

      There was no confusion. The authors should consider adding the condition where the prepulse potential was -90.

      Answer: We thank the reviewer for raising this point. We have added the clarification about the voltage condition in the revised manuscript (see in Fig. 2A caption).

      Original:

      2A. Clarify these 6 panels."

      Answer: We thank the reviewer for raising this point. We have clarified the captions of Fig. 3A in the revised manuscript.

      New comment: Clarification is needed. What is the blue? DAPI? What area of hippocamps? Please label cell layers. What area of cortex? Please label layers.

      Answer: We thank the reviewer for raising this point. We have included the clarification in the Figure caption.

      Original:

      Figure 7. The images need more clarity. They are very hard to see. Text is also hard to see."

      Answer: We apologize for the lack of clarity in the images and text. we would like to provide a concise summary of the key findings shown in this figure.

      Figure 7 illustrates an innovative intervention for treating SlackG269S-induced seizures in mice by disrupting the Slack-NaV1.6 interaction. Our results showed that blocking NaV1.6-mediated sodium influx significantly reduced Slack current amplitudes (Fig. 2D,G), suggesting that the Slack-NaV1.6 interaction contributes to the current amplitudes of epilepsy-related Slack mutant variants, aggravating the gain-of-function phenotype. Additionally, Slack’s C-terminus is involved in the Slack-NaV1.6 interaction (Fig. 5D). We assumed that overexpressing Slack’s C-terminus can disrupt the Slack-NaV1.6 interaction (compete with Slack) and thereby encounter the current amplitudes of epilepsy-related Slack mutant variants.

      In HEK293 cells, overexpression of Slack’s C-terminus indeed significantly reduced the current amplitudes of epilepsy-related SlackG288S and SlackR398Q upon co-expression with NaV1.5/6NC (Fig. 7A,B). Subsequently, we evaluated this intervention in an in vivo epilepsy model by introducing the Slack G269S variant into C57BL/6N mice using AAV injection, mimicking the human Slack mutation G288S that we previously identified (Fig. 7C-G).

      New comment:

      The images do not appear to have changed. Consider moving labels above the images so they can be distinguished better. Please label cell layers. Consider adding arrows to the point in the figure the authors want the reader to notice. The study design and timeline are unclear. What is (1) + (3), (2), etc.?

      Answer: We thank the reviewer for pointing this out. We have modified Figure 7 in the revised manuscript and included the cell layer information in the Figure caption.

      Original:

      It is not clear how data were obtained because injection of kainic acid does not lead to a convulsive seizure every 10 min for several hours, which is what appears to be shown. Individual seizures are just at the beginning and then they merge at the start of status epilepticus. After the onset of status epilepticus the animals twitch, have varied movements, sometime rear and fall, but there is not a return to normal behavior. Therefore one can not call them individual seizures. In some strains of mice, however, individual convulsive seizures do occur (even if the EEG shows status epilepticus is occurring) but there are rarely more than 5 over several hours and the graph has many more. Please explain."

      Answer: We apologize for the confusion. Regarding the data acquisition in relation to kainic acid injection, we initiated the timing following intraperitoneal injection of kainic acid and recorded the seizure scores of per mouse at ten-minute intervals, following the methodology described in previous studies4.

      1. Huang Z, Walker MC, Shah MM. Loss of dendritic HCN1 subunits enhances cortical excitability and epileptogenesis. J Neurosci. Sep 2 2009;29(35):10979-88. doi:10.1523/JNEUROSCI.1531-09.2009

      The seizure scores were determined using a modified Racine, Pinal, and Rovner scale5,6: (1) Facial movements; (2) head nodding; (3) forelimb clonus; (4) dorsal extension (rearing); (5) Loss of balance and falling; (6) Repeated rearing and failing; (7) Violent jumping and running; (8) Stage 7 with periods of tonus; (9) Dead.

      1. Pinel JP, Rovner LI. Electrode placement and kindling-induced experimental epilepsy. Exp Neurol. Jan 15 1978;58(2):335-46. doi:10.1016/0014-4886(78)90145-0

      2. Racine RJ. Modification of seizure activity by electrical stimulation. II. Motor seizure. Electroencephalogr Clin Neurophysiol. Mar 1972;32(3):281-94. doi:10.1016/00134694(72)90177-0

      New comment:

      This was clear. Perhaps my question was not clear. The question is how one can count individual seizures if animals have continuous seizures. It seems like the authors did not consider or observe status epilepticus but individual seizures. If that is true the data are hard to believe because too many seizures were counted. Animals do not have nearly this many seizures after kainic acid.

      Answer: We appreciate the reviewer’s clarification. Our methodology involved assessing the maximum seizure scale during 10-minute intervals per mouse as previously described7, rather than counting individual seizures. For instance, a mouse exhibited the loss of balance and falling multiple times within 30-40 minute interval, we recorded the seizure scale as 5 for that time interval.

      1. Kim EC, Zhang J, Tang AY, et al. Spontaneous seizure and memory loss in mice expressing an epileptic encephalopathy variant in the calmodulin-binding domain of Kv7.2. Proc Natl Acad Sci U S A. Dec 21 2021;118(51)doi:10.1073/pnas.2021265118

      Reviewer #3 (Recommendations For The Authors):

      While the authors have improved the manuscript, several outstanding issues still need to be addressed. Some may have been missed because there is no response at all, but others may have been unclear.

      Answer: We thank the reviewer for the criticisms. We have added additional experimental data and analysis to address the concerns.

      Original issue from Public Review:

      1. Immunolabeling of the hippocampus CA1 suggests sodium channels as well as Slack colocalization with AnkG (Fig 3A). Proximity ligation assay for NaV1.6 and Slack or a super-resolution microscopy approach would be needed to increase confidence in the presented colocalization results. Furthermore, coimmunoprecipitation studies on the membrane fraction would bolster the functional relevance of NaV1.6-Slack interaction on the cell surface.

      Answer: We thank the reviewer for good suggestions. We acknowledge that employing proximity ligation assay and high-resolution techniques would significantly enhance our understanding of the localization of the Slack-NaV1.6 coupling.

      At present, the technical capabilities available in our laboratory and institution do not support highresolution testing. However, we are enthusiastic about exploring potential collaborations to address these questions in the future. Furthermore, we fully recognize the importance of conducting coimmunoprecipitation (Co-IP) assays from membrane fractions. While we have already completed Co-IP assays for total protein and quantified the FRET efficiency values between Slack and NaV1.6 in the membrane region, the Co-IP assays on membrane fractions will be conducted in our future investigations.

      New comment from reviewer: so far, the authors have not demonstrated that Nav1.6 and Slack interact on the cell surface.

      Answer: We thank the reviewer for pointing this out. We acknowledgement that our data did not directly demonstrate interaction between NaV1.6 and Slack on the cell surface and we have removed related terminology in the revised manuscript. Notably, our patch-clamp experiments in Fig. 2D,G and Fig. S10B showed a Na+-mediated membrane current coupling of Slack and NaV1.6. Additionally, the FRET efficiency values between Slack and NaV1.6 were quantified in the membrane region. These findings suggest that membrane-near Slack interacts with NaV1.6.

      1. Although hippocampal slices from Scn8a+/- were used for studies in Fig. S8, it is not clear whether Scn8a-/- or Scn8a+/- tissue was used in other studies (Fig 1J & 1K). It will be important to clarify whether genetic manipulation of NaV1.6 expression (Fig. 1K) has an impact on sodiumactivated potassium current, level of surface Slack expression, or that of NaV1.6 near Slack.

      Answer: We thank the reviewer for pointing this out. In Fig. 1G,J,K, primary cortical neurons from homozygous NaV1.6 knockout (Scn8a-/-) mice were used. We will clarify this information in the revised manuscript. In terms of the effects of genetic manipulation of NaV1.6 expression on IKNa and surface Slack expression, we compared the amplitudes of IKNa measured from homozygous NaV1.6 knockout (NaV1.6-KO) neurons and wild-type (WT) neurons. The results showed that homozygous knockout of NaV1.6 does not alter the amplitudes of IKNa (Author response image 4). The level of surface Slack expression will be tested further.

      Author response image 4.

      The amplitudes of IKNa in WT and NaV1.6-KO neurons (data from manuscript Fig. 1K). ns, p > 0.05, unpaired two-tailed Student’s t test.

      New comment from reviewer: The current version of the manuscrip>t does not contain these pertinent details and needs to be updated to include the information pertaining homozygous NaV1.6 knockouts. What age were these homozygous NaV1.6 knockout mice? These details need to be clearly stated in the manuscript.

      Answer: We thank the reviewer for pointing this out. We have included this analysis in the revised manuscript (see Fig. S3A). The age of homozygous NaV1.6 knockout mice are P0-P1 and we have added this detail in the revised manuscript.

      1. Did the epilepsy-related Slack mutations have an impact on NaV1.6-mediated sodium current?

      Answer: We thank the reviewer’s question. We examined the amplitudes of NaV1.6 sodium current upon expression alone or co-expression of NaV1.6 with epilepsy-related Slack mutations (K629N, R950Q, K985N). The results showed that the tested epilepsy-related Slack mutations do not alter the amplitudes of NaV1.6 sodium current (Author response image 5).

      Author response image 5.

      The amplitudes of NaV1.6 sodium currents upon co-expression of NaV1.6 with epilepsy-related Slack mutant variants (SlackK629N, SlackR950Q, and SlackK985N). ns, p>0.05, oneway ANOVA followed by Bonferroni’s post hoc test.

      New comment from reviewer: Figure with the functional effect of co-expression of NaV1.6 with epilepsy-related Slack mutations should be included in the revised manuscript

      Answer: We thank the reviewer for pointing this out. We have included this analysis in the revised manuscript (see Fig. S10A).

      Original issue from Recommendations For The Authors:

      1. A reference to homozygous knockout is made in the abstract; however, only heterozygous mice are mentioned in the methods section. The genotype of the mice needs to be made clear in the manuscript. Furthermore, at what age were these mice used in the study. Since homozygous knockout of NaV1.6 is lethal at a very young age (<4 wks), it would be important to clarify that point as well.

      Answer: We thank the reviewer for pointing this out. In the revised manuscript, we have included information about the source of the primary cortical neurons used in our study. These neurons were obtained from postnatal homozygous NaV1.6 knockout C3HeB/FeJ mice and their wild-type littermate controls.

      New comment from reviewer: The answer that postnatal homozygous NaV1.6 knockout C3HeB/FeJ mice were used is insufficient. What age were these mice? This needs to be clearly stated in the manuscript.

      Answer: We thank the reviewer for pointing this out. The postnatal homozygous NaV1.6 knockout C3HeB/FeJ mice and their wild-type littermate controls are in P0-P1. We have included this information in the revised manuscript.

      1. How long were the cells exposed to quinidine before the functional measurement were performed?

      Answer: We thank the reviewer for pointing this out. The cells were exposed to the bath solution with quinidine for about one minute before applying step pulses.

      New comment from reviewer: This needs to be clearly stated in the manuscript.

      Answer: We thank the reviewer for pointing this out. We have included this information in the revised manuscript (see in Methods section).

      1. In Fig. 6B-D, it is not clear to what extent co-expression of Slack mutants and NaV1.6 increases sodium-activated potassium current.

      Answer: We thank the reviewer for pointing this out. We notice that the current amplitudes of Slack mutants exhibit a considerable degree of variation, ranging from less than 1 nA to over 20 nA (n =58). To accurately measure the effects of NaV1.6 on increasing current amplitudes of Slack mutants, we plan to apply tetrodotoxin in the bath solution to block NaV1.6 sodium currents upon coexpression of Slack mutants with NaV1.6.

      New comment from reviewer: Were these experiments with TTX completed? If so, they should be added to the revised manuscript.

      Answer: We thank the reviewer for pointing this out. We compared the current amplitudes of epilepsy-related Slack mutant (SlackR950Q) before and after bath-application of 100 nM TTX upon co-expression with NaV1.6 in HEK293 cells. The results showed that bath-application of TTX significantly reduced the current amplitudes of SlackR950Q at +100 mV by nearly 40% (Author response image 6), suggesting NaV1.6 contributes to the current amplitudes of SlackR950Q. We have included this data in the revised manuscript (see Fig. S10B).

      Author response image 6.

      The current amplitudes of SlackR950Q before and after bath-application of 100 nM TTX upon co-expression with NaV1.6 in HEK293 cells (n=5). ***p < 0.001, Two-way repeated measures ANOVA followed by Bonferroni’s post hoc test.

      Additionally, we have corrected some errors in the methods and figure captions section:

      1. Line 513, bath solution “5 glucose” should be “10 glucose.”

      2. Figure 3A caption, the description “hippocampus CA1 (left) and neocortex (right)” was flipped and we have corrected it.

      References

      1. Zhang Z, Rosenhouse-Dantsker A, Tang QY, Noskov S, Logothetis DE. The RCK2 domain uses a coordination site present in Kir channels to confer sodium sensitivity to Slo2.2 channels. J Neurosci. Jun 2 2010;30(22):7554-62. doi:10.1523/JNEUROSCI.0525-10.2010

      2. Kaczmarek LK. Slack, Slick and Sodium-Activated Potassium Channels. ISRN Neurosci. Apr 18 2013;2013(2013)doi:10.1155/2013/354262

      3. Budelli G, Hage TA, Wei A, et al. Na+-activated K+ channels express a large delayed outward current in neurons during normal physiology. Nat Neurosci. Jun 2009;12(6):745-50. doi:10.1038/nn.2313

      4. Huang Z, Walker MC, Shah MM. Loss of dendritic HCN1 subunits enhances cortical excitability and epileptogenesis. J Neurosci. Sep 2 2009;29(35):10979-88. doi:10.1523/JNEUROSCI.1531-09.2009

      5. Pinel JP, Rovner LI. Electrode placement and kindling-induced experimental epilepsy. Exp Neurol. Jan 15 1978;58(2):335-46. doi:10.1016/0014-4886(78)90145-0

      6. Racine RJ. Modification of seizure activity by electrical stimulation. II. Motor seizure. Electroencephalogr Clin Neurophysiol. Mar 1972;32(3):281-94. doi:10.1016/0013-4694(72)90177-0

      7. Kim EC, Zhang J, Tang AY, et al. Spontaneous seizure and memory loss in mice expressing an epileptic encephalopathy variant in the calmodulin-binding domain of Kv7.2. Proc Natl Acad Sci U S A. Dec 21 2021;118(51)doi:10.1073/pnas.2021265118

    1. Author response:

      The following is the authors’ response to the previous reviews.

      eLife assessment

      In this important study, the authors report a novel measurement of the Escherichia coli chemotactic response and demonstrate that these bacteria display an attractant response to potassium, which is connected to intracellular pH level. Whilst the experiments are mostly convincing, there are some confounders regards pH changes and fluorescent proteins that remain to be addressed.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This paper shows that E. coli exhibits a chemotactic response to potassium by measuring both the motor response (using a bead assay) and the intracellular signaling response (CheY phosporylation level via FRET) to step changes in potassium concentration. They find increase in potassium concentration induces a considerable attractant response, with amplitude comparable to aspartate, and cells can quickly adapt (and generally over-adapt). The authors propose that the mechanism for potassium response is through modifying intracellular pH; they find both that potassium modifies pH and other pH modifiers induce similar attractant responses. It is also shown, using Tar- and Tsr-only mutants, that these two chemoreceptors respond to potassium differently. Tsr has a standard attractant response, while Tar has a biphasic response (repellent-like then attractant-like). Finally, the authors use computer simulations to study the swimming response of cells to a periodic potassium signal secreted from a biofilm and find a phase delay that depends on the period of oscillation.

      Strengths:

      The finding that E. coli can sense and adapt to potassium signals and the connection to intracellular pH is quite interesting and this work should stimulate future experimental and theoretical studies regarding the microscopic mechanisms governing this response. The evidence (from both the bead assay and FRET) that potassium induces an attractant response is convincing, as is the proposed mechanism involving modification of intracellular pH. The updated manuscript controls for the impact of pH on the fluorescent protein brightness that can bias the measured FRET signal. After correction the response amplitude and sharpness (hill coefficient) are comparable to conventional chemoattractants (e.g. aspartate), indicating the general mechanisms underlying the response may be similar. The authors suggest that the biphasic response of Tar mutants may be due to pH influencing the activity of other enzymes (CheA, CheR or CheB), which will be an interesting direction for future study.

      Weaknesses:

      The measured response may be biased by adaptation, especially for weak potassium signals. For other attractant stimuli, the response typically shows a low plateau before it recovers (adapts). In the case of potassium, the FRET signal does not have an obvious plateau following the stimuli of small potassium concentrations, perhaps due to the faster adaptation compared to other chemoattractants. It is possible cells have already partially adapted when the response reaches its minimum, so the measured response may be a slight underestimate of the true response. Mutants without adaptation enzymes appear to be sensitive to potassium only at much larger concentrations, where the pH significantly disrupts the FRET signal; more accurate measurements would require development of new mutants and/or measurement techniques.

      We acknowledge and appreciate the reviewer's concerns regarding the potential impact of adaptation on the measured response magnitude. We have estimated the effect of adaptation on the measured response magnitude. The half-time of adaptation at 30 mM KCl was measured to be approximately 80 s, corresponding to a time constant of t = 80/ln(2) = 115.4 s, which is significantly longer than the time required for medium exchange in the flow chamber (less than 10 s). Consequently, the relative effect of adaptation on the measured response magnitude should be less than 1-exp(-10/t) = 8.3%. Even for the fastest adaptation (at the lowest KCl concentration) we measured, the effect should be less than 20%, which is within experimental uncertainties. Nevertheless, we agree that developing new techniques to measure the dose-response curve more precisely would be beneficial.

      Reviewer #2 (Public Review):

      Zhang et al investigated the biophysical mechanism of potassium-mediated chemotactic behavior in E coli. Previously, it was reported by Humphries et al that the potassium waves from oscillating B subtilis biofilm attract P aeruginosa through chemotactic behavior of motile P aeruginosa cells. It was proposed that K+ waves alter PMF of P aeruginosa. However, the mechanism was this behaviour was not elusive. In this study, Zhang et al demonstrated that motile E coli cells accumulate in regions of high potassium levels. They found that this behavior is likely resulting from the chemotaxis signalling pathway, mediated by an elevation of intracellular pH. Overall, a solid body of evidence is provided to support the claims. However, the impacts of pH on the fluorescence proteins need to be better evaluated. In its current form, the evidence is insufficient to say that the fluoresce intensity ratio results from FRET. It may well be an artefact of pH change.

      The authors now carefully evaluated the impact of pH on their FRET sensor by examining the YFP and CFP fluorescence with no-receptor mutant. The authors used this data to correct the impact of pH on their FRET sensor. This is an improvement, but the mathematical operation of this correction needs clarification. This is particularly important because, looking at the data, it is not fully convincing if the correction was done properly. For instance, 3mM KCl gives 0.98 FRET signal both in Fig3 and FigS4, but there is almost no difference between blue and red lines in Fig 3. FigS4 is very informative, but it does not address the concern raised by both reviewers that FRET reporter may not be a reliable tool here due to pH change.

      We apologize for not making the correction process clear. We corrected the impact of pH on the original signals for both CFP and YFP channels by

      where and represent the pH-corrected and original PMT signal (CFP or YFP channel) from the moment of addition of L mM KCl to the moment of its removal, respectively, and  is the correction factor, which is the ratio of PMT signal post- to pre-KCl addition for the no-receptor mutant at L mM KCl, for CFP or YFP channel as shown Fig. S5. The pH-corrected FRET response is then calculated as the ratio of the pH-corrected YFP to the pH-corrected CFP signals, normalized by the pre-stimulus ratio.

      As shown in Author response image1, which represents the same data as Fig. 3A and Fig. S5A, the original normalized FRET responses to 3 mM KCl are 0.967 for the wild-type strain (Fig. 3) and 0.981 for the no-receptor strain (Fig. S5). The standard deviation of the FRET values under steady-state conditions is 0.003. Thus, the difference in responses between the wild-type and no-receptor strains is significant and clearly exceeds the standard deviation. The pH correction factors CpH at 3 mM KCl are 1.004 for the YFP signal and 1.016 for the CFP signal. Consequently, the pH-corrected FRET responses are 0.967´1.016/1.004=0.979 for the wild-type and 0.981´1.016/1.004=0.993 for the no-receptor strain. The reason the pH-corrected FRET response for the no-receptor strain is 0.993 instead of the expected 1.000 is that this value represents the lowest observed response rather than the average value for the FRET response.

      The detailed mathematical operation for correcting the pH impact has now been included in the “FRET assay” section of Materials and Methods.

      Author response image 1.

      Chemotactic response of the wild-type strain (A, HCB1288-pVS88) and the no-receptor strain (B, HCB1414-pVS88) to stepwise addition and removal of KCl. The blue solid line denotes the original normalized signal. Downward and upward arrows indicate the time points of addition and removal of 3 mM KCl, respectively. The horizontal red dashed line denotes the original normalized FRET response value to 3 mM KCl.

      The authors show the FRET data with both KCl and K2SO4, concluding that the chemotactic response mainly resulted from potassium ions. However, this was only measured by FRET. It would be more convincing if the motility assay in Fig1 is also performed with K2SO4. The authors did not address this point. In light of complications associated with the use of the FRET sensor, this experiment is more important.

      We thank the reviewer for the suggestion. We agree that additional confirmation with a motility assay is important. To address this, we have now measured the response of the motor rotational signal to 15 mM K2SO4 using the bead assay and compared it with the response to 30 mM KCl. The results are shown in Fig. S2. The response of motor CW bias to 15 mM K2SO4 exhibited an attractant response, characterized by a decreased CW bias upon the addition of K2SO4, followed by an over-adaptation that is qualitatively similar to the response to 30 mM KCl. However, there were notable differences in the adaptation time and the presence of an overshoot. Specifically, the adaptation time to K2SO4 was shorter compared to that for KCl, and there was a notable overshoot in the CW bias during the adaptation phase. These differences may have resulted from the weaker response to K2SO4 (Fig. S1B) and additional modifications due to CysZ-mediated cellular uptake of sulfate (Zhang et al., Biochimica et Biophysica Acta 1838,1809–1816 (2014)). The faster adaptation and overshoot complicated the chemotactic drift in the microfluidic assay as in Fig. 1, such that we were unable to observe a noticeable drift in a K2SO4 gradient under the same experimental conditions used for the KCl gradient.

      The response of motor rotational signal to 15 mM K2SO4 has been added to Fig. S2.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) The response curve and adaptation level/time in the main text (Fig. 4) should be replaced by the corrected counterparts (currently in Fig. S5). The current version is especially confusing because Fig. 6 shows the corrected response, but the difference from Fig. 4 is not mentioned.

      We thank the reviewer for the suggestion. We have now merged the results of the original Fig. S5 into Fig. 4.

      a. The discussion of the uncorrected response with small hill coefficient and potentially negative cooperativity was left in the text (lines 223-234), but the new measurements show this is not true for the actual response. This should be removed or significantly rephrased.

      We thank the reviewer for the suggestion. We have now removed the statement about potentially negative cooperativity and added the corrected results for the actual response.

      (2) It may be helpful to restate the definition of f_m in the methods (near Eq. 3-4).

      Thank you for the suggestion. We have now restated the definition of fm and fL below Eq. 3-4: “In the denominator on the right-hand side of Eq. 3, the two terms within the parentheses of exponential expression represent the methylation-dependent (fm) and ligand-dependent (fL) free energy, respectively.”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #2:

      (1) The use of two m<sup>5</sup>C reader proteins is likely a reason for the high number of edits introduced by the DRAM-Seq method. Both ALYREF and YBX1 are ubiquitous proteins with multiple roles in RNA metabolism including splicing and mRNA export. It is reasonable to assume that both ALYREF and YBX1 bind to many mRNAs that do not contain m<sup>5</sup>C.

      To substantiate the author's claim that ALYREF or YBX1 binds m<sup>5</sup>C-modified RNAs to an extent that would allow distinguishing its binding to non-modified RNAs from binding to m<sup>5</sup>C-modified RNAs, it would be recommended to provide data on the affinity of these, supposedly proven, m<sup>5</sup>C readers to non-modified versus m<sup>5</sup>C-modified RNAs. To do so, this reviewer suggests performing experiments as described in Slama et al., 2020 (doi: 10.1016/j.ymeth.2018.10.020). However, using dot blots like in so many published studies to show modification of a specific antibody or protein binding, is insufficient as an argument because no antibody, nor protein, encounters nanograms to micrograms of a specific RNA identity in a cell. This issue remains a major caveat in all studies using so-called RNA modification reader proteins as bait for detecting RNA modifications in epitranscriptomics research. It becomes a pertinent problem if used as a platform for base editing similar to the work presented in this manuscript.

      The authors have tried to address the point made by this reviewer. However, rather than performing an experiment with recombinant ALYREF-fusions and m<sup>5</sup>C-modified to unmodified RNA oligos for testing the enrichment factor of ALYREF in vitro, the authors resorted to citing two manuscripts. One manuscript is cited by everybody when it comes to ALYREF as m<sup>5</sup>C reader, however none of the experiments have been repeated by another laboratory. The other manuscript is reporting on YBX1 binding to m<sup>5</sup>C-containing RNA and mentions PAR-CLiP experiments with ALYREF, the details of which are nowhere to be found in doi: 10.1038/s41556-019-0361-y.<br /> Furthermore, the authors have added RNA pull-down assays that should substitute for the requested experiments. Interestingly, Figure S1E shows that ALYREF binds equally well to unmodified and m<sup>5</sup>C-modified RNA oligos, which contradicts doi:10.1038/cr.2017.55, and supports the conclusion that wild-type ALYREF is not specific m<sup>5</sup>C binder. The necessity of including always an overexpression of ALYREF-mut in parallel DRAM experiments, makes the developed method better controlled but not easy to handle (expression differences of the plasmid-driven proteins etc.)

      Thank you for pointing this out. First, we would like to correct our previous response: the binding ability of ALYREF to m<sup>5</sup>C-modified RNA was initially reported in doi: 10.1038/cr.2017.55, (and not in doi: 10.1038/s41556-019-0361-y), where it was observed through PAR-CLIP analysis that the K171 mutation weakens its binding affinity to m<sup>5</sup>C -modified RNA.

      Our previous experimental approach was not optimal: the protein concentration in the INPUT group was too high, leading to overexposure in the experimental group. Additionally, we did not conduct a quantitative analysis of the results at that time. In response to your suggestion, we performed RNA pull-down experiments with YBX1 and ALYREF, rather than with the pan-DRAM protein, to better validate and reproduce the previously reported findings. Our quantitative analysis revealed that both ALYREF and YBX1 exhibit a stronger affinity for m<sup>5</sup>C -modified RNAs. Furthermore, mutating the key amino acids involved in m<sup>5</sup>C recognition significantly reduced the binding affinity of both readers. These results align with previous studies (doi: 10.1038/cr.2017.55 and doi: 10.1038/s41556-019-0361-y), confirming that ALYREF and YBX1 are specific readers of m<sup>5</sup>C -modified RNAs. However, our detection system has certain limitations. Despite mutating the critical amino acids, both readers retained a weak binding affinity for m<sup>5</sup>C, suggesting that while the mutation helps reduce false positives, it is still challenging to precisely map the distribution of m<sup>5</sup>C modifications. To address this, we plan to further investigate the protein structure and function to obtain a more accurate m<sup>5</sup>C sequencing of the transcriptome in future studies. Accordingly, we have updated our results and conclusions in lines 294-299 and discuss these limitations in lines 109-114.

      In addition, while the m<sup>5</sup>C assay can be performed using only the DRAM system alone, comparing it with the DRAM<sup>mut</sup>C control enhances the accuracy of m<sup>5</sup>C region detection. To minimize the variations in transfection efficiency across experimental groups, it is recommended to use the same batch of transfections. This approach not only ensures more consistent results but also improve the standardization of the DRAM assay, as discussed in the section added on line 308-312.

      (2) Using sodium arsenite treatment of cells as a means to change the m<sup>5</sup>C status of transcripts through the downregulation of the two major m<sup>5</sup>C writer proteins NSUN2 and NSUN6 is problematic and the conclusions from these experiments are not warranted. Sodium arsenite is a chemical that poisons every protein containing thiol groups. Not only do NSUN proteins contain cysteines but also the base editor fusion proteins. Arsenite will inactivate these proteins, hence the editing frequency will drop, as observed in the experiments shown in Figure 5, which the authors explain with fewer m<sup>5</sup>C sites to be detected by the fusion proteins.

      The authors have not addressed the point made by this reviewer. Instead the authors state that they have not addressed that possibility. They claim that they have revised the results section, but this reviewer can only see the point raised in the conclusions. An experiment would have been to purify base editors via the HA tag and then perform some kind of binding/editing assay in vitro before and after arsenite treatment of cells.

      We appreciate the reviewer’s insightful comment. We fully agree with the concern raised. In the original manuscript, our intention was to use sodium arsenite treatment to downregulate NSUN mediated m<sup>5</sup>C levels and subsequently decrease DRAM editing efficiency, with the aim of monitoring m<sup>5</sup>C dynamics through the DRAM system. However, as the reviewer pointed out, sodium arsenite may inactivate both NSUN proteins and the base editor fusion proteins, and any such inactivation would likely result in a reduced DRAM editing. This confounds the interpretation of our experimental data.

      As demonstrated in Appendix A, western blot analysis confirmed that sodium arsenite indeed decreased the expression of fusion proteins. In addition, we attempted in vitro fusion protein purification using multiple fusion tags (HIS, GST, HA, MBP) for DRAM fusion protein expression, but unfortunately, we were unable to obtain purified proteins. However, using the Promega TNT T7 Rapid Coupled In Vitro Transcription/Translation Kit, we successfully purified the DRAM protein (Appendix B). Despite this success, subsequent in vitro deamination experiments did not yield the expected mutation results (Appendix C), indicating that further optimization is required. This issue is further discussed in line 314-315.

      Taken together, the above evidence supports that the experiment of sodium arsenite treatment was confusing and we determined to remove the corresponding results from the main text of the revised manuscript.

      Author response image 1.

      (3) The authors should move high-confidence editing site data contained in Supplementary Tables 2 and 3 into one of the main Figures to substantiate what is discussed in Figure 4A. However, the data needs to be visualized in another way then excel format. Furthermore, Supplementary Table 2 does not contain a description of the columns, while Supplementary Table 3 contains a single row with letters and numbers.

      The authors have not addressed the point made by this reviewer. Figure 3F shows the screening process for DRAM-seq assays and principles for screening high-confidence genes rather than the data contained in Supplementary Tables 2 and 3 of the former version of this manuscript.

      Thank you for your valuable suggestion. We have visualized the data from Supplementary Tables 2 and 3 in Figure 4A as a circlize diagram (described in lines 213-216), illustrating the distribution of mutation sites detected by the DRAM system across each chromosome. Additionally, to improve the presentation and clarity of the data, we have revised Supplementary Tables 2 and 3 by adding column descriptions, merging the DRAM-ABE and DRAM-CBE sites, and including overlapping m<sup>5</sup>C genes from previous datasets.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      Liver cancer shows a high incidence in males than females with incompletely understood causes. This study utilized a mouse model that lacks the bile acid feedback mechanisms (FXR/SHP DKO mice) to study how dysregulation of bile acid homeostasis and a high circulating bile acid may underlie the gender-dependent prevalence and prognosis of HCC. By transcriptomics analysis comparing male and female mice, unique sets of gene signatures were identified and correlated with HCC outcomes in human patients. The study showed that ovariectomy procedure increased HCC incidence in female FXR/SHP DKO mice that were otherwise resistant to agedependent HCC development, and that removing bile acids by blocking intestine bile acid absorption reduced HCC progression in FXR/SHP DKO mice. Based on these findings, the authors suggest that gender-dependent bile acid metabolism may play a role in the male-dominant HCC incidence, and that reducing bile acid level and signaling may be beneficial in HCC treatment. 

      strengths:

      (1) Chronic liver diseases often proceed the development of liver and bile duct cancer. Advanced chronic liver diseases are often associated with dysregulation of bile acid homeostasis and cholestasis. This study takes advantage of a unique FXR/SHP DKO model that develop high organ bile acid exposure and spontaneous age-dependent HCC development in males but not females to identify unique HCC-associated gene signatures. The study showed that the unique gene signature in female DKO mice that had lower HCC incidence also correlated with lower grade HCC and better survival in human HCC patients. 2. The study also suggests that differentially regulated bile acid signaling or gender-dependent response to altered bile acids may contribute to gender-dependent susceptibility to HCC development and/or progression. 3. The sex-dependent differences in bile acidmediated pathology clearly exist but are still not fully understood at the mechanistic level. Female mice have been shown to be more sensitive to bile acid toxicity in a few cholestasis models, while this study showed a male dominance of bile acid promotion of HCC. This study used ovariectomy to demonstrate that female hormones are possible underlying factors. Future studies are needed to understand the interaction of sex hormones, bile acids, and chronic liver diseases and cancer. 

      We thank Reviewer 1 for their positive and thorough assessment of our manuscript

      Weaknesses:

      (1) HCC shows heterogeneity, and it is unclear what tissues (tumor or normal) were used from the DKO mice and human HCC gene expression dataset to obtain the gene signature, and how the authors reconcile these gene signatures with HCC prognosis.

      Mice studies: Aged DKO mice develop aggressive tumors (major and minor nodules, See Figure 1), and the entire liver is burdened with multiple tumor nodules. It is technically challenging to demarcate the tumor boundaries as most of the surrounding tissues do not display normal tissue architecture. Therefore, livers from age- and sexmatched wild-type C57/BL6 mice were used as control tissue. All the mice were inbred in our facility. Spatial transcriptomics and longitudinal studies are ongoing to collect tumors at earlier time points wherein we can differentiate tumor and non-tumor tissue. 

      Human Studies: We mined five separate clinical data sets. The human HCC gene expression comprised of samples from the (i) National Cancer Institute (NCI) cohort (GEO accession numbers, GSE1898 and GSE4024) and (ii) Korea, (iii) Samsung, (iv) Modena, and (v) Fudan cohorts as previously described (GEO accession numbers, GSE14520, GSE16757, GSE43619, GSE36376, and GSE54236). We have added a new supplemental table 4, giving details of these datasets. Depending on the cohort, they are primarily HCC samples- surgical resections of HCC, control samples, with some tumors and paired non-tumor tissues.

      (2) The authors identified a unique set of gene expression signatures that are linked to HCC patient outcomes, but analysis of these gene sets to understand the causes of cancer promotion is still lacking. The studies of urea cycle metabolism and estrogen signaling were preliminary and inconclusive. These mechanistic aspects may be followed up in revision or future studies.

      We agree. Experiments to elicit HCC causality and promotion are complex, given the heterogeneous nature of liver cancer. Moreover, the length of time (12 months) needed to spontaneously develop cancer in this DKO mouse model makes it challenging. As mentioned by the reviewer, mechanistic studies are ongoing, and longitudinal time course experiments are actively being pursued to delineate causality. Having said that, we mined the TCGA LIHC (The Cancer Genome Atlas Liver Hepatocellular Carcinoma) database to examine the expression of the individual urea cycle genes and found them suppressed in liver tumorigenesis (new Supplementary Figure 4). We also evaluated if estrogen receptor  (Er) targets altered in DKO females (DKO_Estrogen) correlate with overall survival in HCC (new Supplementary Figure 6). We note that Er expression per se is reduced in males and females upon liver tumorigenesis. Also, DKO_Estrogen signature positively corroborated with better overall survival (new Supplementary Figure 6). These findings further bolster the relevance of urea cycle metabolism and estrogen signaling during HCC. 

      (3) While high levels of bile acids are convincingly shown to promote HCC progression, their role in HCC initiation is not established. The DKO model may be limited to conditions of extremely high levels of organ bile acid exposure. The DKO mice do not model the human population of HCC patients with various etiology and shared liver pathology (i.e. cirrhosis). Therefore, high circulating bile acids may not fully explain the male prevalence of HCC incidence.

      We agree with this comment that our studies do not show bile acids can initiate HCC and may act as one of the many factors that contribute to the high male prevalence of HCC. This is exactly the reason why throughout the manuscript we do not write about HCC initiation. To clarify further, in the revised discussion of the manuscript, we have added a sentence to highlight this aspect, “while this study demonstrates bile acids promote HCC progression it does not investigate or provide evidence if excess bile acids are sufficient for HCC initiation.”

      (4) The authors showed lower circulating bile acids and increased fecal bile acid excretion in female mice and hypothesized that this may be a mechanism underlying the lower bile acid exposure that contributed to lower HCC incidence in female DKO mice. Additional analysis of organ bile acids within the enterohepatic circulation may be performed because a more accurate interpretation of the circulating bile acids and fecal bile acids can be made in reference to organ bile acids and total bile acid pool changes in these mice.

      As shown in this manuscript- we provide BA compositional analyses from the liver, serum, urine, and feces (Figures 5 and 6, new Supplementary Figure 8, Supplementary Tables 4 and 5). Unfortunately, we did not collect the intestinal tissue or gallbladders for BA analysis in this study. Separate cohorts of mice are being aged for future BA analyses from different organs within the enterohepatic loop. We thank you for this suggestion. Nevertheless, we have previously measured and reported BA values to be elevated in the intestines and the gall bladder of young DKO mice (PMC3007143).

      Reviewer #2 (Public review):

      Weaknesses:

      (1) The translational value to human HCC is not so strong yet. Authors show that there is a correlation between the female-selective gene signature and low-grade tumors and better survival in HCC patients overall. However, these data do not show whether this signature is more highly correlated with female tumor burden and survival. In other words, whether the mechanisms of female protection may be similar between humans and mice. In that respect, it would also be good to elaborate on whether women have higher fecal BA excretion and lower serum BA concentration.

      The reviewer poses an interesting question to test if the DKO female-specific signatures are altered differently in male vs. female HCC samples. As we found the urea cycle and estrogen signaling to be protective and enriched in our mouse model, we tested their expression pattern using the TCGA-LIHC RNA-seq data. We found urea cycle genes and Er transcripts broadly reduced in tumor samples irrespective of the sex (new Supplementary Figure 4 and Supplementary Figure 6), indicating that these pathways are compromised upon tumorigenesis even in the female livers. 

      While prior studies have shown (i) a smaller BA pool w synthesis in men than women (PMID: 22003820), we did not find a study that systematically investigated BA excretion between the sexes in HCC context. The reviewer is spot on in suggesting BA analysis from HCC and unaffected human fecal samples from both sexes. Designing and performing such studies in the future will provide concrete proof of whether BA excretion protects female livers from developing liver cancer. We thank you for these suggestions.

      (2) The authors should perform a thorough spelling and grammar check.

      We apologize for the typos, which have been fixed, and as suggested by the reviewer, we have performed a grammar check.

      (3) There are quite some errors and inaccuracies in the result section, figures, and legends. The authors should correct this.

      We apologize for the inadvertent errors in the manuscript, and we have clarified these inaccuracies in the revised version. Thank you.

      Reviewer#1 (Recommendations for the authors).

      (1) Figures 1A-F, This statement of altered liver steatosis needs to be further supported by measurement of liver triglycerides. Lower magnification images of Sirius red stain should be shown for better evaluation of liver fibrosis.

      Unfortunately, we did not measure liver triglycerides and sirius red stained samples have faded, and lower magnification is unavailable at this juncture. We have modified our results accordingly.  

      We did not take the gross picture of WT female and DKO female livers in the same frame as shown below. Since the manuscript is focused on male and female differences in liver cancer incidence, we provided DKO male and female liver images as Figure 1D in the paper.

      Author response image 1.

      Gross liver images of a year-old WT and DKO mice which show prominent hepatocarcinogenesis in DKO male mice

      (2) Can the authors clarify if the gene transcriptomics was performed with normal or tumor tissues of DKO mice?

      Gene transcriptomics were performed with the tumor tissue of DKO mice. We have previously published data from younger non tumor bearing DKO male mice (PMCID: PMC3007143). 

      (3) Supplementary Figure 3C. Could the authors confirm if this is F vs M or just DKO female since it does not seem to match the result description in the main text? It is better practice to indicate the sub-panels of the Supplementary Figures in the main text while describing the results.

      As the reviewer correctly points out Supplementary Figure 3C is DKO F vs M signature not DKO_female signature and this has been clarified in the text. We have also included DKO_F data now to reduce the confusion.

      (4) Figure 3. Legend, the data presented are not well explained in the Legend, especially the labeling and what is being presented and compared.

      As suggested by the reviewer, we have modified the legend accordingly.

      (5) Supplementary Table 4 does not contain total serum bile acid as described in the main text.

      We agree with the reviewer. We provided primary and secondary BA concentrations, Supplementary Table 4 (currently Supplementary Table 5 in the revised version): Rows 20 and 21. but not their added total. We have modified the text accordingly.

      (6) Method section: many experiments lack descriptions of details.

      We have added details to the animal experimental design, ER ChIP-PCR, schematics of experiments are included within the main and supplemental figures, metabolomics and BA analysis have been expanded. 

      Reviewer #2 (Recommendations For The Authors):

      General:

      (1) The authors are advised to do a thorough grammar and spelling check.

      We have performed spelling and grammar check as suggested using an online platform Grammarly. Thank You.

      Results:

      (1) Figure 1 o The authors should show in Figure 1D female WT and female DKO liver.

      See Figure 1 added in our responses to point 1 of reviewer 1’s comment.

      In the Figure legend, (A-E) should be replaced by (A+D). 

      Thank you. We have modified it accordingly.

      The authors do not refer to 1J in the text, please add this reference.

      Thank you for pointing it. We have referenced 1J in the text.

      The description of 1H does not elaborate on the sex differences in ALT/AST levels, as this is the focus of the manuscript.

      We have added a sentence to show that the injury markers are higher in DKO males, which is consistent with an advanced disease. Thanks.

      The authors should use the correct nomenclature in Figure 1I/1J (gene vs protein and capitals vs non-capitals).

      The Figure 1I and 1J show gene expression of Fxr and Shp and hence we used the non-capital italicized nomenclature. Thanks.

      (2) Figure 2:

      The x-axis length is different in Figures 2A and 2B. Please correct to visualize the differences between males and females better.

      The x axis length has been fixed as suggested. Thanks

      (3) Figure 3:

      The authors should elaborate on how the patients were assigned to each gene signature. This is not fully clear.

      The gene set obtained from the WT and DKO mice were used. The process used is shown as a schematic in Supplemental Fig 2C and the gene list is included  in an excel sheet as Supplemental table 1. 

      We are curious how these data (F3A-C) would look when separating male and female human patients.

      We performed an overall survival analysis with a subgroup of patients and provide it. We segregated the HCC cohort data on sex and age (>55 yr, since we assumed 55 as an age for menopause) and evaluated the DKO gene signature. Similar to the original figure 3, we find that irrespective of sex, and age, DKO FvsM gene signature corresponds with better overall survival in men and in women. These findings align with the combined analysis in overall survival shown in original Figure 3 of the manuscript, and therefore we did not modify it. If deemed necessary, we are happy to include the figure below to reviewers in the main manuscript.

      Author response image 2.

      Correlation of gene signatures obtained from WT and DKO mouse model with the survival data of HCC patients segregated by age and sex. The Kaplan Meier Survival graphs were generated based on WT and DKO transcriptome changes using five HCC clinical cohorts. Analysis of OS (Overall Survival) in patients ((A) Men and (B) Women) using the gene signatures representative of either male WT or male DKO, female WT or female DKO, and unique changes observed in female DKO mice but not in male DKO mice.

      What was used as the control signature in Figure 3C? Please specify this.

      For Figure 3C we compared the DKO_M signature to that of DKOF vs M signature. These genes are listed as an Excel Sheet (Supplementary Table 1).

      The authors claim that DKO female mice display chronic cholestasis, similar to their male counterparts. Please refer to previous work or show the data.

      Serum BA levels are elevated in DKO females are reported in supplementary table 5 and we find comparable hepatic BA composition in Figure 5 F.

      (4) Figure 4: Labels for the x-axis are missing in Figure 4C. Please add legends or labels to the bars.

      The x axis label is included in the top Serum BAs in (M)

      In Figure 4I, the percentage of input is quite low. An IgG control would show whether recruitment of ERalpha to the shown loci is significant above background levels. Also, ChIP on the OVX liver could serve as a negative control.

      We did use IgG as control pull down and the signals above this background were considered. We have not performed this in OVX, which would be an excellent negative control for future studies. Thank You.

      The results and legends refer to ChIP-qPCR, while methods only mention ChIP-seq.Please adapt.

      We sincerely apologize for the mistake. We used published ChIP-seq to identify putative binding site and then performed ChIP PCR to validate it. We have clarified and rectified this error. Thank You.

      Significance indications in the figure legend do not correspond with significance indications in the figure. Please explain the used significance symbols in the figure in the legend.

      Thank You. The legends and their significance have been matched.

      (5) Figure 5:

      Authors claim lowered total serum BA in females compared to males, and reference to Supplementary Table 4. However, these data are not provided, only percentages and ratios are displayed.

      In the revised version, this has become Table 5. See response to the same concern noted by Reviewer 1, Point 5 above.

      Figure 5D: Are sulphated BA also elevated in WT females? Please provide these data.

      There is no significant urinary excretion of BAs in WT control animals. We have previously measured and found none. But under cholestatic conditions BAs are observed in urine. Therefore, sulphated BA levels were found only in the DKO mice. 

      Figure 5H: Is the fecal BA excretion in WT females also proportionally higher than in males? Please provide these data.

      We were unable to perform the untargeted metabolomics profiling of WT fecal samples. When we measured for BAs in the feces, as expected very low conc were present irrespective of the sex (~0.01 M) and we did not find any sex difference.  Also, prior studies in 129SVJ strain exhibited comparable fecal excretion (PMC150802). We did not find any clinical studies that measured fecal BA between the sexes.

      (6) Figure 6:

      References in the text of the result section to Figure 6 are wrong. The authors should change this.

      Thank You. This has been rectified.

      Significance indications in the legend do not correspond with significance indications in the figure. Please explain the used significance symbols in the figure in the legend.

      Thank You. The legends and their significance have been matched.

      (7) Supplemental Figure 3:

      Please adapt the title of this figure; the sentence is incorrect. The description of this figure is very poor.

      We have modified the legend and the title of the Supplemental Figure 3 to make it more appropriate. Thanks

      Please explain what the blue and red dots represent.

      Each dot in blue and yellow indicate the Bayesian probability generated from our BCCP model.

      What are the bold horizontal lines representing? Why are there no dots in some box plots? Please elaborate.

      The box represents the interquartile range (IQR), encompassing the middle 50% of the data. The bottom and top edges correspond to the 25th and 75th percentiles, respectively, while the bold horizontal line indicates the median value.

      The absence of visible dots in certain categories—particularly in higher CLIP and TNM stages—is due to the small number of patients, all of whom had similar Bayesian prediction probabilities. As these values cluster tightly around the median, the individual dots may be overlapped and hidden behind the median line.

      The figure is not visually easy to understand, please reconsider the representation.  

      We hope the modified figure legends with the explanation of the lines and the points in the graphs increases the clarity and makes them acceptable.

      Please add the DKO_female signature plot.

      We have added these graph to Supplemental figure 3

      (8) Supplemental 4A:

      Fold change at Z-score is missing. This should be added.

      Thank you we have added this information

      (9) Supplemental 5:

      The scale bar is missing. This should be included.

      The figure is now supplemental figure 8 and the scale bar has been added.

      Methods:

      (1) Did the authors use ChIP-sequencing or ChIP-qPCR? Please describe the correct method.

      We apologize for the error. We have used ChIP-PCR and rectified it in our methods and in our response to a figure 4 query.

      (2) It is unclear how the mouse model was generated. Please refer to earlier publications.

      The mice were generated in house at UIUC, and we have added this sentence to the Methods section. The original reference has been cited in the text (PMCID: PMC3007143).

      Discussion:

      (1) The authors claim in the discussion: 'consistently higher recruitment of ER to the classical BA synthetic genes ...' This is not shown in Figure 4I, only ER recruitment to Cyp7a1 is significantly higher in females. Please rephrase.

      We agree and we have modified the sentence Cyp7A1 accounts for ~75% of BA synthesis and is a rate-limiting gene in the classical BA synthesis pathway. 

      (2) The authors could make their statements stronger if they could elaborate on whether women have more fecal BA excretion, and if there are differences in serum BA concentration in HCC between male and female patients. 

      Unfortunately, we were unable to find clinical studies with appropriate controls which examined and reported serum BA in HCC in a sex specific manner.

      In addition, to understand whether the female-specific protections in humans are similar to mice, it would be nice to show correlations of the female-specific mouse signature with male and female liver signatures.

      At this time, we do not have large n numbers of control or precancerous early-stage patient datasets from both sexes to make such comparisons. Nevertheless, there is translational relevance of these sex-specific signature. Figure 2 included in the reviewer response shows that DKO male signature correlates with poor overall survival in males, whereas neither DKO male nor DKO female signature predict outcome in females. In contrast, DKO female-specific gene signature (DKOFvsM) correlates with better overall survival in both men and in women. 

      (3) The authors state in the discussion: 'Currently we do not know how to reconcile this data other than indicating a potential ER independent mechanism.' We do not understand the reasoning behind this statement. Please clarify.

      We find that increased Erα expression in DKO coincides with CA-mediated suppression of BA synthesis genes in the absence of Fxr and Shp. But we also noticed that in OVX DKO mice, Erα expression is blunted, and so is basal BA synthesis gene expression. Putting together these data, it is intriguing that Erα expression correlates both positively and negatively with BA synthesis genes. To reconcile these contrasting results, we have written the following sentence in the discussion.

      “These findings suggest Erα expression is linked to both positive and negative regulation of BA synthesis genes. But we do not know how ER elicits these differential effects on BA synthesis.”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Response to Reviewer 1

      Thank you for your recognition of our revised work.

      Response to Reviewer 2

      It would be useful to have a demonstration of where this model outperforms SaProt systematically, and a discussion about what the success of this model teaches us given there is a similar, previously successful model, SaProt.

      As two concurrent works, ProtSSN and SaProt employ different methods to incorporate the structure information of proteins. Generally speaking, for two deep learning models that are developed during a close period, it is challenging to conclude that one model is systematically superior to another. Nonetheless, on DTm and DDG (the two low-throughput datasets that we constructed), ProtSSN demonstrates better empirical performance than SaProt.  

      Moreover, ProtSSN is more efficient in both training and inference compared to SaProt. In terms of training cost, SaProt uses 40 million protein structures for pretraining (requiring 64 A100 GPUs for three months), whereas ProtSSN requires only about 30,000 crystal structures from the CATH database (trained on a single 3090 GPU for two days). Despite SaProt’s significantly higher training cost, its pretrained version does not exhibit superior performance on low-throughput datasets such as DTm, DDG, and Clinvar. Furthermore, the high training cost limits many users from retraining or fine-tuning the model for specific needs or datasets.

      Regarding the inference cost, ProtSSN requires only one embedding computation for a wild-type protein, regardless of the number of mutants (n). In contrast, SaProt computes a separate embedding and score for each mutant. For instance, when evaluating the scoring performance on ProteinGym, ProtSSN only needs 217 inferences, while SaProt needs more than 2M inferences. This inference speed is important in practice, such as high-throughput design and screening.

      Please remove the reference to previous methods as "few shot". This typically refers to their being trained on experimental data, not their using MSAs. A "few shot" model would be ProteinNPT.

      The definition of "few-shot" we used here is following ESM1v [1]. This concept originates from providing a certain number of examples as input to GPT-3 [2]. In the context of protein deep learning models, MSA serves as the wild-type protein examples.

      Also, Reviewer 1 uses the concept in the same way. 

      “Readers should note that methods labelled as "few-shot" in comparisons do not make use of experimental labels, but rather use sequences inferred as homologous; these sequences are also often available even if the protein has never been experimentally tested.”

      In the main text, we also included this definition as well as the reference of ESM-1v in lines 457-458.

      “We extend the evaluation on ProteinGym v0 to include a comparison of our zero-shot ProtSSN with few-shot learning methods that leverage MSA information of proteins (Meier et al., 2021).”

      (1) Meier J, Rao R, Verkuil R, et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems, 2021.

      (2) Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020.

      Furthermore, I don't think it is fair to state that your method is not comparable to these models -- one can run an MSA just as one can predict a structure. A fairer comparison would be to highlight particular assays for which getting an MSA could be challenging -- Transcription did this by showing that they outperform EVE when MSAs are shallow.

      We recognize that there are often differences in the definitions and classifications of various methodologies. Here, we follow the definitions provided by ProteinGym. As the most comprehensive and large scale open benchmark in the community, we believe this classification scheme should be widely accepted. All classifications are available on the official website of ProteinGym (https://proteingym.org/benchmarks), which categorizes methods into PLMs, Structure-based models, and Alignment-based models. For example, GEMME is classified as an alignment-based model, and MSA Transformer is considered a hybrid model combining alignment and PLM features.

      We believe that methodologies with different inputs and architectures can lead to inherent unfairness. Also, it is generally believed that models including evolutionary relationships tend to outperform end-to-end models due to the extra information and efforts involved during the training phase. Some empirical evidence and discussions are in the ablation studies of retrieval factors in Tranception [3]. Moreover, the choice of MSA search parameters can introduce uncertainty, which could have positive or negative impacts. 

      We showcase the impact of MSA depth on model performance with an additional analysis below. Author response image 1 visualizes the Spearman’s correlation between the scores of each model and the number of MSAs on 217 ProteinGym assays, where each point represents one of 217 assays. The summary correlation of each model with respect to all assays are reported in Author response table 1. These results demonstrate no clear correlation between MSA depth and model performance even for MSA-based models.

      Author response image 1.

      Scatter plots of the number of MSA sequences and spearman’s correlation.

      Author response table 1.

      Spearmar’s score of the number of MSA sequences and the model’s performance.

      (3) Notin P, Dias M, Frazer J, et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. International Conference on Machine Learning, 2022.

      The authors state that DTm and DDG are conceptually appealing because they come from low-throughput assays with lower experimental noise and are also mutations that are particularly chosen to represent the most interesting regions of the protein. I agree with the conceptual appeal but I don't think these claims have been demonstrated in practice. The cited comparison with Frazer as a particularly noisy source of data I think is particularly unconvincing: ClinVar labels are not only rigorously determined from multiple sources of evidence, Frazer et al demonstrates that these labels are actually more reliable than experiment in some cases. They also state that ProteinGym data doesn't come with environmental conditions, but these can be retrieved from the papers the assays came from. The paper would be strengthened by a demonstration of the conceptual benefit of these new datasets, say a comparison of mutations and signal for a protein that may be in one of these datasets vs ProteinGym.

      In the work by Frazer et al. [4], they mentioned that

      "However, these technologies do not easily scale to thousands of proteins, especially not to combinations of variants, and depend critically on the availability of assays that are relevant to or at least associated with human disease phenotypes." 

      It points out that the results of high-throughput experiments are usually based on the design of specific genes (such as BRCA1 and TP53.) and cannot be easily extended to thousands of other genes. At the same time, due to the complexity of the experiment, there may be problems with reproducibility or deviations from clinical relevance.

      This statement aligns with our perspective that high-throughput experiments inherently involve a significant amount of noise and error. It is important to clarify that the noise we discuss here arises from the limitations of high-throughput experiments themselves, instead of from the reliability of the data sources, such as systematic errors in experimental measurements. This latter issue is a complex problem common to all wetlab experiments and falls outside the scope of our study.

      Under this premise, low-throughput datasets like DTm and DDG can be considered to have less noise than high-throughput datasets, as they have undergone manual curation. As for your suggestion, while valuable, unfortunately, we were unable to identify datasets in DTM and DDG that align with those in ProteinGym after a careful search. Thus, we are unable to conduct this comparative experiment at this stage.

      (4) Frazer J, Notin P, Dias M, et al. Disease variant prediction with deep generative models of evolutionary data. Nature, 2021.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #2 (Public Review):

      I would like to express my appreciation for the authors' dedication to revising the manuscript. It is evident that they have thoughtfully addressed numerous concerns I previously raised, significantly contributing to the overall improvement of the manuscript.

      Response: We appreciate the reviewers’ recognition of our efforts in revising the manuscript.

      My primary concern regarding the authors' framing of their findings within the realm of habitual and goal-directed action control persists. I will try explain my point of view and perhaps clarify my concerns. While acknowledging the historical tendency to equate procedural learning with habits, I believe a consensus has gradually emerged among scientists, recognizing a meaningful distinction between habits and skills or procedural learning. I think this distinction is crucial for a comprehensive understanding of human action control. While these constructs share similarities, they should not be used interchangeably. Procedural learning and motor skills can manifest either through intentional and planned actions (i.e., goal-directed) or autonomously and involuntarily (habitual responses).

      Response: We would like to clarify that, contrary to the reviewer’s assertion of a scientific consensus on this matter, the discussion surrounding the similarities and differences between habits and skills remains an ongoing and unresolved topic of interest among scientists (Balleine and Dezfouli, 2019; Du and Haith, 2023; Graybiel and Grafton, 2015; Haith and Krakauer, 2018; Hardwick et al., 2019; Kruglanski and Szumowska, 2020; Robbins and Costa, 2017). We absolutely agree with the reviewer that “Procedural learning and motor skills can manifest either through intentional and planned actions (i.e., goal-directed) or autonomously and involuntarily (habitual responses)”. But so do habits. Some researchers also highlight the intentional/goal-directed nature of habits (e.g., Du and Haith, 2023, “Habits are not automatic” (preprint) or Kruglanski and Szumowska, 2020, “Habitual behavior is goal-driven”: “definitions of habits that include goal independence as a foundational attribute of habits are begging the question; they effectively define away, and hence dispose of, the issue of whether habits are goal-driven (p 1258).” Therefore, there is no clear consensus concerning the concept of habit.

      While we acknowledge the meaningful distinctions between habits and skills, we also recognize a substantial body of literature supporting the overlap between these concepts (cited in our manuscript), particularly at the neural level. The literature clearly indicates that both habits and skills are mediated by subcortical circuits, with a progressive disengagement of cognitive control hubs in frontal and cingulate cortices as repetition evolves. We do not use these concepts interchangeably. Instead, we simply present evidence supporting the assertion that our trained app sequences meet several criteria for their habitual nature.

      Our choice of Balleine and Dezfouli (2018)'s criteria stemmed from the comprehensive nature of their definitions, which effectively synthesized insights from various researchers (Mazar and Wood, 2018; Verplanken et al., 1998; Wood, 2017, etc). Importantly, their list highlights the positive features of habits that were previously overlooked. However, these authors still included a controversial criterion ("habits as insensitive to changes in their relationship to their individual consequences and the value of those consequences"), even though they acknowledged the problems of using outcome devaluation methods and of relying on a null-effect. According to Kruglanski and Szumowska (2020), this criterion is highly problematic as “If, by definition, habits are goalindependent, then any behavior found to be goal-dependent could not be a habit on sheer logical grounds” (p. 1257). In their definition, “habitual behavior is sensitive to the value of the reward (i.e., the goal) it is expected to mediate and is sensitive to the expectancy of goal attainment (i.e., obtainment of the reward via the behavior, p.1265). In fact, some recent analyses of habitual behavior are not using devaluation or revaluation as a criterion (Du and Haith, 2023). This article, for example, ascertains habits using different criteria and provides supporting evidence for trained action sequences being understood as skills, with both goal-directed and habitual components.

      In the discussion of our manuscript, we explicitly acknowledge that the app sequences can be considered habitual or goal-directed in nature and that this terminology does not alter the fact that our overtrained sequences exhibit clear habitual features.

      Watson et al. (2022) aptly detailed my concerns in the following statements: "Defining habits as fluid and quickly deployed movement sequences overlaps with definitions of skills and procedural learning, which are seen by associative learning theorists as different behaviors and fields of research, distinct from habits."

      "...the risk of calling any fluid behavioral repertoire 'habit' is that clarity on what exactly is under investigation and what associative structure underpins the behavior may be lost." I strongly encourage the authors, at the very least, to consider Watson et al.'s (2022) suggestion: "Clearer terminology as to the type of habit under investigation may be required by researchers to ensure that others can assess at a glance what exactly is under investigation (e.g., devaluationinsensitive habits vs. procedural habits)", and to refine their terminology accordingly (to make this distinction clear). I believe adopting clearer terminology in these respects would enhance the positioning of this work within the relevant knowledge landscape and facilitate future investigations in the field.

      Response: We would like to highlight that we have indeed followed Watson et al (2022)’s recommendations on focusing on other features/criteria of habits at the expense of the outcome devaluation/contingency degradation paradigm, which has been more controversial in the human literature. Our manuscript clearly aligns with Watson et al. (2022) ‘s recommendations: “there are many other features of habits that are not captured by the key metrics from outcome devaluation/contingency degradation paradigms such as the speed at which actions are performed and the refined and invariant characteristics of movement sequences (Balleine and Dezfouli, 2019). Attempts are being made to develop novel behavioral tasks that tap into these positive features of habits, and this should be encouraged as should be tasks that are not designed to assess whether that behavior is sensitive to outcome devaluation, but capture the definition of habits through other measures”.

      Regarding the authors' use of Balleine and Dezfouli's (2018) criteria to frame recorded behavior as habitual, as well as to acknowledgment the study's limitations, it's important to highlight that while the authors labelled the fourth criterion (which they were not fulfilling) as "resistance to devaluation," Balleine and Dezfouli (2018) define it as "insensitive to changes in their relationship to their individual consequences and the value of those consequences." In my understanding, this definition is potentially aligned with the authors' re-evaluation test, namely, it is conceptually adequate for evaluating the fourth criterion (which is the most accepted in the field and probably the one that differentiate habits from skills). Notably, during this test, participants exhibited goaldirected behavior.

      The authors characterized this test as possibly assessing arbitration between goal-directed and habitual behavior, stating that participants in both groups "demonstrated the ability to arbitrate between prior automatic actions and new goal-directed ones." In my perspective, there is no justification for calling it a test of arbitration. Notably, the authors inferred that participants were habitual before the test based on some criteria, but then transitioned to goal-directed behavior based on a different criterion. While I agree with the authors' comment that: "Whether the initiation of the trained motor sequences in experiment 3 (arbitration) is underpinned by an action-outcome association (or not) has no bearing on whether those sequences were under stimulus-response control after training (experiment 1)." they implicitly assert a shift from habit to goal-directed behavior without providing evidence that relies on the same probed mechanism. Therefore, I think it would be more cautious to refer to this test as solely an outcome revaluation test. Again, the results of this test, if anything, provide evidence that the fourth criterion was tested but not met, suggesting participants have not become habitual (or at least undermines this option).

      Response: In our previously revised manuscript, we duly acknowledged that the conventional (perhaps nowadays considered outdated) goal devaluation criterion was not met, primarily due to constraints in designing the second part of the study. We did cite evidence from another similar study that had used devaluation app-trained action sequences to demonstrate habitual qualities (but the reviewer ignored this).

      The reviewer points out that we did use a manipulation of goal revaluation in one of the follow-up tests conducted (although this was not a conventional goal revaluation test inasmuch that it was conducted in a novel context). In this test, please note that we used 2 manipulations: monetary and physical effort. Although we did show that subjects, including OCD patients, were apparently goaldirected in the monetary reward manipulation, this was not so clear when goal re-evaluation involved the physical effort expended. In this effort manipulation, participants were less goaloriented and OCD patients preferred to perform the longer, familiar, to the shorter, novel sequence, thus exhibiting significantly greater habitual tendencies, as compared to controls. Hence, we cannot decisively conclude that the action sequence is goal-directed as the reviewer is arguing. In fact, the evidence is equivocal and may reflect both habitual and goal-directed qualities in the performance of this sequence, consistent with recent interpretations of skilled/habitual sequences (Du and Haith, 2023). Relying solely on this partially met criterion to conclude that the app-trained sequences are goal-directed, and therefore not habitual, would be an inaccurate assessment for several reasons: 1) the action sequences did satisfy all other criteria for being habitual; 2) this approach would rest on a problematic foundation for defining habits, as emphasized by Kruglanski & Szumowska (2020); and 3) it would succumb to the pitfall of subscribing to a zero-sum game perspective, as cautioned by various researchers, including the review by Watson et al. (2022) cited by the referee, thus oversimplifying the nuanced nature of human behavior.

      While we have previously complied with the reviewer’s suggestion on relabelling our follow-up test as a “revaluation test” instead of an “arbitration test”, we have now explicitly removed all mentions of the term “arbitration” (which seems to raise concerns) throughout the manuscript. As the reviewer has suggested, we now use a more refined terminology by explicitly referring to the measured behavior as "procedural habits", as he/she suggested. We have also extensively revised the discussion section of our manuscript to incorporate the reviewer’s viewpoint. We hope that these adjustments enhance the clarity and accuracy of our manuscript, addressing the concerns raised during this review process.

      In essence, this is an ontological and semantic matter, that does not alter our findings in any way. Whether the sequences are consider habitual or goal directed, does not change our findings that 1) Both groups displayed equivalent procedural learning and automaticity attainment; 2) OCD patients exhibit greater subjective habitual tendencies via self-reported questionnaires; 3) Patients who had elevated compulsivity and habitual self-reported tendencies engaged significantly more with the motor habit-training app, practiced more and reported symptom relief at the end of the study; 4) these particular patients also show an augmented inclination to attribute higher intrinsic value to familiar actions, a possible mechanism underlying compulsions.

      Reviewer #2 (Recommendations For The Authors):

      A few more small comments (with reference to the point numbers indicated in the rebuttal):

      (14) I am not entirely sure why the suggested analysis is deemed impractical (i.e., why it cannot be performed by "pretending" participants received the points they should have received according to their performance). This can further support (or undermine) the idea of effect of reward on performance rather than just performance on performance.

      Response: We have now conducted this analysis, generating scores for each trial of practices after day 20, when participants no longer gained points for their performance. This analysis assesses whether participants trial-wise behavioral changes exhibit a similar pattern following simulated relative increases or decrease in scores, as if they had been receiving points at this stage. Note that this analysis has fewer trials available, around 50% less on average.

      Before presenting our results, we wish to emphasize the importance of distinguishing between the effects of performance on performance and the effects of reward on performance. In response to a reviewer's suggestion, we assessed the former in the first revision of our manuscript. We normalized the movement time variable and evaluated how normalized behavioral changes responded to score increments and decrements. The results from the original analyses were consistent with those from the normalized data.

      Regarding the phase where participants no longer received scores, we believe this phase primarily helps us understand the impact of 'predicted' or 'learned' rewards on performance. Once participants have learned the simple association between faster performance and larger scores, they can be expected to continue exhibiting the reward sensitivity effects described in our main analysis. We consider it is not feasible to assess the effects of performance on performance during the reward removal phase, which occurs after 20 days. Therefore, the following results pertain to how the learned associations between faster movement times and scores persist in influencing behavior, even when explicit scores are no longer displayed on the screen.

      Results: The main results of the effect of reward on behavioral changes persist, supporting that relative increases or decreases in scores (real or imagined/inferred) modulate behavioral adaptations trial-by-trial in a consistent manner across both cohorts. The direction of the effects of reward is the same as in the main analyses presented in the manuscript: larger mean behavioral changes (smaller std) following ∆R- . First, concerning changes in “normalized” movement time (MT) trial-by-trial, we conducted a 2 x 2 factorial analysis of the centroid of the Gaussian distributions with the same factors Reward, Group and Bin. This analysis demonstrated a significant main effect of Reward (P = 2e-16), but not of Group (P = 0.974) or Bin (P = 0.281). There were no significant interactions between factors. The main Reward effect can be observed in the top panel of the figure below. The same analysis applied to the spread (std) of the Gaussian distributions revealed a significant main effect of Reward (P = 0.000213), with no additional main effects or interactions.

      Author response image 1.

      Next, conducting the same 2 x 2 factorial analyses on the centroid and spread of the Gaussian distributions fitted to the Consistency data, we also obtained a robust significant main effect of Reward. For the centroid variable, we obtained a significant main effect of Reward (P = 0.0109) and Group (P = 0.0294), while Bin and the factor interactions were non-significant. See the top panel of the figure below.

      On the other hand, Reward also modulated significantly the spread of the Gaussian distributions fitted to the Consistency data, P = 0.00498. There were no additional significant main effects or interactions. See the bottom panel in the figure below.

      Note that here the factorial analysis was performed on the logarithmic transformation of the std.

      Author response image 2.

      (16) I find this result interesting and I think it might be worthwhile to include it in the paper.

      Response: We have now included this result in our revised manuscript (page 28)

      (18) I referred to this sentence: "The app preferred sequence was their preferred putative habitual sequence while the 'any 6' or 'any 3'-move sequences were the goal-seeking sequences." In my understanding, this implies one choice is habitual and another indicates goal-directedness.

      One last small comment:
In the Discussion it is stated: "Moreover, when faced with a choice between the familiar and a new, less effort-demanding sequence, the OCD group leaned toward the former, likely due to its inherent value. These insights align with the theory of goal-direction/habit imbalance in OCD (Gillan et al., 2016), underscoring the dominance of habits in particular settings where they might hold intrinsic value."

      This could equally be interpreted as goal-directed behavior, so I do not think there is conclusive support for this claim.

      Response: The choice of the familiar/trained sequence, as opposed to the 'any 6' or 'any 3'-move sequences cannot be explicitly considered goal-directed: firstly, because the app familiar sequences were associated with less monetary reward (in the any-6 condition), and secondly, because participants would clearly need more effort and time to perform them. Even though these were automatic, it would still be much easier and faster to simply tap one finger sequentially 6 times (any6) or 3 times (any-3). Therefore, the choice for the app-sequence would not be optimal/goaldirected. In this sense, that choice aligns with the current theory of goal-direction/habit imbalance of OCD. We found that OCD patients prefer to perform the trained app sequences in the physical effort manipulation (any-3 condition). While this, on one hand cannot be explicitly considered a goal-directed choice, we agree that there is another possible goal involved here, which links to the intrinsic value associated to the familiar sequence. In this sense the action could potentially be considered goal-directed. This highlights the difficulty of this concept of value and agrees with: 1) Hommel and Wiers (2017): “Human behavior is commonly not driven by one but by many overlapping motives . . . and actions are commonly embedded into larger-scale activities with multiple goals defined at different levels. As a consequence, even successful satiation of one goal or motive is unlikely to also eliminate all the others(p. 942) and 2) Kruglanski & Szumowska (2020)’s account that “habits that may be unwanted from the perspective of an outsider and hence “irrational” or purposeless, may be highly wanted from the perspective of the individual for whom a habit is functional in achieving some goal” (p. 1262) and therefore habits are goal-driven.

      References:

      Balleine BW, Dezfouli A. 2019. Hierarchical Action Control: Adaptive Collaboration Between Actions and Habits. Front Psychol 10:2735. doi:10.3389/fpsyg.2019.02735

      Du Y, Haith A. 2023. Habits are not automatic. doi:10.31234/osf.io/gncsf Graybiel AM, Grafton ST. 2015. The Striatum: Where Skills and Habits Meet. Cold Spring Harb Perspect Biol 7:a021691. doi:10.1101/cshperspect.a021691

      Haith AM, Krakauer JW. 2018. The multiple effects of practice: skill, habit and reduced cognitive load. Current Opinion in Behavioral Sciences 20:196–201. doi:10.1016/j.cobeha.2018.01.015

      Hardwick RM, Forrence AD, Krakauer JW, Haith AM. 2019. Time-dependent competition between goal-directed and habitual response preparation. Nat Hum Behav 1–11. doi:10.1038/s41562019-0725-0

      Hommel B, Wiers RW. 2017. Towards a Unitary Approach to Human Action Control. Trends Cogn Sci 21:940–949. doi:10.1016/j.tics.2017.09.009

      Kruglanski AW, Szumowska E. 2020. Habitual Behavior Is Goal-Driven. Perspect Psychol Sci 15:1256– 1271. doi:10.1177/1745691620917676

      Mazar A, Wood W. 2018. Defining Habit in Psychology In: Verplanken B, editor. The Psychology of Habit: Theory, Mechanisms, Change, and Contexts. Cham: Springer International Publishing. pp. 13–29. doi:10.1007/978-3-319-97529-0_2

      Robbins TW, Costa RM. 2017. Habits. Current Biology 27:R1200–R1206. doi:10.1016/j.cub.2017.09.060

      Verplanken B, Aarts H, van Knippenberg A, Moonen A. 1998. Habit versus planned behaviour: a field experiment. Br J Soc Psychol 37 ( Pt 1):111–128. doi:10.1111/j.2044-8309.1998.tb01160.x

      Watson P, O’Callaghan C, Perkes I, Bradfield L, Turner K. 2022. Making habits measurable beyond what they are not: A focus on associative dual-process models. Neurosci Biobehav Rev 142:104869. doi:10.1016/j.neubiorev.2022.104869

      Wood W. 2017. Habit in Personality and Social Psychology. Pers Soc Psychol Rev 21:389–403. doi:10.1177/1088868317720362

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #2 (Public Review):

      Summary:

      The manuscript by Kelbert et al. presents results on the involvement of the yeast transcription factor Sfp1 in the stabilisation of transcripts whose synthesis it stimulates. Sfp1 is known to affect the synthesis of a number of important cellular transcripts, such as many of those that code for ribosomal proteins. The hypothesis that a transcription factor can remain bound to the nascent transcript and affect its cytoplasmic half-life is attractive. However, the association of Sfp1 with cytoplasmic transcripts remains to be validated, as explained in the following comments:

      A two-hybrid based assay for protein-protein interactions identified Sfp1, a transcription factor known for its effects on ribosomal protein gene expression, as interacting with Rpb4, a subunit of RNA polymerase II. Classical two-hybrid experiments depend on the presence of the tested proteins in the nucleus of yeast cells, suggesting that the observed interaction occurs in the nucleus. Unfortunately, the two-hybrid method cannot determine whether the interaction is direct or mediated by nucleic acids. The revised version of the manuscript now states that the observed interaction could be indirect.

      To understand to which RNA Sfp1 might bind, the authors used an N-terminally tagged fusion protein in a cross-linking and purification experiment. This method identified 264 transcripts for which the CRAC signal was considered positive and which mostly correspond to abundant mRNAs, including 74 ribosomal protein mRNAs or metabolic enzyme-abundant mRNAs such as PGK1. The authors did not provide evidence for the specificity of the observed CRAC signal, in particular what would be the background of a similar experiment performed without UV cross-linking. This is crucial, as Figure S2G shows very localized and sharp peaks for the CRAC signal, often associated with over-amplification of weak signal during sequencing library preparation.

      (1) To rule out possible PCR artifacts, we used a UMI (Unique Molecular Identifier) scan. UMIs are short, random sequences added to each molecule by the 5’ adapter to uniquely tag them. After PCR amplification and alignment to the reference genome, groups of reads with identical UMIs represent only one unique original molecule. Thus, UMIs allow distinguishing between original molecules and PCR duplicates, effectively eliminating the duplicates.

      (2) Looking closely at the peaks using the IGV browser, we noticed that the reads are by no means identical. Each carrying a mutation [probably due to the cross-linking] in a different position and having different length. Note that the reads are highly reproducible in two replicate.

      (3) CRAC+ genes do not all fall into the category of highly transcribed genes.  On the contrary, as depicted in Figure 6A (green dots), it is evident that CRAC+ genes exhibit a diverse range of Rpb3 ChIP and GRO signals. Furthermore, as illustrated in Figure 7A, when comparing CRAC+ to Q1 (the most highly transcribed genes), it becomes evident that the Rpb4/Rpb3 profile of CRAC+ genes is not a result of high transcription levels.

      (4) Only a portion of the RiBi mRNAs binds Sfp1, despite similar expression of all RiBi.

      (5) The CRAC+ genes represent a distinct group with many unique features. Moreover, many CRAC+ genes do not fall into the category of highly transcribed genes.

      (6) The biological significance of the 262 CRAC+ mRNAs was demonstrated by various experiments; all are inconsistent with technical flaws. Some examples are:

      a) Fig. 2a and B show that most reads of CRAC+ mRNA were mapped to specific location – close the pA sites.

      b) Fig. 2C shows that most reads of CRAC+ mRNA were mapped to specific RNA motif.

      c) Most RiBi CRAC+ promoter contain Rap1 binding sites (p= 1.9x10-22), whereas the vast majority of RiBi CRAC- promoters do not contain Rap1 binding site. (Fig. 3C).

      d) Fig. 4A shows that RiBi CRAC+ mRNAs become destabilized due to Sfp1 deletion, whereas RiBi CRAC- mRNAs do not. Fig. 4B shows similar results due to

      e) Fig. 6B shows that the impact of Sfp1 on backtracking is substantially higher for CRAC+ than for CRAC- genes. This is most clearly visible in RiBi genes.

      f) Fig. 7A shows that the Sfp1-dependent changes along the transcription units is substantially more rigorous for CRAC+ than for CRAC-.

      g) Fig. S4B Shows that chromatin binding profile of Sfp1 is different for CRAC+ and CRAC- genes

      In a validation experiment, the presence of several mRNAs in a purified SFP1 fraction was measured at levels that reflect the relative levels of RNA in a total RNA extract. Negative controls showing that abundant mRNAs not found in the CRAC experiment were clearly depleted from the purified fraction with Sfp1 would be crucial to assess the specificity of the observed protein-RNA interactions (to complement Fig. 2D).

      GPP1, a highly expressed genes, is not to be pulled down by Sfp1 (Fig. 2D). GPP1 (alias RHR2) was included in our Table S2 as one of the 264 CRAC+ genes, having a low CRAC value. However, when we inspected GPP1 results using the IGV browser, we realized that the few reads mapped to GPP1 are actually anti-sense to GPP1 (perhaps they belong to the neighboring RPL34B genes, which is convergently transcribed to GPP1) (see Fig. 1 at the bottom of the document). Thus, GPP1 is not a CRAC+ gene and would now serve as a control. See  We changed the text accordingly (see page 11 blue sentences). In light of this observation, we checked other CRAC genes and found that, except for ALG2, they all contain sense reads (some contain both sense and anti-sense reads). ALG2 and GPP1 were removed leaving 262 CRAC+ genes.

      The CRAC-selected mRNAs were enriched for genes whose expression was previously shown to be upregulated upon Sfp1 overexpression (Albert et al., 2019). The presence of unspliced RPL30 pre-mRNA in the Sfp1 purification was interpreted as a sign of co-transcriptional assembly of Sfp1 into mRNA, but in the absence of valid negative controls, this hypothesis would require further experimental validation. Also, whether the fraction of mRNA bound by Sfp1 is nuclear or cytoplasmic is unclear.

      Further experimental validation was provided in some of our figures (e.g., Fig. 5C, Fig. 3B).

      We argue that Sfp1 binds RNA co-transcriptionally and accompanies the mRNA till its demise in the cytoplasm: Co-transcriptional binding is shown in: (I) a drop in the Sfp1 ChIP-exo signal that coincides with the position of Sfp1 binding site in the RNA (Fig. 5C), demonstrating a movement of Sfp1 from chromatin to the transcript, (II) the dependence of Sfp1 RNA-binding on the promoter (Fig. 3B) and binding of intron-containing RNA. Taken together these 3 different experiments demonstrate that Sfp1 binds Pol II transcript co-transcriptionally.  Association of Sfp1 with cytoplasmic mRNAs is shown in the following experiments: (I) Figure 2D shows that Sfp1 pulled down full length RNA, strongly suggesting that these RNA are mature cytoplasmic mRNAs. (II) mRNA encoding ribosomal proteins, which belong to the CRAC+ mRNAs group are degraded by Xrn1 in the cytoplasm (Bresson et al., Mol Cell 2020). The capacity of Sfp1 to regulates this process (Fig. 4A-D) is therefore consistent with cytoplasmic activity of Sfp1. (III) The effect of Sfp1 on deadenylation (Fig. 4D), a cytoplasmic process, is also consistent with cytoplasmic activity of Sfp1. 

      To address the important question of whether co-transcriptional assembly of Spf1 with transcripts could alter their stability, the authors first used a reporter system in which the RPL30 transcription unit is transferred to vectors under different transcriptional contexts, as previously described by the Choder laboratory (Bregman et al. 2011). While RPL30 expressed under an ACT1 promoter was barely detectable, the highest levels of RNA were observed in the context of the native upstream RPL30 sequence when Rap1 binding sites were also present. Sfp1 showed better association with reporter mRNAs containing Rap1 binding sites in the promoter region. Removal of the Rap1 binding sites from the reporter vector also led to a drastic decrease in reporter mRNA levels. Co-purification of reporter RNA with Sfp1 was only observed when Rap1 binding sites were included in the reporter. Negative controls for all the purification experiments might be useful.

      In the swapping experiment, the plasmid lacking RapBS serves as the control for the one with RapBS and vice versa (see Bregman et al., 2011). Remember, that all these contracts give rise to identical RNA. Indeed, RabBS affects both mRNA synthesis and decay, therefore the controls are not ideal. However, see next section.

      More importantly, in Fig. 3B “Input” panel, one can see that the RNA level of “construct F” was higher than the level of “construct E”. Despite this difference, only the RNA encoded by construct E was detected in the IP panel. This clearly shows that the detection of the RNA was not merely a result of its expression level.

      To complement the biochemical data presented in the first part of the manuscript, the authors turned to the deletion or rapid depletion of SFP1 and used labelling experiments to assess changes in the rate of synthesis, abundance and decay of mRNAs under these conditions. An important observation was that in the absence of Sfp1, mRNAs encoding ribosomal protein genes not only had a reduced synthesis rate, but also an increased degradation rate. This important observation needs careful validation,

      Indeed, we do provide validations in Fig. 4C Fig. 4D Fig. S3A and during the revision we included an  additional validation as Fig. S3B. Of note, we strongly suspect that GRO is among the most reliable approaches to determine half-lives (see our response in the first revision letter).

      As genomic run-on experiments were used to measure half-lives, and this particular method was found to give results that correlated poorly with other measures of half-life in yeast (e.g. Chappelboim et al., 2022 for a comparison). As an additional validation, a temperature shift to 42{degree sign}C was used to show that , for specific ribosomal protein mRNA, the degradation was faster, assuming that transcription stops at that temperature. It would be important to cite and discuss the work from the Tollervey laboratory showing that a temperature shift to 42{degree sign}C leads to a strong and specific decrease in ribosomal protein mRNA levels, probably through an accelerated RNA degradation (Bresson et al., Mol Cell 2020, e.g. Fig 5E).

      This was cited. Thank you. 

      Finally, the conclusion that mRNA deadenylation rate is altered in the absence of Sfp1, is difficult to assess from the presented results (Fig. 3D).

      This type of experiment was popular in the past. The results in the literature are similar to ours (in fact, ours are nicer). Please check the papers cited in our MS and a number of papers by Roy Parker.

      The effects of SFP1 on transcription were investigated by chromatin purification with Rpb3, a subunit of RNA polymerase, and the results were compared with synthesis rates determined by genomic run-on experiments. The decrease in polII presence on transcripts in the absence of SFP1 was not accompanied by a marked decrease in transcript output, suggesting an effect of Sfp1 in ensuring robust transcription and avoiding RNA polymerase backtracking. To further investigate the phenotypes associated with the depletion or absence of Sfp1, the authors examined the presence of Rpb4 along transcription units compared to Rpb3. An effect of spf1 deficiency was that this ratio, which decreased from the start of transcription towards the end of transcripts, increased slightly. To what extent this result is important for the main message of the manuscript is unclear.

      Suggestions: a) please clearly indicate in the figures when they correspond to reanalyses of published results.

      This was done.

      b) In table S2, it would be important to mention what the results represent and what statistics were used for the selection of "positive" hits. 

      This was discussed in the text.

      Strengths:

      - Diversity of experimental approaches used.

      - Validation of large-scale results with appropriate reporters.

      Weaknesses:

      - Lack of controls for the CRAC results and lack of negative controls for the co-purification experiments that were used to validate specific mRNA targets potentially bound by Sfp1.

      - Several conclusions are derived from complex correlative analyses that fully depend on the validity of the aforementioned Sfp1-mRNA interactions.

      We hope that our responses to Reviewer 2's thoughtful comments have rulled out concerns regarding the lack of controls.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      Please review the text for spelling errors. While not mandatory, wig or begraph files for the CRAC results would be very useful for the readers.

      Author response image 1.

      A snapshot of IGV GPP1 locus showing that all the reads are anti-sense (pointing at the opposite direction of the gene (the gene arrows [white arrows over blue, at the bottom] are pointing to the right whereas the reads’ orientations are pointing to the left).

    1. Author Response

      The following is the authors’ response to the current reviews.

      We confirm that that “count-down” parameter, mentioned by reviewer 1, is indeed counted from the first lockdown day and increases continuously, even when we do not have any data – and that this is clearly written in the manuscript.


      The following is the authors’ response to the original reviews.

      Reviewer 1:

      (Note, while these authors do reference Derryberry et al., I thought that there could have been much more direct comparison between the results of the two approaches).

      We added some more discussion of the differences between the papers.

      One important drawback of the approach, which potentially calls into question the authors' conclusions, is that the acoustic sampling only occurred during the pandemic: for several lockdown periods and then for a period of 10 days immediately after the end of the final lockdown period in May of 2020. Several relevant things changed from March to May of 2020, most notably the shift from spring to summer, and the accompanying shift into and through the breeding season (differing for each of the three focal species). Although the statistical methods included an attempt to address this, neither the inclusion of the "count down" variable nor the temperature variable could account for any non-linear effects of breeding phenology on vocal activity. I found the reliance on temperature particularly troubling, because despite the authors' claims that it was "a good proxy of seasonality", an examination of the temperature data revealed a considerable non-linear pattern across much of the study duration. In addition, using a period immediately after the lockdowns as a "no-lockdown" control meant that any lingering or delayed effects of human activity changes in the preceding two months could still have been relevant (not to mention the fact that despite the end of an official lockdown, the pandemic still had dramatic effects on human activity during late May 2020).

      In general, the reviewer is correct, and we reformulated some of the text to more carefully address these points. However, we would like to note two things: (1) Changes occurred rapidly with birds rapidly changing their behavior – this is one of the main conclusions of our study, i.e., that urban dwelling animals are highly plastic in behavior. So that lingering effects were unlikely. (2) Changes occurred in both directions, and thus seasonality (which is expected to have a uni-directional effect) cannot explain everything we observed. We are not sure what the reviewer means by ‘considerable non-linear patterns’ when referring to the temperature. Except for ~5 days with temperatures that exceeded the expected average by 3-4 degrees, the temperature increased approximately linearly during the period as expected from seasonality (see Author response image 1). Following the reviewer’s comment, we tested whether exclusion of data from these days changes the results and found no change.

      We would like to note that in terms of breeding, all birds were within the same state during both the lockdown and the non-lockdown periods. Parakeets and crows have a long breeding season Feb-end of June with one cycle. They will stay around the nest throughout this season and especially in the peak of the season March-May. Prinias start slightly later at the beginning of March with 2-3 cycles till end of June.

      Regarding the comment about human activity, as we now also note in the manuscript, reality in Israel was actually the opposite of the reviewer’s suggestion with people returning to normal behavior towards the end of the lockdown (even before its official removal). We believe that this added noise to our results, and that the effect of the lockdown was probably higher than we observed.

      Author response image 1.

      Another weakness of the current version of the manuscript is the use of a supposed "contradiction" in the existing literature to create the context for the present study. Although the various studies cited do have many differences in their results, those other papers lay out many nuanced hypotheses for those differences. Almost none of the studies cited in this manuscript actually reported blanket increases or decreases in urban birds, as suggested here, and each of those papers includes examples of species that showed different responses. To suggest that they are on opposite sides of a supposed dichotomy is a misrepresentation. Many of those other studies also included a larger number of different species, whereas this study focused on three. Finally, this study was completed at a much finer spatial scale than most others and was examining micro-habitat differences rather than patterns apparent across landscapes. I believe that highlighting differences in scale to explain nuanced differences among studies is a much better approach that more accurately adds to the body of literature.

      We thank the reviewer for this good feedback and revised the manuscript, accordingly, placing more emphasis on the micro-scale of this study.

      Finally a note on L244-247: I would recommend against discounting the possibility that lockdowns resulted in changes to the birds' vocal acoustics, as Derryberry et al. 2020 found, especially while suggesting that their results were the effects of signal processing artifacts. Audio analysis is not my area of expertise, but isn't it possible that the birds did increase call intensity, but were simply not willing (or able) to increase it to the same degree as the additional ambient noise?

      This is an important question. The fact is that when ambient noise increases (at the relevant frequency channels), then the measured vocalizations will also increase. There is no way to separate the two effects. Thus, as scientists, when we cannot measure an effect, it is safer not to suggest an effect. Unfortunately, most studies that claim an increase in vocalizations’ intensity in noise, do not account for this potential artifact (and most of them do not estimate noise at a species-specific level as we have done). This has created a lot of “noise” in the field. We do not want to criticize the Derryberry results without analyzing the data, but from reading their methods it does not seem like they took the noise into account in their acoustic measurements. But if you look at their figure 4A you will see a lot of variability in measuring the minimum frequency – which could be strongly affected by ambient noise.

      In light of the above, we thus prefer to be careful and not to state changes that are probably false. We added some of this information to the manuscript. We also added the linear equations to the graph (in the caption of figure 3) where it can be seen that the slope is always <=1.

      Reviewer 2:

      The explanation of methods can be improved. For example, it is not clear if data were low-pass filtered before resampling to avoid aliasing.

      We edited the methods and hopefully they are clearer now. Regarding the specific question – yes, an LPF was applied to prevent aliasing before the resampling. This information was added to the manuscript.

      It is quite possible that birds move into the trees and further from the recorders with human activity. Since sound level decreases by the square of the distance of the source from the recorders, this could significantly affect the data. As indicated in the Discussion, this is a significant parameter that could not be controlled.

      The reviewer is correct, and we addressed this point. Such biases could arise with any type of surveying including manual transects (except for perhaps, placing tags on the animals). We note that we only analyzed high SNR signals and that the species we selected somewhat overcome this bias – both crows and parakeets are not shy and Prinias are anyway shy and prefer to not be out in the open. We would also expect to see a stronger effect for human speech if this was a central phenomenon, and we did not see this, but of course this might have affected our results.

      In interpreting the data, the authors mention the effect of human activity on bird vocalizations in the context of inter-species predator-prey interactions; however, the presence of humans could also modify intraspecies interactions by acting as triggers for communication of warning and alarm, and/or food calls (as may sometimes be the case) to conspecifics. Along the same lines, it is important to have a better understanding of the behavioral significance of the syllables used to monitor animal activity in the present study.

      We agree with this point and added more discussion of both this potential bias and the type of syllables that were analyzed.

      Another potential effect that may influence the results but is difficult to study, relates to the examination of vocalizations near to the ambient noise level. This is the bandwidth of sound levels where most significant changes may occur, for example, due to the Lombard effect demonstrated in bird and bat species. However, as indicated, these are also more difficult to track and quantify. Moreover, human generated noise, other than speech, may be a more relevant factor in influencing acoustic activity of different bird species. Speech, per se, similar to the vocalizations of many other species, may simply enrich the acoustic environment so that the effects observed in the present study may be transient without significant long-term consequences.

      We note that we already included a noise parameter (in addition to human speech) in the original manuscript. Following the reviewer’s comment, we examined another factor, namely we replaced the previous ambient noise parameter with an estimate of ambient noise under 1kHz which should reflect most anthropogenic noise (not restricted to human speech). This model gave very similar results to the previous one (which is not very surprising as noise is usually correlated). We added this information to the revised manuscript, and we now also added examples of anthropogenic noise to the supplementary materials (Fig. S8). In general, we accept the comments made by the reviewer, but would like to emphasize that we only analyze high SNR vocalization (and not vocalizations that were close to the noise level). This strategy should have overcome biases that resulted from slight changes in ambient noise.

      In general, the authors achieved their aim of illustrating the complexity of the effect of human activity on animal behavior. At the same time, their study also made it clear that estimating such effects is not simple given the dynamics of animal behavior. For example, seasonality, temperature changes, animal migration and movement, as well as interspecies interactions, such as related to predator-prey behavior, and inter/intra-species competition in other respects can all play into site-specific changes in the vocal activity of a particular species.

      We completely agree and tried to further emphasize this in the revised manuscript. This is one of the main conclusions of this study – we should be careful when reaching conclusions.

      Although the methods used in the present study are statistically rigorous, a multivariate approach and visualization techniques afforded by principal components analysis and multidimensional scaling methods may be more effective in communicating the overall results.

      Following this comment, we ran a discriminant function analysis with the parameters of the best model (site category, ambient noise, human activity, temperature and lockdown state) with the task of classifying the level of bird activity. The DFA analysis managed to classify activity significantly above chance and the weights of the parameters revealed some insight about their relative importance. We added this information to the revised manuscript

      Suggestions for improvement:

      In Figure 2, the labeling of the Y-axis in the right panel should be moved to the left, similar to A and C. This will provide clear separation between the two side-to-side panels.

      Revised

      In Figure 3, it will be good to see the regression lines (as dashed lines) separately for the lockdown and no-lockdown conditions in addition to the overall effect.

      Revised

      Editor:

      Limitations

      Scale: The study's limited spatial and temporal scale was not addressed by the authors, which contrasts with the broader scope of other cited studies. To enhance the significance of the study, acknowledging and clearly highlighting this limitation, along with its potential caveats, modifications in the language used throughout the text would be beneficial. Furthermore, although the authors examined slight variations in habitat, it is important to note that all sites were primarily located within an urban landscape.

      We revised the manuscript accordingly.

      Control period: The control period is significantly shorter than the lockdown treatment period and occurs at a different time of year, potentially impacting the vocalization patterns of birds due to different annual cycle stages. It is crucial to consider that the control period falls within the pandemic timeframe despite being shortly after the lockdowns ended.

      Revised – we included a control comparison to periods of equal length within the lockdown. People gradually stopped obeying the lockdown regulations before its removal so in fact, the official removal date is probably an overestimate for the effect of the lockdown. We now explain this.

      Recommendations

      Human-generated noise, beyond speech, might have a greater influence on the acoustic activity of various bird species, but previous studies lacked detailed human activity data. Instead of solely noting the number of human talkers, the authors could quantify other aspects of human activity such as vehicles or overall anthropogenic noise volume. Exploring the relationships between these factors and bird activity at a fine scale, while disentangling them from bird detection, would be compelling. It is important to consider the potential difficulty in resolving other anthropogenic sounds within a specific bandwidth, which could be demonstrated to readers through spectrograms and potential post-pandemic changes. Such information, including daily coefficient of variation/fluctuation rather than absolute frequency spectra, could provide valuable insights.

      We note that we have already included an ambient noise factor (in addition to human speech) in the previous version. Following the reviewers’ comments, we examined another factor, namely we replaced the current ambient noise parameter with the ambient noise under 1kHz which should reflect most of anthropogenic noise (not restricted to human speech). This model gave very similar results to the previous one (which is not surprising as noise is usually correlated). We also added several spectrograms in the Supplementary material that show examples of different types of noise.

      Authors should limit their data interpretation to the impact of lockdown on behavioral responses within small-scale variations in habitat. A key critique is the assumption that activity changes solely resulted from the lockdown, disregarding other environmental factors and phenology.

      Following the editor comment we realized that our conclusion\assertations were not clear. We never claimed that activity changes solely resulted from the lockdown. While revsing the mansucirpt we ensurred that we show a significant effect of temperature, ambient noise and human activity – all of which are not dependent on lockdown. We made an effort to emphasize the complexity of the system. We show that the lockdown seemed to have an additional impact, but we never claimed it was the only factor.

      To address this, the authors could compare acoustic monitoring data within a shorter timeframe before and after the lockdown (20 days), while also controlling for temperature effects, to strengthen the validity of their claims. They would need to explain in their discussion, however, that such a comparison may still be confounded by any carry-over effects from the 10 days of treatment.

      This analysis would be difficult because although the lockdown was officially removed at a specific date, it was gradually less respected by the citizens and thus the last period of the lockdown was somewhere between lockdown and no-lockdown. This is why we chose the approach of taking 10 days randomly from within the lockdown period and comparing them with the 10 post-lockdown days. We now clarify the reason better.

      An option is that authors could frame their analysis as a study of the behavior of wildlife coming out of a lockdown, to draw a distinction from other studies that compared pre-pandemic data to pandemic data.

      Good idea – revised.

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Point to point response for the editors

      We are deeply grateful for the time you have devoted to reviewing this manuscript, and we sincerely thank you. Your insightful feedback has been instrumental in enhancing the quality of our work.

      In the revised version of the manuscript, we have carefully addressed each of the concerns you raised. Below, you will find a detailed summary of how your feedback has been incorporated to improve the overall content and clarity of the document.

      1. P2RX7 effects: In Figure 2, the vehicle treated P2RX7 knockout (panel M) shows an Ashcroft score of about 1.5 after BLM. Comparing this to the Ashcroft score of 3 after BLM in the wildtype (panel C) suggests that P2RX7 deletion is an effective way to reduce fibrosis by half!.

      The argument that HEI3090 also reduces fibrosis by activating P2RX7 is of course very difficult to convey and it seems contradictory that P2RX7 deletion and P2RX7 activation can be both anti-fibrotic. This is an unusual claim and confuses the reviewers as well as the future readers.

      This has many important health implications because activating an inflammatory pathway via P2RX7 and IL-18 could be risky in terms of a fibrosis treatment as inflammatory activation can also worsen fibrosis. The authors' own P2RX7 KO data (untreated vehicle groups) indeed confirms that P2RX7 can be pro-fibrotic.

      We thank the editors for their comment highlighting the lack of clarity in our message. Indeed, we verified whether the antifibrotic action of HEI3090 depends on the expression of P2RX7 by inducing lung fibrosis in P2RX7 KO mice. In doing so, we initially observed that P2RX7 plays a role in the development of BLM-induced lung fibrosis. This is illustrated by a decrease of 50% in the Ashcroft score, as shown in Figure 2M and Supplemental Figure 2C of the revised manuscript.

      To increase the clarity of your message, we added in the text the following paragraph:

      "We further verified whether the antifibrotic action of HEI3090 depends on the expression of P2RX7 by inducing lung fibrosis in p2rx7 knockout (KO) mice. In doing so, we initially observed that P2RX7 plays a role in the development of BLM-induced lung fibrosis. This is illustrated by a decrease of 50% in the Ashcroft score, with a mean value of 1.7 in P2RX7 knockout mice compared to 3 in wild-type mice (Figure 2M and Supplemental Figure 2C). It is important to note that p2rx7 -/- mice still exhibit signs of lung fibrosis, such as thickening of the alveolar wall and a reduction in free air space, in comparison to naïve mice that received PBS instead of BLM (see Supplemental Figure 2A). This result confirms a previous report indicating that BLM-induced lung fibrosis partially depends on the activation of the P2RX7/pannexin-1 axis, leading to the production of IL-1β in the lung. Additionally, in contrast to the observations in WT mice, HEI3090 failed to attenuate the remaining lung fibrosis in p2rx7 -/- mice, as measured by the Ashcroft score (Figure 2M), the percentage of lung tissue with fibrotic lesions, or the intensity of collagen fibers (Supplemental Figure 2D). These results show that P2RX7 alone participates in fibrosis and that HEI3090 exerts a specific antifibrotic effect through this receptor (see Supplemental Figure 2C)."

      Since we used the HEI3090 compound in this study and to be closer to the results, we have replaced the title of 2 chapters in the results section as followed:

      “HEI3090 inhibits the onset of pulmonary fibrosis in the bleomycin mouse model” instead of P2RX7 activation inhibits the onset of pulmonary fibrosis in the bleomycin mouse model and “HEI3090 shapes immune cell infiltration in the lungs" instead of P2RX7 activation shapes immune cell infiltration in the lungs

      We concur that the observation of both anti-fibrotic effects following P2RX7 deletion and P2RX7 activation appears contradictory. This specific aspect has been thoroughly addressed and extensively discussed in the revised manuscript.

      “A major unmet need in the field of IPF is new treatment to fight this uncurable disease. In this preclinical study, we demonstrate the ability of immune cells to limit lung fibrosis progression. Based on the hypothesis that a local activation of a T cell immune response and upregulation of IFN-γ production has antifibrotic proprieties, we used the HEI3090 positive modulator of the purinergic receptor P2RX7, previously developed in our laboratory (Douguet et al., 2021), to demonstrate that activation of the P2RX7/IL-18 pathway attenuates lung fibrosis in the bleomycin mouse model. We have demonstrated that lung fibrosis progression is inhibited by HEI3090 in the fibrotic phase but also in the acute phase of the BLM fibrosis mouse model, i.e. during the period of inflammation. This lung fibrosis mouse model commonly employed in preclinical investigations, has recently been recognized as the optimal model for studying IPF (Jenkins et al., 2017). In this model, the intrapulmonary administration of BLM induces DNA damage in alveolar epithelial type 1 cells, triggering cellular demise and the release of ATP. The extracellular release of ATP from injured cells activates the P2RX7/pannexin 1 axis, initiating the maturation of IL1β and subsequent induction of inflammation and fibrosis. In line with this, mice lacking P2RX7 exhibited reduced neutrophil counts in their bronchoalveolar fluids and decreased levels of IL1β in their lungs compared to WT mice (Riteau et al., 2010). Based on these findings, Riteau and colleagues postulated that the inhibition of P2RX7 activity may offer a potential strategy for the therapeutic control of fibrosis in lung injury. In the present study we provided strong evidence showing that selective activation of P2RX7 on immune cells, through the use of HEI3090, can dampen inflammation and fibrosis by releasing IL-18. The efficacy of HEI3090 to inhibit lung fibrosis was evaluated histologically on the whole lung’s surface by evaluating the severity of fibrosis using three independent approaches applied to the whole lung, the Ashcroft score, quantification of fibroblasts/myofibroblasts (CD140a) and polarized-light microscopy of Sirius Red staining to quantify collagen fibers. All these methods of fibrosis assessment revealed that HEI3090 exerts an inhibitory effect on lung fibrosis, underscoring the necessity for a thorough pre-clinical assessment of HEI3090's mode of action. Notably, HEI3090 functions as an activator, rather than an inhibitor, of P2RX7, further emphasizing the importance of elucidating its intricate mechanisms.”

      We trust that the detailed explanation provided therein will adequately persuade both the reviewers and future readers.

      1. The statistical concerns are based on the phrasing of "the experiment was stopped when significantly statistical results were observed". This is different from the power analysis approach that the authors describe in their latest rebuttal. However, it raises the question why the power analysis was performed using "on a one-way ANOVA analysis comparing in each experiment the vehicle and the treated group". The analyses in the manuscript use the Mann-Whitney test for several comparisons which ahs the assumption that the samples do NOT have a normal distribution. An ANOVA and t-tests have the assumption that samples are normally distributed. If the power analysis and "statistical forecasting" assumed a normal distribution and used an ANOVA, then shouldn't all the analyses also use a statistical test appropriate for normally distributed samples such as ANOVA and t-tests?

      Several of the data points in the figures seem to be normally distributed and therefore t-test for two group comparisons would be more appropriate. The most rigorous approach would be to check for normal distribution before choosing the correct statistical test and using the t-test/ANOVA in normally distributed data as well as Mann-Whitney for non-normally distributed data.

      We described in the Material and Method section of the revised manuscript our approach to determine the size of experimental group.

      “The determination of experimental group sizes involved conducting a pilot experiment with four mice in each group. Subsequently, a power analysis, based on the pilot experiment's findings (which revealed a 40% difference with a standard error of 0.9, α risk of 0.05, and power of 0.8), was performed to ascertain the appropriate group size for studying the effects of HEI3090 on BLM-induced lung fibrosis. The results of the pilot experiment and power analysis indicated that a group size of four mice was sufficient to characterize the observed effects. For each full-scale experiment, we initiated the study with 6 to 8 mice per group, ensuring a minimum of 5 mice in each group for robust statistical analysis. Additionally, we systematically employed the ROULT method to identify and subsequently exclude any outliers present in each experiment before conducting statistical analyses”.

      We now described in the Material and Method section how we carried out the statistical analyses.

      “Quantitative data were described and presented graphically as medians and interquartiles or means and standard deviations. The distribution normality was tested with the Shapiro's test and homoscedasticity with a Bartlett's test. For two categories, statistical comparisons were performed using the Student's t-test or the Mann–Whitney's test. For three and more categories, analysis of variance (ANOVA) or non-parametric data with Kruskal–Wallis was performed to test variables expressed as categories versus continuous variables. If this test was significant, we used the Tukey's test to compare these categories and the Bonferroni’s test to adjust the significant threshold. For the Gene Set Enrichment Analyses (GSEA), bilateral Kolmogorov–Smirnov test, and false discovery rate (FDR) were used. All statistical analyses were performed by biostatistician using Prism8 program from GraphPad software. Tests of significance was two-tailed and considered significant with an alpha level of P < 0.05. (graphically: * for P < 0.05, ** for P < 0.01, *** for P < 0.001).”

      We also added in the legend of each figure, the statistical analysis used to determine each p-values.

      1. Adoptive transfer: The concerns of the reviewers include an unclear analysis of the effects of adoptive transfer itself and the approaches used to analyze the data independent of the HEI3090 effect. For example, in Figure 4, the adoptive transfer IL18-/- cells (vehicle group) leads to an Ashcroft score of about 1 and among the lowest of the BLM exposed mice. Does that mean that IL18 is pro-fibrotic and that its absence is beneficial? If yes, it would go against the core premise of the study that IL18 is beneficial. Statistical comparisons of the all the vehicle conditions in the adoptive transfer would help clarify whether adoptive transfer of NLRP3-/-, IL18-/- in wild-type and P2RX7-/- mice reduces or increases fibrosis. Such multiple comparisons are necessary to fully understand the adoptive transfer studies and would also require the appropriate statistical test with corrections for multiple comparisons such as Kruskal-Wallis for data without normal distribution and ANOVA with post hoc correction for normal distribution.

      We added a new paragraph in the revised version of the manuscript to explain the adoptive transfer approach.

      “We wanted to further investigate the mechanism of action of HEI3090 by identifying the cellular compartment and signaling pathway required for its activity. Since the expression of P2RX7 and the P2RX7-dependent release of IL-18 are mostly associated with immune cells (Ferrari et al., 2006), and since HEI3090 shapes the lung immune landscape (Figure 3), we investigated whether immune cells were required for the antifibrotic effect of HEI3090. To do so, we conducted adoptive transfer experiments wherein immune cells from a donor mouse were intravenously injected one day before BLM administration into an acceptor mouse. The intravenous injection route was chosen as it is a standard method for targeting the lungs, as previously documented (Wei and Zhao, 2014). This approach was previously used with success in our laboratory (Douguet et al., 2021). It is noteworthy that this adoptive transfer approach did not influence the response to HEI3090. This was observed consistently in both p2rx7 -/- mice and p2rx7 -/- mice that received splenocytes of the same genetic background. In both cases, HEI3090 failed to mitigate lung fibrosis, as depicted in Figure 2M and Supplemental Figures 2D and 6A and B.”

      We added the Supplemental Figure 7 showing that the genetic background does not impact lung fibrosis at steady step levels where p-values were analyzed by one-way ANOVA, with Kruskal-Wallis test for multiple comparisons.

      Author response image 1.

      Supplemental Figure 7 : The genetic background does not impact lung fibrosis at steady step levels. p2rx7-/- mice were given 3.106 WT, nlrp3-/ , i118-/ or illb -l- splenocytes i_v_ one day prior to BLM delivery (i_n_ 2.5 LJ/kg) p2rx7-/- mice or p2rx7-/- mice adoptively transferred with splenocytes from indicated genetic background were treated daily i.p with mg/kg HE13090 or vehicle for 14 days. Fibrosis score assessed by the Ashcroft method. P-values were analyzed on all treated and non treated groups by one-way ANOVA, with Kruskal-Wallis test for multiple comparisons. The violin plot illustrates the distribution of Ashcroft scores across indicated experimental groups. The width of the violin at each point represents the density of data, and the central line indicates the median expression level. Each point represents one biological replicate. ns, not significant

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      In this study, Marocco and colleages perform a deep characterization of the complex molecular mechanism guiding the recognition of a particular CELLmotif previously identified in hepatocytes in another publication. Having miR-155-3p with or without this CELLmotif as initial focus, authors identify 21 proteins differentially binding to these two miRNA versions. From these, they decided to focus on PCBP2. They elegantly demonstrate PCBP2 binding to miR-155-3p WT version but not to CELLmotif-mutated version. miR-155-3p contains a hEXOmotif identified in a different report, whose recognition is largely mediated by another RNA-binding protein called SYNCRIP. Interestingly, mutation of the hEXOmotif contained in miR-155-3p did not only blunt SYNCRIP binding, but also PCBP2 binding despite the maintenance of the CELLmotif. This indicates that somehow SYNCRIP binding is a pre-requisite for PCBP2 binding. EMSA assay confirms that SYNCRIP is necessary for PCBP2 binding to miR-155-3p, while PCBP2 is not needed for SYNCRIP binding. Then authors aim to extend these finding to other miRNAs containing both motifs. For that, they perform a small-RNA-Seq of EVs released from cells knockdown for PCBP2 versus control cells, identifying a subset of miRNAs whose expression either increases or decreases. The assumption is that those miRNAs containing PCBP2-binding CELLmotif should now be less retained in the cell and go more to extracellular vesicles, thus reflecting a higher EV expression. The specific subset of miRNAs having both the CELLmotif and hEXOmotif (9 miRNAs) whose expressions increase in EVs due to PCBP2 reduction is also affected by knocking-down SYNCRIP in the sense that reduction of SYNCRIP leads to lower EV sorting. Further experiments confirm that PCBP2 and SYNCRIP bind to these 9 miRNAs and that knocking down SYNCRIP impairs their EV sorting.

      In the revised manuscript, the authors have addressed most of my concerns and questions. I believe the new experiments provide stronger support for their claims. My only remaining concern is the lack of clarity in the replicates for the EMSA experiment. The one shown in the manuscript is clear; however, the other three replicates hardly show that knocking down SYNCRIP has an effect on PCBP2 binding. Even worse is the fact that these replicates do not support at all that PCBP2 silencing has no effect on SYNCRIP binding, as the bands for those types of samples are, in most of the cases, not visible. I think the authors should work on repeating a couple of times EMSA experiment.

      We thank this Reviewer for having appreciated the novelty and the robustness of our data. In accordance with the Reviewer’s concern, we repeated the EMSA assay, specifically to address the PCBP2-independent SYNCRIP binding. In Author response image 1, we report the new EMSA replicates (top), the quantification of each signal (bottom) and the mean of EMSA signals relative to the three independent experiments (right). We hope that the new evidence will meet the required standards.

      Author response image 1.

      Reviewer #2 (Public review):

      Summary:

      The author of this manuscript aimed to uncover the mechanisms behind miRNA retention within cells. They identified PCBP2 as a crucial factor in this process, revealing a novel role for RNAbinding proteins. Additionally, the study discovered that SYNCRIP is essential for PCBP2's function, demonstrating the cooperative interaction between these two proteins. This research not only sheds light on the intricate dynamics of miRNA retention but also emphasizes the importance of protein interactions in regulating miRNA behavior within cells.

      Strengths:

      This paper makes important progress in understanding how miRNAs are kept inside cells. It identifies PCBP2 as a key player in this process, showing a new role for proteins that bind RNA. The study also finds that SYNCRIP is needed for PCBP2 to work, highlighting how these proteins work together. These discoveries not only improve our knowledge of miRNA behavior but also suggest new ways to develop treatments by controlling miRNA locations to influence cell communication in diseases. The use of liver cell models and thorough experiments ensures the results are reliable and show their potential for RNA-based therapies

      Weaknesses:

      The manuscript is well-structured and presents compelling data, but I noticed a few minor corrections that could further enhance its clarity:

      Figure References: In the response to Reviewer 1, the comment states, "It's not Panel C, it's Panel A of Figure 1"-this should be cross-checked for consistency.

      Supplementary Figure 2 is labeled as "Panel A"-please verify if additional panels (B, C, etc.) are intended.

      Western Blot Quality: The Alix WB shows some background noise. A repeat with optimized conditions (or inclusion of a cleaner replicate) would strengthen the data. Adding statistical analysis for all WBs would also reinforce robustness.

      These are relatively small refinements, and the manuscript is already in excellent shape. With these adjustments, it will be even stronger.

      We deeply thank this Reviewer for having considered this new version of the manuscript and for having described its shape as excellent. In order to address the Reviewer’s concerns, we crosschecked the consistency of the described figures’ panels described in the text accordingly. Regarding the qualitative analysis of EV markers, we repeated the western blot analysis with optimized conditions as suggested and included the new panel (Author response image 2) in the supplementary figure 2, allowing to appreciate the signal relative to ALIX expression.

      Author response image 2.

       

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Careful reading is required to rectify typo errors.

      We thank the Reviewer for this suggestion. We amended the text to rectify typo errors.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review): 

      The reviewer retained most of their comments from the previous reviewing round. In order to meet these comments and to further examine the dynamic nature of threat omission-related fMRI responses, we now re-analyzed our fMRI results using the single trial estimates. The results of these additional analyses are added below in our response to the recommendations for the authors of reviewer 1. However, we do want to reiterate that there was a factually incorrect statement concerning our design in the reviewer’s initial comments. Specifically, the reviewer wrote that “25% of shocks are omitted, regardless of whether subjects are told that the probability is 100%, 75%, 50%, 25%, or 0%.” We want to repeat that this is not what we did. 100% trials were always reinforced (100% reinforcement rate); 0% trials were never reinforced (0% reinforcement rate). For all other instructed probability levels (25%, 50%, 75%), the stimulation was delivered in 25% of the trials (25% reinforcement rate). We have elaborated on this misconception in our previous letter and have added this information more explicitly in the previous revision of the manuscript (e.g., lines 125-129; 223-224; 486-492).   

      Reviewer #1 (Recommendations For The Authors): 

      I do not have any further recommendations, although I believe an analysis of learning-related changes is still possible with the trial-wise estimates from unreinforced trials. The authors' response does not clarify whether they tested for interactions with run, and thus the fact that there are main effects does not preclude learning. I kept my original comments regarding limitations, with the exception of the suggestion to modify the title. 

      We thank the reviewer for this recommendation. In line with their suggestion, we have now reanalyzed our main ROI results using the trial-by-trial estimates we obtained from the firstlevel omission>baseline contrasts. Specifically, we extracted beta-estimates from each ROI and entered them into the same Probability x Intensity x Run LMM we used for the relief and SCR analyses. Results from these analyses (in the full sample) were similar to our main results. For the VTA/SN model, we found main effects of Probability (F = 3.12, p = .04), and Intensity (F = 7.15, p < .001) (in the model where influential outliers were rescored to 2SD from mean). There was no main effect of Run (F = 0.92, p = .43) and no Probability x Run interaction (F = 1.24, p = .28). If the experienced contingency would have interfered with the instructions, there should have been a Probability x Run interaction (with the effect of Probability only being present in the first runs). Since we did not observe such an interaction, our results indicate that even though some learning might still have taken place, the main effect of Probability remained present throughout the task.  

      There is an important side note regarding these analyses: For the first level GLM estimation, we concatenated the functional runs and accounted for baseline differences between runs by adding run-specific intercepts as regressors of no-interest. Hence, any potential main effect of run was likely modeled out at first level. This might explain why, in contrast to the rating and SCR results (see Supplemental Figure 5), we found no main effect of Run. Nevertheless, interaction effects should not be affected by including these run-specific intercepts.

      Note that when we ran the single-trial analysis for the ventral putamen ROI, the effect of intensity became significant (F = 3.89, p = .02). Results neither changed for the NAc, nor the vmPFC ROIs.  

      Reviewer #2 (Public Review): 

      Comments on revised version: 

      I want to thank the authors for their thorough and comprehensive work in revising this manuscript. I agree with the authors that learning paradigms might not be a necessity when it comes to study the PE signals, but I don't particularly agree with some of the responses in the rebuttal letter ("Furthermore, conditioning paradigms generally only include one level of aversive outcome: the electrical stimulation is either delivered or omitted."). This is of course correct description for the conditioning paradigm, but the same can be said for an instructed design: the aversive outcome was either delivered or not. That being said, adopting the instructed design itself is legitimate in my opinion. 

      We thank the reviewer for this comment. We have now modified the phrasing of this argument to clarify our reasoning (see lines 102-104: “First, these only included one level of aversive outcome: the electrical stimulation was either delivered at a fixed intensity, or omitted; but the intensity of the stimulation was never experimentally manipulated within the same task.”).  

      The reason why we mentioned that “the aversive outcome is either delivered or omitted” is because in most contemporary conditioning paradigms only one level of aversive US is used. In these cases, it is therefore not possible to investigate the effect of US Intensity. In our paradigm, we included multiple levels of aversive US, allowing us to assess how the level of aversiveness influences threat omission responding. It is indeed true that each level was delivered or not. However, our data clearly (and robustly across experiments, see Willems & Vervliet, 2021) demonstrate that the effects of the instructed and perceived unpleasantness of the US (as operationalized by the mean reported US unpleasantness during the task) on the reported relief and the omission fMRI responses are stronger than the effect of instructed probability.  

      My main concern, which the authors spent quite some length in the rebuttal letter to address, still remains about the validity for different instructed probabilities. Although subjects were told that the trials were independent, the big difference between 75% and 25% would more than likely confuse the subjects, especially given that most of us would fall prey to the Gambler's fallacy (or the law of small numbers) to some degree. When the instruction and subjective experience collides, some form of inference or learning must have occurred, making the otherwise straightforward analysis more complex. Therefore, I believe that a more rigorous/quantitative learning modeling work can dramatically improve the validity of the results. Of course, I also realize how much extra work is needed to append the computational part but without it there is always a theoretical loophole in the current experimental design. 

      We agree with the reviewer that some learning may have occurred in our task. However, we believe the most important question in relation to our study is: to what extent did this learning influence our manipulations of interest?  

      In our reply to reviewer 1, we already showed that a re-analysis of the fMRI results using the trial-by-trial estimates of the omission contrasts revealed no Probability x Run interaction, suggesting that – overall – the probability effect remained stable over the course of the experiment. However, inspired by the alternative explanation that was proposed by this reviewer, we now also assessed the role of the Gambler’s fallacy in a separate set of analyses. Indeed, it is possible that participants start to expect a stimulation more after more time has passed since the last stimulation was experienced. To test this alternative hypothesis, we specified two new regressors that calculated for each trial of each participant how many trials had passed since the last stimulation (or since the beginning of the experiment) either overall (across all trials of all probability types; hence called the overall-lag regressor) or per probability level (across trials of each probability type separately; hence called the lag-per-probability regressor). For both regressors a value of 0 indicates that the previous trial was either a stimulation trial or the start of experiment, a value of 1 means that the last stimulation trial was 2 trials ago, etc.  

      The results of these additional analyses are added in a supplemental note (see supplemental note 6), and referred to in the main text (see lines 231-236: “Likewise, a post-hoc trial-by-trial analysis of the omission-related fMRI activations confirmed that the Probability effect for the VTA/SN activations was stable over the course of the experiment (no Probability x Run interaction) and remained present when accounting for the Gambler’s fallacy (i.e., the possibility that participants start to expect a stimulation more when more time has passed since the last stimulation was experienced) (see supplemental note 6). Overall, these post-hoc analyses further confirm the PE-profile of omission-related VTA/SN responses”.  

      Addition to supplemental material (pages 16-18)

      Supplemental Note 6: The effect of Run and the Gambler’s Fallacy 

      A question that was raised by the reviewers was whether omission-related responses could be influenced by dynamical learning or the Gambler’s Fallacy, which might have affected the effectiveness of the Probability manipulation.  

      Inspired by this question, we exploratorily assessed the role of the Gambler’s Fallacy and the effects of Run in a separate set of analyses. Indeed, it is possible that participants start to expect a stimulation more when more time has passed since the last stimulation was experienced. To test this alternative hypothesis, we specified two new regressors that calculated for each trial of each participant how many trials had passed since the last stimulation (or since the beginning of the experiment) either overall (across all trials of all probability types; hence called the overall-lag regressor) or per probability level (across trials of each probability type separately; hence called the lag-per-probability regressor). For both regressors a value of 0 indicates that the previous trial was either a stimulation trial or the start of experiment, a value of 1 means that the last stimulation trial was 2 trials ago, etc.  

      The new models including these regressors for each omission response type (i.e., omission-related activations for each ROI, relief, and omission-SCR) were specified as follows:   

      (1) For the overall lag:

      Omission response ~ Probability * Intensity * Run + US-unpleasantness + Overall-lag + (1|Subject).  

      (2) For the lag per probability level:

      Omission response ~ Probability * Intensity * Run + US-unpleasantness + Lag-perprobability : Probability + (1|Subject).  

      Where US-unpleasantness scores were mean-centered across participants; “*” represents main effects and interactions, and “:” represents an interaction (without main effect). Note that we only included an interaction for the lag-per-probability model to estimate separate lag-parameters for each probability level.  

      The results of these analyses are presented in the tables below. Overall, we found that adding these lag-regressors to the model did not alter our main results. That is: for the VTA/SN, relief and omission-SCR, the main effects of Probability and Intensity remained. Interestingly, the overall-lag-effect itself was significant for VTA/SN activations and omission SCR, indicating that VTA/SN activations were larger when more time had passed since the last stimulation (beta = 0.19), whereas SCR were smaller when more time had passed (beta = -0.03). This pattern is reminiscent of the Perruchet effect, namely that the explicit expectancy of a US increases over a run of non-reinforced trials (in line with the gambler’s fallacy effect) whereas the conditioned physiological response to the conditional stimulus declines (in line with an extinction effect, Perruchet, 1985; McAndrew, Jones, McLaren, & McLaren, 2012). Thus, the observed dissociation between the VTA/SN activations and omission SCR might similarly point to two distinctive processes where VTA/SN activations are more dependent on a consciously controlled process that is subjected to the gambler’s fallacy, whereas the strength of the omission SCR responses is more dependent on an automatic associative process that is subjected to extinction. Importantly, however, even though the temporal distance to the last stimulation had these opposing effects on VTA/SN activations and omission SCRs, the main effects of the probability manipulation remained significant for both outcome variables. This means that the core results of our study still hold.   

      Next to the overall-lag effect, the lag-per-probability regressor was only significant for the vmPFC. A follow-up of the beta estimates of the lag-per-probability regressors for each probability level revealed that vmPFC activations increased with increasing temporal distance from the stimulation, but only for the 50% trials (beta = 0.47, t = 2.75, p < .01), and not the 25% (beta = 0.25, t = 1.49, p = .14) or the 75% trials (beta = 0.28, t = 1.62, p = .10).

      Author response table 1.

      F-statistics and corresponding p-values from the overall lag model. (*) F-test and p-values were based on the model where outliers were rescored to 2SD from the mean. Note that when retaining the influential outliers for this model, the p-value of the probability effect was p = .06. For all other outcome variables, rescoring the outliers did not change the results. Significant effects are indicated in bold.

      Author response table 2.

      F-statistics and corresponding p-values from the lag per probability level model. (*) F-test and p-values were based on the model where outliers were rescored to 2SD from the mean. Note that when retaining the influential outliers for this model, the p-value of the Intensity x Run interaction was p = .05. For all other outcome variables, rescoring the outliers did not change the results. Significant effects are indicated in bold.

      As the authors mentioned in the rebuttal letter, "selecting participants only if their anticipatory SCR monotonically increased with each increase in instructed probability 0% < 25% < 50% < 75% < 100%, N = 11 participants", only ~1/3 of the subjects actually showed strong evidence for the validity of the instructions. This further raises the question of whether the instructed design, due to the interference of false instruction and the dynamic learning among trials, is solid enough to test the hypothesis .  

      We agree with the reviewer that a monotonic increase in anticipatory SCR with increasing probability instructions would provide the strongest evidence that the manipulation worked. However, it is well known that SCR is a noisy measure, and so the chances to see this monotonic increase are rather small, even if the underlying threat anticipation increases monotonically. Furthermore, between-subject variation is substantial in physiological measures, and it is not uncommon to observe, e.g., differential fear conditioning in one measure, but not in another (Lonsdorf & Merz, 2017). It is therefore not so surprising that ‘only’ 1/3 of our participants showed the perfect pattern of monotonically increasing SCR with increasing probability instructions. That being said, it is also important to note that not all participants were considered for these follow-up analyses because valid SCR data was not always available.

      Specifically, N = 4 participants were identified as anticipation non-responders (i.e. participant with smaller average SCR to the clock on 100% than on 0% trials; pre-registered criterium) and were excluded from the SCR-related analyses, and N = 1 participant had missing data due to technical difficulties. This means that only 26 (and not 31) participants were considered for the post hoc analyses. Taking this information into account, this means that 21 out of 26 participants (approximately 80%) showed stronger anticipatory SCR following 75% instructions compared to 25% instructions and that  11 out of 26 participants (approximately 40%) even showed the monotonical increase in their anticipatory SCR (see supplemental figure 4). Furthermore, although anticipatory SCR gradually decreased over the course of the experiment, there was no Run x Probability interaction, indicating that the instructions remained stable throughout the task (see supplemental figure 3).  

      Reviewer #2 (Recommendations For The Authors):

      A more operational approach might be to break the trials into different sections along the timeline and examine how much the results might have been affected across time. I expect the manipulation checks would hold for the first one or two runs and the authors then would have good reasons to focus on the behavioral and imaging results for those runs. 

      This recommendation resembles the recommendation by reviewer 1. In our reply to reviewer 1, we showed the results of a re-analysis of the fMRI data using the trial-by-trial estimates of the omission contrasts, which revealed no Probability x Run interaction, suggesting that – overall - the probability effect remained (more or less) stable over the course of the experiment.  For a more in depth discussion of the results of this additional analysis, we refer to our answer to reviewer 1.  

      Reviewer #3 (Public Review): 

      Comments on revised version: 

      The authors were extremely responsive to the comments and provided a comprehensive rebuttal letter with a lot of detail to address the comments. The authors clarified their methodology, and rationale for their task design, which required some more explanation (at least for me) to understand. Some of the design elements were not clear to me in the original paper. 

      The initial framing for their study is still in the domain of learning. The paper starts off with a description of extinction as the prime example of when threat is omitted. This could lead a reader to think the paper would speak to the role of prediction errors in extinction learning processes. But this is not their goal, as they emphasize repeatedly in their rebuttal letter. The revision also now details how using a conditioning/extinction framework doesn't suit their experimental needs. 

      We thank the reviewer for pointing out this potential cause of confusion. We have now rewritten the starting paragraph of the introduction to more closely focus on prediction errors, and only discuss fear extinction as a potential paradigm that has been used to study the role of threat omission PE for fear extinction learning (see lines 40-55). We hope that these adaptations are sufficient to prevent any false expectations. However, as we have mentioned in our previous response letter, not talking about fear extinction at all would also not make sense in our opinion, since most of the knowledge we have gained about threat omission prediction errors to date is based on studies that employed these paradigms.  

      Adaptation in the revised manuscript (lines 40-55):  

      “We experience pleasurable relief when an expected threat stays away1. This relief indicates that the outcome we experienced (“nothing”) was better than we expected it to be (“threat”). Such a mismatch between expectation and outcome is generally regarded as the trigger for new learning, and is typically formalized as the prediction error (PE) that determines how much there can be learned in any given situation2. Over the last two decades, the PE elicited by the absence of expected threat (threat omission PE) has received increasing scientific interest, because it is thought to play a central role in learning of safety. Impaired safety learning is one of the core features of clinical anxiety4. A better understanding of how the threat omission PE is processed in the brain may therefore be key to optimizing therapeutic efforts to boost safety learning. Yet, despite its theoretical and clinical importance, research on how the threat omission PE is computed in the brain is only emerging.  

      To date, the threat omission PE has mainly been studied using fear extinction paradigms that mimic safety learning by repeatedly confronting a human or animal with a threat predicting cue (conditional stimulus, CS; e.g. a tone) in the absence of a previously associated aversive event (unconditional stimulus, US; e.g., an electrical stimulation). These (primarily non-human) studies have revealed that there are striking similarities between the PE elicited by unexpected threat omission and the PE elicited by unexpected reward.”

      It is reasonable to develop a new task to answer their experimental questions. By no means is there a requirement to use a conditioning/extinction paradigm to address their questions. As they say, "it is not necessary to adopt a learning paradigm to study omission responses", which I agree with.  But the authors seem to want to have it both ways: they frame their paper around how important prediction errors are to extinction processes, but then go out of their way to say how they can't test their hypotheses with a learning paradigm.

      Part of their argument that they needed to develop their own task "outside of a learning context" goes as follows: 

      (1) "...conditioning paradigms generally only include one level of aversive outcome: the electrical stimulation is either delivered or omitted. As a result, the magnitude-related axiom cannot be tested." 

      (2) "....in conditioning tasks people generally learn fast, rendering relatively few trials on which the prediction is violated. As a result, there is generally little intra-individual variability in the PE responses" 

      (3) "...because of the relatively low signal to noise ratio in fMRI measures, fear extinction studies often pool across trials to compare omission-related activity between early and late extinction, which further reduces the necessary variability to properly evaluate the probability axiom" 

      These points seem to hinge on how tasks are "generally" constructed. However, there are many adaptations to learning tasks:

      (1) There is no rule that conditioning can't include different levels of aversive outcomes following different cues. In fact, their own design uses multiple cues that signal different intensities and probabilities. Saying that conditioning "generally only include one level of aversive outcome" is not an explanation for why "these paradigms are not tailored" for their research purposes. There are also several conditioning studies that have used different cues to signal different outcome probabilities. This is not uncommon, and in fact is what they use in their study, only with an instruction rather than through learning through experience, per se.

      (2) Conditioning/extinction doesn't have to occur fast. Just because people "generally learn fast" doesn't mean this has to be the case. Experiments can be designed to make learning more challenging or take longer (e.g., partial reinforcement). And there can be intra-individual differences in conditioning and extinction, especially if some cues have a lower probability of predicting the US than others. Again, because most conditioning tasks are usually constructed in a fairly simplistic manner doesn't negate the utility of learning paradigms to address PEaxioms.

      (3) Many studies have tracked trial-by-trial BOLD signal in learning studies (e.g., using parametric modulation). Again, just because other studies "often pool across trials" is not an explanation for these paradigms being ill-suited to study prediction errors. Indeed, most computational models used in fMRI are predicated on analyzing data at the trial level. 

      We thank the reviewer for these remarks. The “fear conditioning and extinction paradigms” that we were referring to in this paragraph were the ones that have been used to study threat omission PE responses in previous research (e.g., Raczka et al., 2011; Thiele et al. 2021; Lange et al. 2020; Esser et al., 2021; Papalini et al., 2021; Vervliet et al. 2017). These studies have mainly used differential/multiple-cue protocols where either one (or two) CS+  and one CS- are trained in an acquisition phase and extinguished in the next phase. Thus, in these paradigms: (1) only one level of aversive US is used; and (2) as safety learning develops over the course of extinction, there are relatively few omission trials during which “large” threat omission PEs can be observed (e.g. from the 24 CS+ trials that were used during extinction in Esser et al., the steepest decreases in expectancy – and thus the largest PE – were found in first 6 trials); and (3) there was never absolute certainty that the stimulation will no longer follow. Some of these studies have indeed estimated the threat omission PE during the extinction phase based on learning models, and have entered these estimates as parametric modulators to CS-offset regressors. This is very informative. However, the exact model that was used differed per study (e.g. Rescorla-Wagner in Raczka et al. and Thiele et al.; or a Rescorla- Wagner–Pearce- Hall hybrid model in Esser et al.). We wanted to analyze threat omission-responses without commitment to a particular learning model. Thus, in order to examine how threat omissionresponses vary as a function of probability-related expectations, a paradigm that has multiple probability levels is recommended (e.g. Rutledge et al., 2010; Ojala et al., 2022)

      The reviewer rightfully pointed out that conditioning paradigms (more generally) can be tailored to fit our purposes as well. Still, when doing so, the same adaptations as we outlined above need to be considered: i.e. include different levels of US intensity; different levels of probability; and conditions with full certainty about the US (non)occurrence. In our attempt to keep the experimental design as simple and straightforward as possible, we decided to rely on instructions for this purpose, rather than to train 3 (US levels) x 5 (reinforcement levels) = 15 different CSs. It is certainly possible to train multiple CSs of varying reinforcement rates (e.g. Grings et al. 1971, Ojala et al., 2022). However, given that US-expectation on each trial would primarily depend on the individual learning processes of the participants, using a conditioning task would make it more difficult to maintain experimental control over the level of USexpectation elicited by each CS. As a result, this would likely require more extensive training, and thus prolong the study procedure considerably. Furthermore, even though previous studies have trained different CSs for different reinforcement rates, most of these studies have only used one level of US. Thus, in order to not complexify our task to much, we decided to rely on instructions rather than to train CSs for multiple US levels (in addition to multiple reinforcement rates).

      We have tried to clarify our reasoning in the revised version of the manuscript (see introduction, lines 100-113):  

      “The previously discussed fear conditioning and extinction studies have been invaluable for clarifying the role of the threat omission PE within a learning context. However, these studies were not tailored to create the varying intensity and probability-related conditions that are required to systematically evaluate the threat omission PE in the light of the PE axioms. First, these only included one level of aversive outcome: the electrical stimulation was either delivered or omitted; but the intensity of the stimulation was never experimentally manipulated within the same task. As a result, the magnitude-related axiom could not be tested. Second, as safety learning progressively developed over the course of extinction learning, the most informative trials to evaluate the probability axiom (i.e. the trials with the largest PE) were restricted to the first few CS+ offsets of the extinction phase, and the exact number of these informative trials likely differed across participants as a result of individually varying learning rates. This limited the experimental control and necessary variability to systematically evaluate the probability axiom. Third, because CS-US contingencies changed over the course of the task (e.g. from acquisition to extinction), there was never complete certainty about whether the US would (not) follow. This precluded a direct comparison of fully predicted outcomes. Finally, within a learning context, it remains unclear whether brain responses to the threat omission are in fact responses to the violation of expectancy itself, or whether they are the result of subsequent expectancy updating.”

      Again, the authors are free to develop their own task design that they think is best suited to address their experimental questions. For instance, if they truly believe that omission-related responses should be studied independent of updating. The question I'm still left puzzling is why the paper is so strongly framed around extinction (the word appears several times in the main body of the paper), which is a learning process, and yet the authors go out of their way to say that they can only test their hypotheses outside of a learning paradigm. 

      As we have mentioned before, the reason why we refer to extinction studies is because most evidence on threat omission PE to date comes from fear extinction paradigms.  

      The authors did address other areas of concern, to varying extents. Some of these issues were somewhat glossed over in the rebuttal letter by noting them as limitations. For example, the issue with comparing 100% stimulation to 0% stimulation, when the shock contaminates the fMRI signal. This was noted as a limitation that should be addressed in future studies, bypassing the critical point. 

      It is unclear to us what the reviewer means with “bypassing the critical point”. We argued in the manuscript that the contrast we initially specified and preregistered to study axiom 3 (fully predicted outcomes elicit equivalent activation) could not be used for this purpose, as it was confounded by the delivery of the stimulation. Because 100% trials aways included the stimulation and 0% trials never included stimulation, there was no way to disentangle activations related to full predictability from activations related to the stimulation as such.   

      Reviewer #3 (Recommendations For The Authors): 

      I'm not sure the new paragraph explaining why they can't use a learning task to test their hypotheses is very convincing, as I noted in my review. Again, it is not a problem to develop a new task to address their questions. They can justify why they want to use their task without describing (incorrectly in my opinion) that other tasks "generally" are constructed in a way that doesn't suit their needs. 

      For an overview of the changes we made in response to this recommendation, we refer to our reply to the public review.   

      We look forward to your reply and are happy to provide answers to any further questions or comments you may have.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #3:

      Comments on current version:

      As mentioned in my first review, this work is significantly underpowered for the following reasons: 1) n=4 for each treatment group.; 2) no randomization of the surgical sites receiving treatments; 3) implants surgically inserted without precision/guided surgery. The authors have not addressed these concerns.

      On a minor note: not sure why the authors present a methodology to evaluate the dynamic bone formation (line 272) but do not present results (i.e. by means of histomorphometrical analyses) utilizing this methodology.

      We sincerely appreciate your thorough review and valuable feedback. We have carefully considered your comments and would like to address them as follows:

      As mentioned in my first review, this work is significantly underpowered for the following reasons:

      (1) n=4 for each treatment group.;

      We acknowledge your concern regarding the limited sample size (n=4 per group). While we understand this may affect statistical power, our choice was influenced by ethical considerations in animal experimentation and resource constraints. Increasing the sample size would undoubtedly strengthen the statistical power of our study. However, the logistical and ethical constraints associated with using a larger number of animals in such invasive procedures were significant limiting factors. Specifically, increasing the number of medium to large experimental animals could raise ethical issues, so we used the minimum number possible. Additionally, our study design was reviewed and approved by the animal IRB, which dictated the minimum number of animals we could use. Nevertheless, we conducted power analysis to ensure that our sample size, although limited, was sufficient to detect significant differences given the high variability typically observed in biological responses. The results obtained from our n=4 samples showed consistent trends and significant differences between groups, indicating the robustness of our findings. I will include this point in the limitations section of the discussion. Thank you.

      (2) no randomization of the surgical sites receiving treatments;

      Thank you for pointing out this issue. We agree that randomization is essential when considering individual differences and the anatomical variations of the jawbone, such as those found in humans. However, this study is an animal experiment where other conditions were controlled, and the interventions were applied after complete bone healing following tooth extraction. Therefore, the impact of randomization of surgical sites was likely minimal, and it is challenging to determine whether it significantly influenced the experimental results. Of course, twelve female OVX beagles were randomly designated into three groups. (Methods section, line 298) However regarding your concern, we would like to present the robustness of histological results from different surgical sites as shown below. Also we will include this point in the limitations section of the discussion.

      Histologic analysis of the different surgical sites showed significant differences in bone formation and osseointegration among the three treatment groups: vehicle control, rhPTH(1-34), and dimeric Cys25PTH(1-34). Goldner trichrome staining (Figure A-C) showed enhanced bone formation in both the rhPTH(1-34) and dimeric Cys25PTH(1-34) groups compared to the vehicle control group. The rhPTH(1-34) group showed the most pronounced bone mass gain around the implant. Both treatment groups showed improved bone-to-implant contact compared to the control group, as indicated by the red arrows.

      Masson trichrome staining (Figure D-F) further confirmed these results, showing an increase in bone matrix (blue staining) in the rhPTH(1-34) and dimeric Cys25PTH(1-34) groups, with the dimeric rhPTH(1-34) group showing the most extensive and dense bone formation.

      TRAP staining (Figure G-I and G'-I') was used to assess osteoclast activity. Interestingly, both the rhPTH(1-34) and dimeric Cys25PTH(1-34) groups showed an increase in TRAP-positive cells compared to the vehicle control, suggesting enhanced bone remodeling activity. The highest number of TRAP-positive cells was observed in the rhPTH(1-34) group and the highest trabecular number, indicating the most active bone remodeling.

      To summarize the results, histological analyses revealed that both rhPTH(1-34) and dimeric Cys25PTH(1-34) treatments significantly enhanced osseointegration and bone formation around titanium implants in a postmenopausal osteoporosis model compared to the control. The rhPTH(1-34) group demonstrated superior outcomes, exhibiting the most substantial increase in bone volume, bone-to-implant contact, and osteoclastic activity, indicating its greater efficacy in promoting bone regeneration and implant integration in this experimental context.

      Author response image 1.

      Histological analysis using Goldner trichrome, Masson trichrome, and TRAP staining

      (3) implants surgically inserted without precision/guided surgery. The authors have not addressed these concerns.

      The primary purpose of precision guides is to prevent damage to various anatomical structures and to ensure perfect placement at the desired location. Even disregarding the potential inaccuracies of precision guides in actual clinical settings, the primary goal of this animal experiment was not to achieve perfect placement or prevent damage to anatomical structures. Instead, the objective was to histologically measure the integrity of the bone surrounding titanium fixture's platform after pharmacological intervention, ensuring it was fully seated in the alveolar bone. To this end, we secured sufficient visibility through periosteal dissection to confirm the perfect placement of the implant and adhered to the principle of maintaining sufficient mesiodistal distance between each fixture. Using such precision guides in this animal experiment, which is not an evaluation of 'implant precision guides,' could potentially introduce inaccuracies and contradict the experimental objectives. Furthermore, since this experiment was conducted on an edentulous ridge where all teeth had been extracted, achieving the same placement as in the presurgical simulation would be impossible, even with the use of precision guides. Thank you once again for your constructive feedback. We will include this point in the limitations section of the discussion.

      On a minor note: not sure why the authors present a methodology to evaluate the dynamic bone formation (line 272) but do not present results (i.e. by means of histomorphometrical analyses) utilizing this methodology.

      As the reviewer mentioned, we confirmed that the sentence was included in the Methods section despite the analysis not actually being performed. We sincerely apologize for this oversight and will make the necessary corrections immediately. Thank you very much for your keen observation.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      Summary:

      Tian et al. describe how TIPE regulates melanoma progression, stemness, and glycolysis. The authors link high TIPE expression to increased melanoma cell proliferation and tumor growth. TIPE causes dimerization of PKM2, as well as translocation of PKM2 to the nucleus, thereby activating HIF-1alpha. TIPE promotes the phosphorylation of S37 on PKM2 in an ERK-dependent manner. TIPE is shown to increase stem-like phenotype markers. The expression of TIPE is positively correlated with the levels of PKM2 Ser37 phosphorylation in murine and clinical tissue samples. Taken together, the authors demonstrate how TIPE impacts melanoma progression, stemness, and glycolysis through dimeric PKM2 and HIF-1alpha crosstalk.

      Strengths:

      The authors manipulated TIPE expression using both shRNA and overexpression approaches throughout the manuscript. Using these models, they provide strong evidence of the involvement of TIPE in mediating PKM2 Ser37 phosphorylation and dimerization. The authors also used mutants of PKM2 at S37A to block its interaction with TIPE and HIF-1alpha. In addition, an ERK inhibitor (U0126) was used to block the phosphorylation of Ser37 on PKM2. The authors show how dimerization of PKM2 by TIPE causes nuclear import of PKM2 and activation of HIF-1alpha and target genes. Pyridoxine was used to induce PKM2 dimer formation, while TEPP-46 was used to suppress PKM2 dimer formation. TIPE maintains stem cell phenotypes by increasing the expression of stem-like markers. Furthermore, the relationship between TIPE and Ser37 PKM2 was demonstrated in murine and clinical tissue samples.

      Weaknesses:

      The evaluation of how TIPE causes metabolic reprogramming can be better assessed using isotope tracing experiments and improved bioenergetic analysis.

      Thank you immensely for your invaluable suggestions. Regrettably, we encountered a significant obstacle in completing the isotope tracing experiments due to an unfortunate shortage of necessary instruments. Furthermore, despite our efforts to consult with several companies, we were unable to secure their assistance, which unfortunately hindered the completion of these experiments. We deeply apologize for this imperfection in our experimental design and have thoroughly discussed this limitation in our manuscript.

      Additionally, we acknowledge our oversight in the previous versions of our manuscripts, where only three metabolites were presented. To rectify this and provide a more comprehensive understanding of the metabolic reprogramming induced by TIPE, we have conducted routine untargeted metabolomics analysis. We are pleased to announce that we have incorporated the detailed results of this analysis into our work as a new supplementary figure, designated as Figure S3. This figure specifically highlights the notable decrease in the glycolysis pathway, particularly in pyruvate and lactic acid levels, following TIPE interference.

      Reviewer #2 (Public Review):

      In this article, Tian et al present a convincing analysis of the molecular mechanisms underpinning TIPE-mediated regulation of glycolysis and tumor growth in melanoma. The authors begin by confirming TIPE expression in melanoma cell lines and identify "high" and "low" expressing models for functional analysis. They show that TIPE depletion slows tumour growth in vivo, and using both knockdown and over-expression approaches, show that this is associated with changes in glycolysis in vitro. Compelling data using multiple independent approaches is presented to support an interaction between TIPE and the glycolysis regulator PKM2, and the over-expression of TIPE-promoted nuclear translocation of PKM2 dimers. Mechanistically, the authors also demonstrate that PKM2 is required for TIPE-mediated activation of HIF1a transcriptional activity, as assessed using an HRE-promoter reporter assay, and that TIPE-mediated PKM2 dimerization is p-ERK dependent. Finally, the dependence of TIPE activity on PKM2 dimerization was demonstrated on tumor growth in vivo and in the regulation of glycolysis in vitro, and ectopic expression of HIF1a could rescue the inhibition of PKM2 dimerization in TIPE overexpressing cells and reduced induction of general cancer stem cell markers, showing a clear role for HIF1a in this pathway. The main conclusions of this paper are well supported by data, but some aspects of the experiments need clarification and some data panels are difficult to read and interpret as currently presented.

      The detailed mechanistic analysis of TIPE-mediated regulation of PKM2 to control aerobic glycolysis and tumor growth is a major strength of the study and provides new insights into the molecular mechanisms that underpin the Warburg effect in cancer cells. However, despite these strengths, some weaknesses were noted, which if addressed will further strengthen the study.

      (1) The analysis of patient samples should be expanded to more directly measure the relationship between TIPE levels and melanoma patient outcome and progression (primary vs metastasis), to build on the association between TIPE levels and proliferation (Ki67) and hypoxia gene sets that are currently shown.

      Thanks for your suggestions. We have expanded the analysis to include the relationship between TIPE levels and melanoma progression, specifically distinguishing between non-lymph node metastasis and lymph node metastasis. In addition, we added the association between TIPE and Ki67 or LDH levels as your advised, as shown in Figure 7.

      However, the relationship between TIPE levels and melanoma patient outcome is not presented in this article. One reason is that the tissue microarray lack of the survival data. Interestingly, the TCGA dataset showed that the higher TIPE expression has a favorable prognosis for melanoma. We are also very curious about this. Our following study indicated that TIPE might serve as a positive regulator of PD-L1. Therefore, the higher expression of TIPE presents more sensitive tendency to immunotherapy, resulting in a favorable prognosis in melanoma. The detailed mechanisms will be discussed in our following article, and we hope that it might as a continuous research topic for TIPE in melanoma.

      We just only disclose a little information that TIPE shares similar survival and immune signature to PD-L1 and PD-1 in melanoma as following:

      Author response image 1.

      (2) The duration of the in vivo experiments was not clearly defined in the figures, however, it was clear from the tumor volume measurements that they ended well before standard ethical endpoints in some of the experiments. A rationale for this should be provided because longer-duration experiments might significantly change the interpretation of the data. For example, does TIPE depletion transiently reduce or lead to sustained reductions in tumor growth?

      Thanks for your suggestions. Actually, we have performed a pre-experiment before the formal experiments, and all the time points were referred to this. Furthermore, we have added the detailed time points into the figure legends as you suggested.

      (3) The analysis of general cancer stem cell markers is solid and interesting, however inclusion of neural crest stem cell markers that are more relevant to melanoma biology would greatly strengthen this aspect of the study.

      Thanks for your advices. We have selected two neural crest stem cell markers including Nestin and Sox10 to test their expression after overexpression of TIPE in G361 cells or interference of TIPE in A375 cells.

      (4) The authors should take care that all data panels are clearly readable in the figures to facilitate appropriate interpretation by the reader.

      Thanks for your suggestions. We have amended the data panels according to you advises to ensure it is clear and professionally presented.

      Reviewer #1 (Recommendations for the authors):

      It would be suggested to improve the image quality of certain panels (please refer to Fig.1A and Fig.S3B-D).

      Thank you for your expert advice. We have optimized the quality of certain panels according to your suggestions.

      Reviewer #2 (Recommendations for the authors):

      Major comments:

      - TCGA survival/patient outcome data relative to TIPE levels should be provided in the supplementary figures, together with TIPE correlation with PKM2.

      - Suggest revising how this point is described in the discussion.

      We have added the results of TIPE expression and prognosis of melanoma patients from the TCGA database as required by the expert, and discussed it appropriately in the article. In addition, the correlation between TIPE and PKM2 expression has already been described in Supplementary Figure 6.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      We would like to first thank the Editor as well as the three reviewers for their enthusiasm and conducting another careful evaluation of our manuscript. We appreciate their thoughtful and constructive comments and suggestions. Some concerns regarding experimental design, data analysis, and over-interpretation of our findings still remains unresolved after the initial revision. Here we endeavored to address these remaining concerns through further refinement of our writing, and inclusion of these concerns in the discussion session. We hope our response can better explain the rationale of our experimental design and data interpretation. In addition, we also acknowledge the limitations of our present study, so that it will benefit future investigations into this topic. Our detail responses are provided below.

      Reviewer #1 (Public Review):

      This study examines whether the human brain uses a hexagonal grid-like representation to navigate in a non-spatial space constructed by competence and trustworthiness. To test this, the authors asked human participants to learn the levels of competence and trustworthiness for six faces by associating them with specific lengths of bar graphs that indicate their levels in each trait. After learning, participants were asked to extrapolate the location from the partially observed morphing bar graphs. Using fMRI, the authors identified brain areas where activity is modulated by the angles of morphing trajectories in six-fold symmetry. The strength of this paper lies in the question it attempts to address. Specifically, the question of whether and how the human brain uses grid-like representations not only for spatial navigation but also for navigating abstract concepts, such as social space, and guiding everyday decision-making. This question is of emerging importance.

      I acknowledge the authors' efforts to address the comments received. However, my concerns persist:

      Thanks very much again for the re-evaluation and comments. Please find our revision plans to each comment below.

      (1) The authors contend that shorter reaction times correlated with increased distances between individuals in social space imply that participants construct and utilize two-dimensional representations. This method is adapted from a previous study by Park et al. Yet, there is a fundamental distinction between the two studies. In the prior work, participants learned relationships between adjacent individuals, receiving feedback on their decisions, akin to learning spatial locations during navigation. This setup leads to two different predictions: If participants rely on memory to infer relationships, recalling more pairs would be necessary for distant individuals than for closer ones. Conversely, if participants can directly gauge distances using a cognitive map, they would estimate distances between far individuals as quickly as for closer ones. Consequently, as the authors suggest, reaction times ought to decrease with increasing decision value, which, in this context, corresponds to distances. However, the current study allowed participants to compare all possible pairs without restricting learning experiences, rendering the application of the same methodology for testing two-dimensional representations inappropriate. In this study, the results could be interpreted as participants not forming and utilizing two-dimensional representations.

      We apologize for not being clear enough about our task design, we have made relevant changes in the methodology section in the manuscript to make it clearer. The reviewer’s concern is that participants learned about all the pairs in the comparison task which makes the distance effect invalid. We would like to clarify that during all the memory test tasks (the comparison task, the collect task and the recall task outside and inside scanner), participants never received feedback on whether their responses were correct or not. Therefore, the comparison task in our study is similar to the previous study by Park et al. (2021). Participants do not have access to correct responses for all possible pairs of comparison prior to or during this task, they would need to make inference based on memory retrieval.

      (2) The confounding of visual features with the value of social decision-making complicates the interpretation of this study's results. It remains unclear whether the observed grid-like effects are due to visual features or are genuinely indicative of value-based decision-making, as argued by the authors. Contrary to the authors' argument, this issue was not present in the previous study (Constantinescu et al.). In that study, participants associated specific stimuli with the identities of hidden items, but these stimuli were not linked to decision-making values (i.e., no image was considered superior to another). The current study's paradigm is more akin to that of Bao et al., which the authors mention in the context of RSA analysis. Indeed, Bao et al. controlled the length of the bars specifically to address the problem highlighted here. Regrettably, in the current paradigm, this conflation remains inseparable.

      We’d like to thank the reviewer for facilitating the discussion on the question of ‘social space’ vs. ‘sensory space’. The task in scanner did not require value-based decision making. It is akin to both the Bao et al. (2019) study and Constantinescu et al. (2016) study in a sense that all three tasks are trying to ask participants to imagine moving along a trajectory in an abstract, non-physical space and the trajectory is grounded in sensory cue. Participants were trained to associate the sensory cue with abstract (social/nonsocial) concepts. We think that the paradigm is a relatively faithful replication of the study by Constantinescu et al. Nonetheless, we agreed that a design similar to Bao et al. (2019) which controls for sensory confounds would be more ideal to address this concern, or adopting a value-based decision-making task in the scanner similar to that by Park et al. (2021), and we have included this limitation in the discussion section.

      (3) While the authors have responded to comments in the public review, my concerns noted in the Recommendation section remain unaddressed. As indicated in my recommendations, there are aspects of the authors' methodology and results that I find difficult to comprehend. Resolving these issues is imperative to facilitate an appropriate review in subsequent stages.

      Considering that the issues raised in the previous comments remain unresolved, I have retained my earlier comments below for review.

      We apologize for not addressing the recommendations properly, please find detailed our response and plans for revision.

      I have some comments. I hope that these can help.

      (1) While the explanation of Fig.4A-C is lacking in both the main text and figure legend, I am not sure if I understand this finding correctly. Did the authors find the effects of hexagonal modulation in the medial temporal gyrus and lingual gyrus correlate with the individual differences in the extent to which their reaction times were associated with the distances between faces when choosing a better collaborator? If so, I am not sure what argument the authors try to draw from these findings. Do the authors argue that these brain areas show hexagonal modulation, which was not supported in the previous analysis (Fig.3)? What is the level of correlation between these behavioral measures and the grid consistency effects in the vmPFC and EC, where the authors found actual grid-like activity? How do the authors interpret this finding? More importantly, how does this finding associate with other findings and the argument of the study?

      We apologize for not being clear enough in the manuscript and we will improve the clarity in our revision. This exploratory analysis reported in Figure 4 aims to use whole-brain analysis to examine: 1) if there is any correlation between the strength of grid-like representation of social value map and behavioral indicators of map-like representation; and 2) if there are any correlation between the strength of grid-like representation of this social value map and participants’ social trait.

      To be more specific, for the behavioral indicator, we used the distance effect in the reaction time of the comparison task outside the scanner. We interpreted stronger distance effect as a behavioral index of having better internal map-like representation. We interpreted stronger grid consistency effect as a neural index of better representation of the 2D social space. Therefore, we’d like to see if there exists correlation between behavioral and neural indices of map-like representation.

      To achieve this goal, behavioral indicators are entered as covariates in second-level analysis of the GLM testing grid consistency effect (GLM2). Figure3 showed results from GLM2 without the covariates. Figure4 showed results of clusters whose neural indices of map-like representation covaried with that from behavior and survived multiple-comparison correction. Indeed, in these regions, the grid consistency effect was not significant at group level (so not shown in Figure 3). We tried to interpret this finding in our discussion (line 374-289 for temporal lobe correlation, line 395-404 for precuneus correlation).

      Finally, we would like to point out that including the covariates in GLM2 did not change results in Figure3, the clusters in Figure3 still survives correction. Meanwhile, these clusters in Figure 3 did not show correlation with behavioral indicators of map-like representation.

      Author response image 1.

      (2) There are no behavioral results provided. How accurately did participants perform each of the tasks? How are the effects of grid consistency associated with the level of accuracy in the map test?

      Why did participants perform the recall task again outside the scanner?

      We will endeavor to improve signposting the corresponding figures in the main text. For the behavioral results, we reported the stats in section “Participants construct social value map after associative learning of avatars and corresponding characteristics” in the main text, and the plots are shown in Figure 1. Particularly, figure 1F showed accuracy of tasks in training, as well as the recall task in the scanner. For the correlation, we did not find significant correlation between behavioural accuracy and grid consistency effect. We will make it clearer in the result section.

      (3) The methods did not explain how the grid orientation was estimated and what the regressors were in GLM2. I don't think equations 2 and 3 are quite right.

      For the grid orientation estimation method, we provided detailed description in the Supplementary methods 2.2.2. We will add links to this section in the main text.

      Equation 2 and 3 describes how the parametric regressors entered into GLM2 were formed and provided prerequisites on calculation of grid orientations. Equation 2 was the results of directly applying the angle addition and subtraction theorems so they should be correct. We will try to make the rationale clearer in the supplementary text.

      (4) With the increase in navigation distances, more grid cells would activate. Therefore, in theory, the activity in the entorhinal cortex should increase with the Euclidean distances, which has not been found here. I wonder if there was enough variability in the Euclidean distances that can be captured by neural correlates. This would require including the distributions of Euclidean distances according to their trajectory angles. Regarding how Fig.1E is generated, I don't understand what this heat map indicates. Additionally, it needs to be confirmed if the grid effects remain while controlling for the Euclidean distances of navigation trajectories.

      We did not specifically control for the trajectory length, we only controlled for the distribution of trajectory to be uniform. We have included a figure of the distribution of Euclidean distances in Figure S9 and the distribution of trajectory direction in Figure S8.

      Author response image 2.

      As for Figure 1E, we aim to reproduce the findings from Figure 1F in Constantinescu et al. (2016) where they showed that participants progressively refined the locations of the outcomes through training. We divided the space into 15×15 subregions and computed the amount of time spent in each subregion and plotted Figure 1E. Brighter color in Figure 1E indicate greater amount of time spent in the corresponding subregion. Note that all these timing indices were computed as a percentage of the total time spent in the explore task in a given session. If participants were well-acquainted with the space and avatars, they would spend more time at the avatar (brighter color in avatar locations) in the review session compared to the learning session.

      As for the effect of distances on grid-like representation, we did not include the distance as a parametric modulator in grid consistency effect GLM (GLM2) due to insufficient trials in each bin (6-8 trials). But there is side evidence that could potentially rule out this confound. In the distance representation analysis, we did not find distance representation in any of the clusters that have significant grid-like representation (regions in Figure 2).

      Reviewer #2 (Public Review):

      Summary:

      In this work, Liang et al. investigate whether an abstract social space is neurally represented by a grid-like code. They trained participants to 'navigate' around a two-dimensional space of social agents characterized by the traits warmth and competence, then measured neural activity as participants imagined navigating through this space. The primary neural analysis consisted of three procedures: 1) identifying brain regions exhibiting the hexagonal modulation characteristic of a grid-like code, 2) estimating the orientation of each region's grid, and 3) testing whether the strength of the univariate neural signal increases when a participant is navigating in a direction aligned with the grid, compared to a direction that is misaligned with the grid. From these analyses, the authors find the clearest evidence of a grid-like code in the prefrontal cortex and weaker evidence in the entorhinal cortex.

      Strengths:

      The work demonstrates the existence of a grid-like neural code for a socially-relevant task, providing evidence that such coding schemes may be relevant for a variety of two-dimensional task spaces.

      Weaknesses:

      In the revised manuscript, the authors soften their claims about finding a grid code in the entorhinal cortex and provide additional caveats about limitations in their findings. It seems that the authors and reviewers are in agreement about the following weaknesses, which were part of my original review: Claims about a grid code in the entorhinal cortex are not well-supported by the analyses presented. The whole-brain analysis does not suggest that the entorhinal cortex exhibits hexagonal modulation; the strength of the entorhinal BOLD signal does not track the putative alignment of the grid code there; multivariate analyses do not reveal any evidence of a grid-like representational geometry.

      In the authors' response to reviews, they provide additional clarification about their exploratory analyses examining whether behavior (i.e., reaction times) and individual difference measures (i.e., social anxiety and avoidance) can be predicted by the hexagonal modulation strength in some region X, conditional on region X having a similar estimated grid alignment with some other region Y. My guess is that readers would find it useful if some of this language were included in the main text, especially with regard to an explanation regarding the rationale for these exploratory studies.

      Thank you very much again for your careful re-evaluation and suggestions. We have tried to improve our writing and incorporate the suggestions in the new revision.

      Reviewer #3 (Public Review):

      Liang and colleagues set out to test whether the human brain uses distance and grid-like codes in social knowledge using a design where participants had to navigate in a two-dimensional social space based on competence and warmth during an fMRI scan. They showed that participants were able to navigate the social space and found distance-based codes as well as grid-like codes in various brain regions, and the grid-like code correlated with behavior (reaction times).

      On the whole, the experiment is designed appropriately for testing for distant-based and grid-like codes, and is relatively well powered for this type of study, with a large amount of behavioral training per participant. They revealed that a number of brain regions correlated positively or negatively with distance in the social space, and found grid-like codes in the frontal polar cortex and posterior medial entorhinal cortex, the latter in line with prior findings on grid-like activity in entorhinal cortex. The current paper seems quite similar conceptually and in design to previous work, most notably Park et al., 2021, Nature Neuroscience.

      (1) The authors claim that this study provides evidence that humans use a spatial / grid code for abstract knowledge like social knowledge.

      This data does specifically not add anything new to this argument. As with almost all studies that test for a grid code in a similar "conceptual" space (not only the current study), the problem is that, when the space is not a uniform, square/circular space, and 2-dimensional then there is no reason the code will be perfectly grid like, i.e., show six-fold symmetry. In real world scenarios of social space (as well as navigation, semantic concepts), it must be higher dimensional - or at least more than two dimensional. It is unclear if this generalizes to larger spaces where not all part of the space is relevant. Modelling work from Tim Behrens' lab (e.g., Whittington et al., 2020) and Bradley Love's lab (e.g., Mok & Love, 2019) have shown/argued this to be the case. In experimental work, like in mazes from the Mosers' labs (e.g., Derdikman et al., 2009), or trapezoid environments from the O'Keefe lab (Krupic et al., 2015), there are distortions in mEC cells, and would not pass as grid cells in terms of the six-fold symmetry criterion.

      The authors briefly discuss the limitations of this at the very end but do not really say how this speaks to the goal of their study and the claim that social space or knowledge is organized as a grid code and if it is in fact used in the brain in their study and beyond. This issue deserves to be discussed in more depth, possibly referring to prior work that addressed this, and raise the issue for future work to address the problem - or if the authors think it is a problem at all.

      Thanks very much again for your careful re-evaluation and comments. We have tried to incorporate some of the suggested papers into our discussion. In summary, we agree that there is more to six-fold symmetric code that can be utilized to represent “conceptual space”. We think that the next step for a stronger claim would be to find the representation of more spontaneous non-spatial maps.

      References

      Bao, X., Gjorgieva, E., Shanahan, L. K., Howard, J. D., Kahnt, T., & Gottfried, J. A. (2019). Grid-like Neural Representations Support Olfactory Navigation of a Two-Dimensional Odor Space. Neuron, 102(5), 1066-1075 e1065. https://doi.org/10.1016/j.neuron.2019.03.034

      Constantinescu, A. O., O'Reilly, J. X., & Behrens, T. E. J. (2016). Organizing conceptual knowledge in humans with a gridlike code. Science, 352(6292), 1464-1468. https://doi.org/10.1126/science.aaf0941

      Park, S. A., Miller, D. S., & Boorman, E. D. (2021). Inferences on a multidimensional social hierarchy use a grid-like code. Nat Neurosci, 24(9), 1292-1301. https://doi.org/10.1038/s41593-02100916-3

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      Summary:

      This paper presents a compelling and comprehensive study of decision-making under uncertainty. It addresses a fundamental distinction between belief-based (cognitive neuroscience) formulations of choice behavior with reward-based (behavioral psychology) accounts. Specifically, it asks whether active inference provides a better account of planning and decision making, relative to reinforcement learning. To do this, the authors use a simple but elegant paradigm that includes choices about whether to seek both information and rewards. They then assess the evidence for active inference and reinforcement learning models of choice behavior, respectively. After demonstrating that active inference provides a better explanation of behavioral responses, the neuronal correlates of epistemic and instrumental value (under an optimized active inference model) are characterized using EEG. Significant neuronal correlates of both kinds of value were found in sensor and source space. The source space correlates are then discussed sensibly, in relation to the existing literature on the functional anatomy of perceptual and instrumental decision-making under uncertainty.

      We are deeply grateful for your careful review of our work and your suggestions. Your insights have helped us identify areas where we can strengthen the arguments and clarify the methodology. We hope to apply the idea of active inference to our future work, emphasizing the integrity of perception and action.

      Reviewer #1 (Recommendations For The Authors):

      Many thanks for attending to my previous suggestions. I think your presentation is now much clearer and nicely aligned with the active inference literature.

      There is one outstanding issue. I think you have overinterpreted the two components of epistemic value in Equation 8. The two components that you have called the value of reducing risk and the value of reducing ambiguity are not consistent with the normal interpretation. These two components are KL divergences that measure the expected information gain about parameters and states respectively.

      If you read the Schwartenbeck et al paper carefully, you will see that the first (expected information gain about parameters) is usually called novelty, while the second (expected information gain about states) is usually called salience.

      This means you can replace "the value of reducing ambiguity" with "novelty" and "the value of reducing risk" with "salience".

      For your interest, "risk" and "ambiguity" are alternative ways of decomposing expected free energy. In other words, you can decompose expected free energy into (negative) expected information gain and expected value (as you have done). Alternatively, you can rearrange the terms and express expected free energy as risk and ambiguity. Look at the top panel of Figure 4 in:

      https://www.sciencedirect.com/science/article/pii/S0022249620300857

      I hope that this helps.

      We deeply thank you for your recommendations about the interpretation of the epistemic value in Equation 8. We have now corrected them to Novelty and Salience:

      In addition, in order to avoid terminology conflicts with active inference and to describe these two different uncertainties, we replaced Ambiguity in the article with Novelty, referring to the uncertainty that can be reduced by sampling, and replaced Risk with Variability, referring to the uncertainty inherent in the environment (variance).

      Reviewer # 2 (Public Review):

      Summary:

      Zhang and colleagues use a combination of behavioral, neural, and computational analyses to test an active inference model of exploration in a novel reinforcement learning task..

      Strengths:

      The paper addresses an important question (validation of active inference models of exploration). The combination of behavior, neuroimaging, and modeling is potentially powerful for answering this question.

      I appreciate the addition of details about model fitting, comparison, and recovery, as well as the change in some of the methods.

      We are deeply grateful for your careful review of our work and your suggestions. And we are also very sorry that in our last responses, there were a few suggestions from you that we did not respond them appropriately in our manuscript. We hope to be able to respond to these suggestions well in this revision. Thank you for your contribution to ensuring the scientificity and reproducibility of the work.

      The authors do not cite what is probably the most relevant contextual bandit study, by Collins & Frank (2018, PNAS), which uses EEG.

      The authors cite Collins & Molinaro as a form of contextual bandit, but that's not the case (what they call "context" is just the choice set). They should look at the earlier work from Collins, starting with Collins & Frank (2012, EJN).

      We deeply thank you for your comments. Now we add the relevant citations in the manuscript (line 46):

      “These studies utilized different forms of multi-armed bandit tasks, e.g the restless multi-armed bandit tasks (Daw et al., 2006; Guha et al., 2010), risky/safe bandit tasks (Tomov et al., 2020; Fan et al., 2022; Payzan et al., 2013), contextual multi-armed bandit tasks (Collins & Frank, 2018; Schulz et al., 2015; Collins & Frank, 2012)”

      Daw, N. D., O'doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095), 876-879.

      Guha, S., Munagala, K., & Shi, P. (2010). Approximation algorithms for restless bandit problems. Journal of the ACM (JACM), 58(1), 1-50.

      Tomov, M. S., Truong, V. Q., Hundia, R. A., & Gershman, S. J. (2020). Dissociable neural correlates of uncertainty underlie different exploration strategies. Nature communications, 11(1), 2371.

      Fan, H., Gershman, S. J., & Phelps, E. A. (2023). Trait somatic anxiety is associated with reduced directed exploration and underestimation of uncertainty. Nature Human Behaviour, 7(1), 102-113.

      Payzan-LeNestour, E., Dunne, S., Bossaerts, P., & O’Doherty, J. P. (2013). The neural representation of unexpected uncertainty during value-based decision making. Neuron, 79(1), 191-201.

      Collins, A. G., & Frank, M. J. (2018). Within-and across-trial dynamics of human EEG reveal cooperative interplay between reinforcement learning and working memory. Proceedings of the National Academy of Sciences, 115(10), 2502-2507.

      Schulz, E., Konstantinidis, E., & Speekenbrink, M. (2015, April). Exploration-exploitation in a contextual multi-armed bandit task. In International conference on cognitive modeling (pp. 118-123).

      Collins, A. G., & Frank, M. J. (2012). How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. European Journal of Neuroscience, 35(7), 1024-1035.

      Placing statistical information in a GitHub repository is not appropriate. This needs to be in the main text of the paper. I don't understand why the authors refer to space limitations; there are none for eLife, as far as I'm aware.

      We deeply thank you for your comments. We calculated the average t-value of the brain regions with significant results over the significant time, and added the t-value results to the main text and supplementary materials.

      In answer to my question about multiple comparisons, the authors have added the following: "Note that we did not attempt to correct for multiple comparisons; largely, because the correlations observed were sustained over considerable time periods, which would be almost impossible under the null hypothesis of no correlations." I'm sorry, but this does not make sense. Either the authors are doing multiple comparisons, in which case multiple comparison correction is relevant, or they are doing a single test on the extended timeseries, in which case they need to report that. There exist tools for this kind of analysis (e.g., Gershman et al., 2014, NeuroImage). I'm not suggesting that the authors should necessarily do this, only that their statistical approach should be coherent. As a reference point, the authors might look at the aforementioned Collins & Frank (2018) study.

      We deeply thank you for your comments. We have now replaced all our results with the results after false discovery rate correction and added relevant descriptions (line 357,358):

      “The significant results after false discovery rate (FDR) (Benjamini et al., 1995, Gershman et al., 2014) correction were shown in shaded regions. Additional regression results can be found in Supplementary Materials.”

      Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1), 289-300.

      Gershman, S. J., Blei, D. M., Norman, K. A., & Sederberg, P. B. (2014). Decomposing spatiotemporal brain patterns into topographic latent sources. NeuroImage, 98, 91-102.

      After FDR correction, our results have changed slightly. We have updated our Results and Discussion section.

      It should be acknowledged that the changes in these results may represent a certain degree of error in our data (perhaps because the EEG data is too noisy or because of the average template we used, ‘fsaverage’). Therefore, we added relevant discussion in the Discussion section (line527-529):

      “It should be acknowledged that our EEG-based regression results are somewhat unstable, and the brain regions with significant regression are inconsistent before and after FDR correction. In future work, we should collect more precise neural data to reduce this instability.”

      I asked the authors to show more descriptive comparison between the model and the data. Their response was that this is not possible, which I find odd given that they are able to use the model to define a probability distribution on choices. All I'm asking about here is to show predictive checks which build confidence in the model fit. The additional simulations do not address this. The authors refer to figures 3 and 4, but these do not show any direct comparison between human data and the model beyond model comparison metrics.

      We deeply thank you for your comments. We now compare the participants’ behavioral data and the model’s predictions trial by trial (Figure 5). We can clearly see the participants’ behavioral strategies in different states and trials and the model’s prediction accuracy. We have added the discussion related to Figure 5 (line 309-318):

      “Figure 5 shows the comparison between the active inference model and the behavioral data, where we can see that the model can fit the participants behavioral strategies well. In the “Stay-Cue" choice, participants always tend to choose to ask the ranger and rarely choose not to ask. When the context was unknown, participants chose the “Safe" option or the “Risky" option very randomly, and they did not show any aversion to variability. When given “Context 1", where the “Risky" option gave participants a high average reward, participants almost exclusively chose the “Risky" option, which provided more information in the early trials and was found to provide more rewards in the later rounds. When given “Context 2", where the “Risky" option gave participants a low average reward, participants initially chose the “Risky" option and then tended to choose the “Safe" option. We can see that participants still occasionally chose the “Risky" option in the later trials of the experiment, which the model does not capture. This may be due to the influence of forgetting. Participants chose the “Risky" option again to establish an estimate of the reward distribution.”

      Reviewer # 2 (Recommendations For The Authors):

      In the supplement, there are missing references ("[?]").

      Thank you very much for pointing out this. We have now fixed this error.

      Reviewer # 3 (Public review):

      Summary:

      This paper aims to investigate how the human brain represents different forms of value and uncertainty that participate in active inference within a free-energy framework, in a two-stage decision task involving contextual information sampling, and choices between safe and risky rewards, which promotes shifting between exploration and exploitation. They examine neural correlates by recording EEG and comparing activity in the first vs second half of trials and between trials in which subjects did and did not sample contextual information, and perform a regression with free-energy-related regressors against data "mapped to source space."

      Strengths:

      This two-stage paradigm is cleverly designed to incorporate several important processes of learning, exploration/exploitation and information sampling that pertain to active inference. Although scalp/brain regions showing sensitivity to the active-inference related quantities do not necessary suggest what role they play, they are illuminating and useful as candidate regions for further investigation. The aims are ambitious, and the methodologies impressive. The paper lays out an extensive introduction to the free energy principle and active inference to make the findings accessible to a broad readership.

      Weaknesses:

      In its revised form the paper is complete in providing the important details. Though not a serious weakness, it is important to note that the high lower-cutoff of 1 Hz in the bandpass filter, included to reduce the impact of EEG noise, would remove from the EEG any sustained, iteratively updated representation that evolves with learning across trials, or choice-related processes that unfold slowly over the course of the 2-second task windows.

      We are deeply grateful for your careful review of our work and your suggestions. We are very sorry that we did not modify our filter frequency (it would be a lot of work to modify it). Thank you very much for pointing this out. We noticed the shortcoming of the high lower-cutoff of 1 Hz in the bandpass filter. We will carefully consider the filter frequency when preprocessing data in future work. Thank you very much!

    1. Author Response

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The proposed study provides an innovative framework for the identification of muscle synergies taking into account their task relevance. State-of-the-art techniques for extracting muscle interactions use unsupervised machine-learning algorithms applied to the envelopes of the electromyographic signals without taking into account the information related to the task being performed. In this work, the authors suggest including the task parameters in extracting muscle synergies using a network information framework previously proposed. This allows the identification of muscle interactions that are relevant, irrelevant, or redundant to the parameters of the task executed.

      The proposed framework is a powerful tool to understand and identify muscle interactions for specific task parameters and it may be used to improve man-machine interfaces for the control of prostheses and robotic exoskeletons.

      With respect to the network information framework recently published, this work added an important part to estimate the relevance of specific muscle interactions to the parameters of the task executed. However, the authors should better explain what is the added value of this contribution with respect to the previous one, also in terms of computational methods.

      It is not clear how the well-known phenomenon of cross-talk during the recording of electromyographic muscle activity may affect the performance of the proposed technique and how it may bias the overall outcomes of the framework.

      We thank reviewer 1 for their useful commentary on this manuscript.

      Reviewer #2 (Public Review):

      This paper is an attempt to extend or augment muscle synergy and motor primitive ideas with task measures. The authors idea is to use information metrics (mutual information, co-information) in 'synergy' constraint creation that includes task information directly. By using task related information and muscle information sources and then sparsification, the methods construct task relevant network communities among muscles, together with task redundant communities, and task irrelevant communities. This process of creating network communities may then constrain and help to guide subsequent synergy identification using the authors published sNM3F algorithm to detect spatial and temporal synergies.

      The revised paper is much clearer and examples are helpful in various ways. However, figure 2 as presented does not convincingly show why task muscle mutual information helps in separating synergies, though it is helpful in defining the various network communities used in the toy example.

      The impact of the information theoretic constraints developed as network communities on subsequent synergy separation are posited to be benign and to improve over other methods (e.g., NNMF). However, not fully addressed are the possible impacts of the methods on compositionality links with physiological bases, and the possibility remains of the methods sometimes instead leading to modules that represent more descriptive ML frameworks that may not support physiological work easily. Accordingly, there is a caveat. This is recognized and acknowledged by the authors in their rebuttal of the prior review. It will remain for other work to explore this issue, likely through testing on detailed high degree of freedom artificial neuromechanical models and tasks. This possible issue with the strategy here likely needs to be fully acknowledged in the paper.

      The approach of the methods seeks to identify task relevant coordinative couplings. This is a meta problem for more classical synergy analyses. Classical analyses seek compositional elements stable across tasks. These elements may then be explored in causal experiments and generative simulations of coupling and control strategies. However, task-based understanding of synergy roles and functional uses is significant and is clearly likely to be aided by methods in this study.

      Information based separation has been used in muscle synergy analyses using infomax ICA, which is information based at core. Though linear mixing of sources is assumed in ICA, minimized mutual information among source (synergy) drives is the basis of the separation and detects low variance synergy contributions (e.g., see Yang, Logan, Giszter, 2019). In the work in this paper, instead, mutual information approaches are used to cluster muscles and task features into network communities preceding the SNM3F algorithm use for separation, rather than using minimized information in separation. This contrast of an accretive or agglomerative mutual information strategy here used to cluster into networks, versus a minimizing mutual information source separation used in infomax ICA epitomizes a key difference in approach here.

      Physiological causal testing of synergy ideas is neglected in the literature reviews in the paper. Although these are only in animal work (Hart and Giszter, 2010; Takei and Seki, 2017), the clear connection of muscle synergy analysis choices to physiology is important, and eventually these issues need to be better managed and understood in relation to the new methods proposed here, even if not in this paper.

      Analyses of synergies using the methods the paper has proposed will likely be very much dependent on the number and quality of task variables included and how these are managed, and the impacts of these on the ensuing sparsification and network communities used prior to SNM3F. The authors acknowledge this in their response. This caveat should likely be made very explicit in the paper.

      It would be useful in the future to explore the approach described with a range of simulated data to better understand the caveats, and optimizations for best practices in this approach.

      A key component of the reviewers’ arguments here is their reductionist view of muscle synergies vs the emergentist view presented in our work here. In the reductionist lens, muscle groupings are the units (‘building blocks’) of coordinated movement and thus the space of intermuscular interactions is of particular interest for understanding movement construction. On the other hand, the emergentist view suggests that muscle groupings emerge from interactions between constituent parts (as quantified here using information theory, synergistic information is the information found when both activities are observed together). This is in line with recent work in the field showing modular control at the intramuscular level, exemplifying a scale-free phenomena. Nonetheless, we consider these approaches to muscle synergy research as complementary and beneficial for the field overall going forward.

      Reviewer #3 (Public Review):

      In this study, the authors developed and tested a novel framework for extracting muscle synergies. The approach aims at removing some limitations and constraints typical of previous approaches used in the field. In particular, the authors propose a mathematical formulation that removes constraints of linearity and couples the synergies to their motor outcome, supporting the concept of functional synergies and distinguishing the task-related performance related to each synergy. While some concepts behind this work were already introduced in recent work in the field, the methodology provided here encapsulates all these features in an original formulation providing a step forward with respect to the currently available algorithms. The authors also successfully demonstrated the applicability of their method to previously available datasets of multi-joint movements.

      Preliminary results positively support the scientific soundness of the presented approach and its potential. The added values of the method should be documented more in future work to understand how the presented formulation relates to previous approaches and what novel insights can be achieved in practical scenarios and confirm/exploit the potential of the theoretical findings.

      In their revision, the authors have implemented major revisions and improved their paper. The work was already of good quality and now it has improved further. The authors were able to successfully:

      • improve the clarity of the writing (e.g.: better explaining the rationale and the aims of the paper);

      • extend the clarification of some of the key novel concepts introduced in their work, like the redundant synergies;

      • show a scenario in which their approach might be useful for increasing the understanding of motor control in patients with respect to traditional algorithms such as NMF. In particular, their example illustrates why considering the task space is a fundamental step forward when extracting muscle synergies, improving the practical and physiological interpretation of the results.

      We thank reviewer 3 for their constructive commentary on this manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Figure 3 should report the distances between reaching points in panel A and the actual length distances of the walking paths in panel C.

      The caption of fig.3 concerning the experimental setup of the datasets analysed has been updated with the following for dataset 1: “(A) Dataset 1 consisted of participants executing table-top point-to-point reaching movements (40cm distance from starting point P0) across four targets in forward (P1-P4) and backwards (P5-P8) directions at both fast and slow speeds (40 repetitions per task) [25]. The muscles recorded included the finger extensors (FE), brachioradialis (BR), biceps brachii (BI), medial-triceps (TM), lateral-triceps (TL), anterior deltoid (AD), posterior deltoid (PD), pectoralis major (PE), latissimus dorsi (LD) of the right, reaching arm.”. For dataset 3, to the best of the authors knowledge, this information was not given in the original paper.

      Figure 4, what is the unit of the data shown?

      The unit of bits is now mentioned in the toy example figure caption and in the caption of fig.5

      Figure 4, the characteristics of the interactions are not fully clear, and the graphical representation should be improved.

      We have made steps to improve the clarity of the figures presented.

      For dataset 3, τ was the movement kinematics, but it is not specified how the task parameters were formulated. Did the authors use the data from all 32 kinematic markers, 4 IMUs, and force plates? If yes, it should be specified why all these signals were used. For sure, there will be signals included that are not relevant to the specific task. Did the authors select specific signals based on their relevance to the task (e.g., ankle kinematics)?

      We have now clarified this in the text as follows: “For datasets 1 and 2, we determine the MI between vectors with respect to several discrete task parameters representing specific task attributes (e.g. reaching direction, speed etc.), while for dataset 3 we determined the task-relevant and -irrelevant muscles couplings in an unassuming way by quantifying them with respect to all available kinematic, dynamic and inertial motion unit (IMU) features.”

      How did the authors endure that crosstalk did not affect their analysis, particularly between, e.g., finger extensors and brachioradialis and posterior deltoid and anterior deltoid (dataset 1)?

      We have addressed this point in the previous round of reviews and made an explicit statement regarding cross-talk in the discussion section: “Although distinguishing task-irrelevant muscle couplings may capture artifacts such as EMG crosstalk, our results convey several physiological objectives of muscles including gross motor functions [66], the maintenance of internal joint mechanics and reciprocal inhibition of contralateral limbs [19,51].”

      It would be informative to add some examples of not trivial/obvious task-related synergistic muscle combinations that have been extracted in the three datasets. Most of the examples reported in the manuscript are well-known biomechanically and quite intuitive, so they do not improve our understanding of synergistic muscle control in humans.

      Our framework improves our understanding of synergistic motor control by enabling the formal quantification of synergistic muscle interactions, a capability not present among current approaches. Regarding the implications of this advance in terms of concrete examples, we have further clarified our examples presented in the results section, for example:

      “Across datasets, many the muscle networks could be characterised by the transmission of complementary task information between functionally specialised muscle groups, many of which identified among the task-redundant representations (Fig.9-10 and Supp. Fig.2). The most obvious example of this is the S3 synergist muscle network of dataset 2 (Fig.11), which captures the complementary interaction between task-redundant submodules identified previously (S3 (Fig.9)).”

      The description shows how our framework can extract the cross-module interactions that align with the higher-level objectives of the system, here the synergistic connectivity between the upper and lower body modules. Current approaches can only capture redundant and task-irrelevant interactions. Thus our framework provides additional insight into movement control.

      The number of participations in dataset 2 is very limited and should be increased. We appreciate the reviewer's comment and would like to point out that for dataset 2 our aim was to increase the number of muscles (30), tasks (72) and trials for each task (30) which produced a very large dataset for each participant. This came at the expense of low number of participants, however all our statistical analyses here can be performed at the single-participant level. Furthermore, dataset 3 includes 25 participants and it enables us to demonstrate the reliability of the findings across participants.

      Reviewer #2 (Recommendations For The Authors):

      I believe it is important in the future to explore the approach proposed with a range of simulation data and neuromechanical models, to explore the issues I have raised and that you have acknowledged, though I agree it is likely out of scope for the paper here.

      We agree with the reviewer that this would be valuable future work and indeed plan to do this in our future research.

      The Github code for this paper should likely include the various data sets used in the paper and figures, appropriately anonymized, in order to allow the data to be explored and analyses replicated and package demonstrated to be exercised fully by a new user.

      We thank the reviewer for this suggestion. Dataset3 is already available online at https://doi.org/10.1016/j.jbiomech.2021.110320. We will also make the other 2 datasets publicly available on our lab website very soon. Until then, as stated in the manuscript, we will make them available to anyone upon reasonable request.

      Reviewer #3 (Recommendations For The Authors):

      I have the following open points to suggest to the authors:

      First, I recommend improving the quality of the figures: in the pdf version I downloaded, some writings are impossible to read.

      We fully agree with the reviewer and note that in the pdf version of the paper, the figures are a lot worse than in the submitted word document submitted. Nevertheless, we will make further improvements on the figures as requested.

      Even though the manuscript has improved, I still feel that some points were not addressed or were only partially addressed. In particular:

      • The proposed comparison with NMF helps understanding why incorporating the task space is useful (and I fully agree with the authors about this point as the main reason to propose their contribution). However, the comparison does not help the reader to understand whether the synergies incorporating the task space are biased by the introduction of the task variables.

      This question can be also reformulated as: are muscle synergies modified when task space variables are incorporated? Is the "weight" on task coefficients affecting the composition of muscle synergies? If so, the added interpretational power is achieved at the cost of losing the information regarding the neural substrate of synergies? I understand this point is not immediate to show, but it would increase the quality of the work.

      • Reference to previous approaches that aimed at including task variables into synergy extraction are still missing in the paper. Even though it is not required to provide quantitative comparisons with other available approaches, there are at most 2-3 available algorithms in the literature (kinematics-EMG; force-EMG), that should not be neglected in this work. What did previous approaches achieve? What was improved with this approach? What was not improved?

      Previous attempts of extracting synergies with non-linear approaches could also be described more.

      In the latest version of the manuscript, we have referenced both the mixed NMF and autoencoders based algorithms. In both the introduction and discussion section of the manuscript, we also specify that our framework quantifies and decomposes muscle interactions in a novel way that cannot be done by other current approaches. In the results section we use examples from 3 different datasets to make this point clear, providing intuition on the use cases of our framework.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer 1:

      Comments on revisions:

      This manuscript is in some ways improved - mainly by toning down the conclusions - but a few major weaknesses have not been addressed. I do not agree that it is not justified to perform experiments to investigate the sterility of single CDK8 knockout mice since this could be important and given that the new data show that while there is some overlap in expression of the two prologues, there are also significant differences in the testis. At the least, it would have been interesting and easy to do to show the expression of CDK8 and CDK19 in the single cell transcriptomics, since this might help to identify the different populations.

      Certainly, we tried to analyse Cdk8/Cdk19 in single cell transcriptomics. However, we were unable to draw a clear conclusion. Due to a limited sensitivity of single cell sequencing, especially for low abundant transcripts, such as transcription factors (for 10x technology used in our study) (Chuang et al., 2024), it is challenging to establish with certainty CDK8/19 positive and -negative tissues from single cell data because both transcripts are minor. Nevertheless, the majority of cell types showed some expression of CDK8/19, with maximum expression in pachytene/diplotene spermatocytes. We do not include these data to the manuscript particularly as we were successful to assess Cdk8/19 expression patterns using IF approaches.

      Author response image 1.

      The only definitive way of concluding a kinase-independent phenotype is to rescue with a kinase dead mutant. While I agree that the inhibitors have been well validated, since they did not have any effects, it is hard to be sure that they actually reached their targets in the tissue concerned. This could have been done by cell thermal shift assay. In the absence of any data on this, the conclusion of a kinase-independent effect is weak.

      We totally agree with this point, but it takes several years to produce mice with inducible expression of KD CDK8 mice on the DKO background. These experiments are already underway in our lab, however, their results will be published in our future works.

      Figure 2 legend includes (G) between (B) and (C), and appears to, in fact, refer to Fig 1E, for which the legend is missing the description.

      Thank you, we corrected this.

      Finally, Figure S1C appears wrong. Goblet cells are not in the crypt but on the villi (so the graph axis label is wrong), and there are normally between 5 and 15 per villus, so the iDKO figure is normal, but there are a surprisingly high number of goblet cells in the controls. And normally there are 10-15 Paneth cells/crypt, so it looks like these have been underestimated everywhere. I wonder how the counting was done - if it is from images such as those shown here then I am not surprised as the quality is insufficient for quantification. How many crypts and villi were counted? Given the difficulty in counting and the variability per crypt/villus, with quantitative differences like this it is important to do quantifications blind. I personally wouldn't conclude anything from this data and I would recommend to either improve it or not include it. If these data are shown, then data showing efficient double knockout in this tissue should also accompany it, by IF, Western or PCR. Otherwise, given a potentially strong phenotype, repopulation of the intestine by unrecombined crypts might have occurred - this is quite common (see Ganuza et al, EMBO J. 2012).

      We added fig. S1C with Western blot showing presence of CDK8 and CCNC in WT intestine and  their absence in the DKO intestine. We also corrected that the part of the intestine analyzed was the duodenum, not ileum. We also replaced intestine sections photos with the ones of better quality and higher magnification (200X) and corrected Y axis legend. We apologize for the confusion, and thank the reviewer for careful analysis of our data, which allowed us to make this correction. The numbers of cells were counted on 600x magnification, and the magnification given in the article is for presentation purposes only. Our number of goblet cells was indeed calculated per villus, not crypt, and the resulting number is similar to ones reported in Dannapel et al (Dannappel et al., 2022). As for Paneth cells their numbers correspond to several articles that use the c57bl6 strain (Brischetto et al., 2021; King et al., 2013), as the number of Paneth cells differs between different part of the intestine and different mouse strains (Nakamura et al., 2020). 

      Reviewer 2:

      This reviewer appreciated the authors' effort in improving the quality of this manuscript during their revision. While some concerns remain, the revision is a much improved work and the authors addressed most of my major concerns.

      Figure 2E CDK8 and CDK19 immunofluorescent staining images seem to show CDK8 and CDK19 location are completely distinct and in different cells, the authors need to elaborate on this results and discuss what such a distinct location means in line of their double knockout data.

      We thank the reviewer for this suggestion. We had expanded the discussion in the lines 518 and 529 and included a better quality picture of the 200x magnification. Our main line of reasoning is that despite distinct expression in different cell types, high magnification show a certain level of expression of both proteins in most cells, so single knockouts will not demonstrate more than a slight phenotype, while the full knockout will have the full effect. This is especially true if our hypothesis that CCNC stabilization is important here, as both kinases can stabilize the protein.

      Minor comments:

      Supplemental figure 1(C) legend typo : (C) Periodic acid-Schiff stained sections of ilea of tamoxifen treated R26/Cre/ERT2 and DKO mice.

      Thank you, we corrected this.

      While the effort to identify and generate new antibodies is appreciated, the specificity of the antibodies used should be examined and presented if available.

      The specificity of the antibodies for the western blot is confirmed in figure S1F. We added fig. S1G with IF staining of CDK19 KO testes proving our CDK19 antibody specificity.

      References:

      Brischetto C., Krieger K., Klotz C., et.al. 2021. NF-κB determines Paneth versus goblet cell fate decision in the small intestine. Development 148. doi:10.1242/dev.199683

      Chuang H.-C., Li R., Huang H., et.al. 2024. Single-cell sequencing of full-length transcripts and T-cell receptors with automated high-throughput Smart-seq3. BMC Genomics 25:1127. doi:10.1186/s12864-024-11036-0

      Dannappel M.V., Zhu D., Sun X., et.al. 2022. CDK8 and CDK19 regulate intestinal differentiation and homeostasis via the chromatin remodeling complex SWI/SNF. J Clin Invest 132. doi:10.1172/JCI158593

      King S.L., Mohiuddin J.J., Dekaney C.M.. 2013. Paneth cells expand from newly created and preexisting cells during repair after doxorubicin-induced damage. Am J Physiol Gastrointest Liver Physiol 305:G151–62. doi:10.1152/ajpgi.00441.2012

      Nakamura K., Yokoi Y., Fukaya R., et.al. 2020. Expression and localization of Paneth cells and their α-defensins in the small intestine of adult mouse. Front Immunol 11:570296. doi:10.3389/fimmu.2020.570296

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study by Wang et al. identifies a new type of deacetylase, CobQ, in Aeromonas hydrophila. Notably, the identification of this deacetylase reveals a lack of homology with eukaryotic counterparts, thus underscoring its unique evolutionary trajectory within the bacterial domain.

      Strengths:

      The manuscript convincingly illustrates CobQ's deacetylase activity through robust in vitro experiments, establishing its distinctiveness from known prokaryotic deacetylases. Additionally, the authors elucidate CobQ's potential cooperation with other deacetylases in vivo to regulate bacterial cellular processes. Furthermore, the study highlights CobQ's significance in the regulation of acetylation within prokaryotic cells.

      Weaknesses:

      The problem I raised has been well resolved. I have no further questions.

      Thanks for your valuable comments very much.

      Reviewer #2 (Public review):

      In recent years, lots of researchers tried to explore the existence of new acetyltransferase and deacetylase by using specific antibody enrichment technologies and high resolution mass spectrometry. Here is an example for this effort. Yuqian Wang et al. studied a novel Zn2+- and NAD+-independent KDAC protein, AhCobQ, in Aeromonas hydrophila. They studied the biological function of AhCobQ by using biochemistry method and MS identification technology to confirm it. These results extended our understanding of the regulatory mechanism of bacterial lysine acetylation modifications. However, I find this conclusion is a little speculative, and unfortunately it also doesn't totally support the conclusion as the authors provided.

      Major concerns:

      - It is a little arbitrary to come to the title "Aeromonas hydrophila CobQ is a new type of NAD+- and Zn2+-independent protein lysine deacetylase in prokaryotes." It should be modified to delete the "in the prokaryotes" except that the authors get new more evidence in the other prokaryotes for the existence of the AhCobQ.

      Thank you for your suggestion. However, I believe there has been some confusion regarding the title. In the revised manuscript we have already updated the title to: "Aeromonas hydrophila CobQ is a new type of NAD+- and Zn2+-independent protein lysine deacetylase."

      This title does not include the phrase "in prokaryotes," as you mentioned. We kindly suggest verifying the version of the manuscript that was reviewed to ensure you are reviewing the most recent changes.

      - I was confused about the arrangement of the supplementary results. Because there are no citations for Figures S9-S19.

      Thank you for your feedback. It appears there may have been a misunderstanding, possibly due to reviewing an outdated version of the manuscript. In the revised manuscript we revised the supplementary figures and now have only 12 figures, all of which are correctly cited in the manuscript on pages 12 to 15. Below is a detailed list of the updated figure citations:

      Figures S1: page 8, line 148;

      Figures S2: page 9, line 168;

      Figures S3 and S4: page 10, line 178;

      Figures S5: page 10, line 186;

      Figures S6: page 10, line 189;

      Figures S7: page 12, line 221;

      Figures S8-S10: page 13, line 245;

      Figures S11: page 11, line 282;

      Figures S12: page 15, line 286

      - Same to the above, there are no data about Tables S1-S6.

      Thank you for your attention to the supplementary materials. As with the figures, we have already uploaded the data for Tables S1-S6 in the revised manuscript on November 19, 2024, and properly cited Tables S1 – S6 in the manuscript. Below is the citation information:

      Tables S1: page 10, line 194;

      Tables S2: page 13, line 245;

      Tables S3: page 21, line 438;

      Tables S4: page 22, line 439;

      Tables S5: page 22, line 445;

      Tables S6: page 27, line 564.

      Please note that Tables S3 – S4 include the chemical reagents, primers, and other experimental materials, which are not intended to be cited in the results section.)

      - All the load control is not integrated. Please provide all of the load controls with whole PAGE gel or whole membrane western blot results. Without these whole results, it is not convincing to come the conclusion as the authors mentioned in the context.

      Thank you for your comment. Please note that the full membrane western blot results were included in the revised manuscript. We hope this satisfies your request. If you need further clarification or additional data, please do not hesitate to let us know.

      - Thoroughly review the materials & methods section. It is unclear to me what exactly the authors describe in the method. All the experimental designs and protocols should be described in detail, including growth conditions, assay conditions, and purification conditions, etc.

      Thank you for your valuable suggestion. In response to your comment and previous feedback, we have alredy revised the Materials & Methods section thoroughly in the revised manuscript. The experimental details, including growth conditions, assay protocols, and purification procedures, are described in full on pages 22 to 30 of the revised manuscript.

      - Include relevant information about the experiments performed in the figure legends, such as experimental conditions, replicates, etc. Often it is not clear what was done based on the figure legend description.

      Thank you very much for your detailed feedback and suggestions. We have made sure to describe what each data point represents in the figure legends, as per the previous feedback. However, we would like to clarify that while we have provided detailed descriptions in the legends, the inclusion of every specific experimental condition in the figure legends could result in redundancy, as these details are already thoroughly outlined in the Materials & Methods section.

      We hope this explanation addresses your concern.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I have no further revision comments.

      Thank you very much.

      Reviewer #2 (Recommendations for the authors):

      I carefully read the point-to-point response from the author. Although they listed lots of the reasons for the ugly results, it still can not persuade me to accept their conclusions. While, as I know, it is impossible to reject their work in eLife as it was sent out for peer-review. I also can't accuse them of being wrong, but I have my opinion on this point. That is not the results, but the attitude.

      Thank you for your feedback. However, I must express some concerns regarding the nature of your comments. Based on the issues you've raised, it seems that you may have reviewed an outdated version of the manuscript. In the updated revision we addressed all the points you've raised, including the figure and table citations, experimental methods, and data integration.

      We understand that differing opinions are part of the peer-review process, but we respectfully believe that your conclusion regarding our attitude is based on a misunderstanding, possibly caused by reviewing an incorrect version of the manuscript. We have always strived to approach this manuscript with utmost professionalism and have diligently responded to each of your concerns.

      We sincerely suggest reviewing the latest version of our manuscript, and we welcome any further constructive feedback. We hope this clarifies any misunderstandings and look forward to your continued support.

      Thank you for your time and thoughtful consideration.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study by Wang et al. identifies a new type of deacetylase, CobQ, in Aeromonas hydrophila. Notably, the identification of this deacetylase reveals a lack of homology with eukaryotic counterparts, thus underscoring its unique evolutionary trajectory within the bacterial domain.

      Strengths:

      The manuscript convincingly illustrates CobQ's deacetylase activity through robust in vitro experiments, establishing its distinctiveness from known prokaryotic deacetylases. Additionally, the authors elucidate CobQ's potential cooperation with other deacetylases in vivo to regulate bacterial cellular processes. Furthermore, the study highlights CobQ's significance in the regulation of acetylation within prokaryotic cells.

      Weaknesses:

      The problem I raised has been well resolved. I have no further questions.

      Reviewer #2 (Public review):

      In recent years, lots of researchers tried to explore the existence of new acetyltransferase and deacetylase by using specific antibody enrichment technologies and high resolution mass spectrometry. Here is an example for this effort. Yuqian Wang et al. studied a novel Zn2+- and NAD+-independent KDAC protein, AhCobQ, in Aeromonas hydrophila. They studied the biological function of AhCobQ by using biochemistry method and MS identification technology to confirm it. These results extended our understanding of the regulatory mechanism of bacterial lysine acetylation modifications. However, I find this conclusion is a little speculative, and unfortunately, it also doesn't totally support the conclusion as the authors provided.

      Reviewer #3 (Public review):

      Summary:

      This study reports on a novel NAD+ and Zn2+-independent protein lysine deacetylase (KDAC) in Aeromonas hydrophila, termed as AhCobQ (AHA_1389). This protein is annotated as a CobQ/CobB/MinD/ParA family protein and does not show similarity with known NAD+-dependent or Zn2+-dependent KDACs. The authors showed that AhCobQ has NAD+ and Zn2+-independent deacetylase activity with acetylated BSA by western blot and MS analyses. They also provided evidence that the 195-245 aa region of AhCobQ is responsible for the deacetylase activity, which is conserved in some marine prokaryotes and has no similarity with eukaryotic proteins. They identified target proteins of AhCobQ deacetylase by proteomic analysis and verified the deacetylase activity using site-specific Kac proteins. Finally, they showed that AhCobQ activates isocitrate dehydrogenase by deacetylation at K388.

      Strengths:

      The finding of a new type of KDAC has a valuable impact on the field of protein acetylation. The characters (NAD+ and Zn2+-independent deacetylase activity in an unknown domain) shown in this study are very unexpected.

      Weaknesses:

      (1) The characters (NAD+ and Zn2+-independent deacetylase activity in an unknown domain) shown in this study are very unexpected. To convince readers, MSMS data must be necessary to accurately detect (de)acetylation at the target site in the deacetylase activity assay. The authors showed the MSMS data in assays with acetylated BSA, but other assays only rely on western blot.

      (2) They prepared site-specific Kac proteins and used them in deacetylase activity assays. Incorporation of acetyllysine at the target site should be confirmed by MSMS and shown as supplementary data.

      (3) The authors imply that the 195-245 aa region of AhCobQ may represent a new domain responsible for deacetylase activity. The feature of the region would be of interest but is not sufficiently described in Figure 5. The amino acid sequence alignments with representative proteins with conserved residues would be informative. It would be also informative if the modeled structure predicted by AlphaFold is shown and the structural similarity with known deacetylases is discussed.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The problem I raised has been well resolved. I have no further questions.

      Reviewer #2 (Recommendations for the authors):

      Questions to response of"-The load control is not all integrated. All of the load controls with whole PAGE gel or whole membrane western blot results should be provided. Without these whole results, it is not convincing to come to the conclusion that the authors have."

      Just as the Authors answered. The Coomassie Blue R-350 staining outcomes from the PVDF membranes. That is a good control for the experiment. However, I still have several questions about it:

      (1) The first is the quality of these Western blot. Why all the bands of these Western blot is so ugly? To tell the truth, it is very difficult to come to a conclusion from these poor western blots.

      We appreciate your feedback regarding the quality of the Western blots presented in Figure 7. We believe the “ugly bands” you referred to reflect our results validating the functions of CobQ through the use of recombinant site-specific Kac protein substrates.

      In our study, we meticulously engineered these recombinant site-specific Kac proteins using a two-plasmid system, based on foundational research published in Nature Chemical Biology (2017, 13(12): 1253-1260), which introduced the genetic encoding of Nε-acetyllysine into recombinant proteins. However, we faced a common challenge: protein truncation due to premature translation termination at the reassigned codon. This issue not only hampers protein yields, as discussed in ChemBioChem (2017, 18(20): 1973-1983), but also contributes to the suboptimal appearance of the Western blot results.

      Despite conducting at least two independent repetitions for the Western blot analysis of the site-specific Kac proteins, which yielded consistent results, we recognize that the overall quality remains less than ideal. This variability is inherently related to the characteristics of the target proteins. Nevertheless, the primary aim of our manuscript is to validate the novel deacetylase activity of CobQ. We have provided multiple lines of evidence, including mass spectrometry (MS/MS) and Western blot analyses, to substantiate this claim. In response to your comments, we have decided to remove the ambiguous Western blot results from Figure 7, retaining only four figures that demonstrate significant differences across at least two independent replicates (Author response images 1-5). Additionally, we have included four biological replicates of the Western blot results for ICD Kac388 + CobQ in the supplementary materials (Author response image 5) to further validate the deacetylase function of CobQ.

      Author response image 1.

      Western blot validation of the Kac26 AcrA-2 protein substrates regulated by the three KDACs in two biological replicates.

      Author response image 2.

      Western blot validation of the Kac48 Sun protein substrates regulated by the three KDACs in two biological replicates.

      Author response image 3.

      Western blot validation of the Kac103 Sun protein substrates regulated by the three KDACs in two biological replicates.

      Author response image 4.

      Western blot validation of the Kac195 Eno protein substrates regulated by the three KDACs in three biological replicates.

      Author response image 5.

      Western blot validation of the Kac388 ICD protein substrates regulated by AhCobQ in this study. Each sample was independently repeated at least three time.

      (2) The second is why some of the results are not from the same PVDF by comparing the Coommassie staining with the WB results just as authors responded. For example, the HrpA-K816 (ac), Eno-K195 (ac), ArcA-2-K26 (ac), ArcA-2-K26 (ac), IscS-K93(ac), A0KJ75-K81(ac), GyrB-K331(ac), GyrB-K449(ac), FtsA-K320(ac), FtsA-K409(ac), RecA-K279(ac), and the RecA-K306(ac). All of them are clearly not from the same staining results of PVDF membrane but from a new PVDF membrane.

      We assure you that the R-350 stained PVDF membranes originate from the same Western blot membranes. However, we acknowledge that visual discrepancies may arise due to differences in imaging techniques. The Western blot results were scanned using a ChemiDoc MP (Bio-Rad, Hercules, CA, USA), while the Coomassie R-350 stained PVDF membranes were captured using a standard camera. These differences can create a misleading appearance, making it seem as though they come from different membranes.

      It is also important to note that the intensity of the protein marker cannot be directly compared between the two imaging methods. As illustrated in Author response image 6, the protein marker at 70 kDa is clearly detectable in the Coomassie R-350 image, whereas it may not be as apparent in the Western blot result due to inherent differences in detection sensitivity.

      Author response image 6.

      The comparison of Western blotting and R-350 strained results of same protein marker in the same PVDF membrane. The protein marker located at 70 kDa can be detected easily in Coomassie R-350, while is difficult to display in WB result.

      Additionally, we have removed some of the so-called "ugly" Western blot results in the updated manuscript and provided the original full film of the relevant images as an attachment. This documentation demonstrates that all the data you referenced originate from the same film, as shown in Figures 1-5.

      (3) The third is why there is no replication for all these WB results. We should draw a conclusion with serious attitude, but not from the only one repeat, even say nothing about the poor results.

      Thank you for your valuable suggestion. In the second version of the manuscript, we have included the original full film of the relevant images. While we previously explained the reasons behind the "ugly" Western blot results, we have decided to remove some, or even all, of these results from Figure 7 in the updated version. The related images will be updated in the supplementary materials (Figures 1-5 in responding letter and Figure 7 in the revised manuscript).

      Furthermore, we have provided a more detailed discussion regarding the poor results in the updated manuscript to ensure clarity and transparency. We appreciate your understanding and hope these changes meet your expectations.

      Questions to response of " L174-187, L795 (Please show the whole membrane (or PAGE gel) of the loading control of CobB, and CobQ, except for the Kac-BSA)".

      (1) As we all educated that there is no control, and no biology. Where is the band of CobQ? Why do not stain the same PVDF membrane with R-350 staining but with a new membrane?

      Thank you for your insightful feedback. As noted in our previous response, the absence of visible bands for AhCobQ and AhCobB on the Coomassie R-350 stained PVDF membrane is primarily due to the low loading amounts and protein loss during the Western blotting process.

      To reinforce our findings, we repeated the analysis of the protein samples via SDS-PAGE, using the same loading quantity as in the previous Western blot shown in Figure 2 of the manuscript. As illustrated in Author response image 7, the bands for CobB and CobQ are discernible, albeit with significantly lower intensities compared to the Kac-BSA bands. Upon examining the full Coomassie R-350 stained PVDF membranes provided in Supplementary Material 1, we observe that the CobB and CobQ bands are not easily visible. This aligns with your observations and can be attributed to potential protein loss during the transfer from SDS-PAGE to the PVDF membrane.

      Author response image 7.

      The SDS-PAGE gel displayed the loading amounts of Kac-BSA and CobB/CobQ.

      To enhance the visibility of the CobQ/CobB bands, we increased the loading of CobQ/CobB in a new Western blot experiment, using 2 µg of Kac-BSA in combination with 0.8 µg of CobQ/CobB. As shown in Figure 8, while the increasing amounts of Kac-BSA resulted in a more blurred signal, the bands for the recombinant CobQ and CobB proteins were clearly detectable. This indicates that both proteins were indeed involved in the in vitro protein deacetylation assay.

      Author response image 8.

      Western blot verified the deacetylase activity assay of AhCobQ and AhCobB on Kac-BSA.

      Furthermore, we conducted a mass spectrometry analysis comparing Kac-BSA and Kac-BSA incubated with CobQ, as well as BSA without acetylation, against the A. hydrophila database with a cut-off of unique matched peptides >1. It is challenging to completely avoid contaminant detection during protein purification, especially when using high-resolution mass spectrometry. Our findings revealed that CobQ has the highest number of unique matched peptides (Author response table 1), while contaminants such as AHA_3036, AHA_0497, AHA_1279, and valS could be excluded, as they were present in Kac-BSA or BSA samples. Additionally, Tuf1, RplQ, GroEL, RpsF, RpsU, RpsB, RpsO, and RpsJ are known ribosomal subunits or chaperonins that are abundantly expressed in cells and may interact with various proteins, leading to contaminant detection.

      Author response table 1.

      LC MS/MS results of selected peptide quantification among Kac-BSA and Kac-BSA incubated with CobQ and BSA without acetylation against A. hydrophila database (unique matched peptides>1).

      Although AceE, a pyruvate dehydrogenase E1 component, theoretically possesses deacetylase activity, this possibility is low. First, in the SDS-PAGE gel of the purified recombinant protein, CobQ is the major band, with other proteins present at very low levels (less than 1/10 of CobQ). This suggests that significant deacetylation by contaminants is unlikely. Second, we purified His-tagged AhCobQ and GST-fused AhCobQ separately and tested their deacetylase activities. As shown in Figure S4 of the updated manuscript, both purified AhCobQ proteins exhibited deacetylase activity, while the negative control (purified GST protein only) did not, further supporting our conclusion that enzyme activity is not attributable to contaminating proteins (Figure S5).

      (2) Without the CobB and CobQ bands, it is impossible to say the function of CobQ is a new deacetylase. To avoid this confusion, it is easy to run a new gel and stain it with anti-His antibody to show these deacetylases.

      Thank you very much for your suggestion. We have performed the experiment in the comment (1) as your suggestion.

      (3) The explanation about the CobB/CobQ bands are not visible is not acceptable. Because the molecular weight of the CobB and CobQ is smaller than that of BSA, it is impossible that these bands will be loss during membrane transfer.

      Thank you for your valuable feedback. I completely agree that the loss of CobB and CobQ proteins during membrane transfer is unlikely due to their smaller molecular weight compared to BSA. As shown in Figure 7, the bands for CobB and CobQ are detectable in the SDS-PAGE gel but not visible on the Coomassie R-350 stained PVDF membrane.

      Several factors could contribute to this issue. One possibility is that the detection sensitivity of Coomassie R-350 may be lower than that of Coomassie R-250 used in the gel. Additionally, the Western blot results using an anti-His antibody further indicate low loading amounts of CobB and CobQ proteins on the PVDF membrane (Figure 8). This suggests that the observed low levels may indeed be due to protein loss during the membrane transfer process, despite their relatively small size.

      Reviewer #3 (Recommendations for the authors):

      (1) I found Tables S1 and S2 in the revised manuscript. It is strange to me that the intensity of Kac-BSA+CobQ is zero, completely nothing. Typically, a portion of the acetylated peptide remains after the deacetylation reaction.

      Thank you for your observation. When we report an intensity of zero, it does not imply a complete absence of signal; rather, it indicates that the signal for the target peptide is below the detectable threshold. This is likely due to the minimum cut-off setting in the MaxQuant (MQ) software, which is determined by parameters like "peptide_mass_tolerance" (as discussed in MQ user groups online, though it may not be explicitly listed in the parameters file).

      In our study, we performed a deacetylase assay that demonstrated CobQ's rapid activity; for instance, it can deacetylate ICD-K388ac within just four minutes. This leads me to hypothesize that the CobQ + Kac-BSA sample may have undergone near-complete enzymatic hydrolysis during the reaction.

      Furthermore, Table S1 in manuscript presents only a selection of the mass spectrometry results to illustrate CobQ's activity. In addition to the 15 acetylated peptides shown, there are many more (27 peptides) that exhibit significantly reduced acetylation levels without reaching zero intensity. The overall acetylation level of BSA peptides incubated with CobQ is calculated to be only 0.13 times that of Kac-BSA (Diagnostic peak: yes, peptide score: >100, Localization probability: >0.95) (Author response image 9).

      Based on these findings, we believe our mass spectrometry results are reliable and effectively support our conclusions. Thank you for your understanding.

      Author response image 9.

      The intensities of all Kac peptides of Kac-BSA with or without AhCobQ incubation in LC MS/MS.

      (2) It would be better to provide the information about ArcA and ArcA-2 as mentioned in the authors' response. It would be helpful for readers to understand that they are different proteins.

      Thank you for your suggestion. In the A. hydrophila ATCC 7966 dataset, there are indeed two distinct proteins referred to as ArcA: ArcA-1, which functions as an aerobic respiration control protein, and ArcA-2, which acts as an arginine deiminase. Importantly, these two proteins do not share any sequence homology; they are only similarly named due to their acronyms. While we believe this distinction does not require extensive explanation in the current study, we appreciate your input. Additionally, in response to Reviewer 2’s feedback, we have decided to remove the Western blot result for ArcA-2 due to its poor quality in the updated manuscript.

      (3) Line 409-416. Despite my comment, the citation of related papers on ICD acetylation in E. coli is still missing.

      Thank you for your suggestion. It has been added and highlighted in red. (Venkat S, et al, 2018, 430(13): 1901-1911)

      (4) The image resolution of Figure 3C and 3D is still bad. I could not evaluate that Kac was exactly incorporated at the target site.

      Thank you for your feedback regarding the image resolution of Figures 3C and 3D. We have now displayed these figures with improved clarity, as you suggested.

      To further validate the reliability of our MS2 data, we employed Proteome Discoverer 2.4 (Thermo) to analyze the raw data and provide theoretical mass information. As shown in Author response images 10-13, the MS2 spectra and fragment match lists for both unmodified and acetylated peptides offer additional confirmation of the reliability of our mass spectrometry results.

      Author response image 10.

      MS2 spectrum of unmodified peptide using PD v2.4 software.

      Author response image 11.

      The theoretical mass of unmodified peptide by PD 2.4

      Author response image 12.

      MS2 spectrum of acetylated peptide using PD v2.4 software.

      Author response image 13.

      The theoretical mass of acetylated peptide by PD 2.4.

      (5) Again, in Figure 8D, it should be shown the significance between ICD-Kac388 and ICD-Kac388+AhCobB to support the authors' conclusion that AhCobQ activates ICD by deacetylation at K388.

      Thanks for your suggestion, we have updated the figure in Figure 8D in updated manuscript.

      (6) It was nice that the authors presented the mass spectrum data of ICD-K388 acetylation (Figure 2 in responding letter). However, the data did not convince me that K388 is acetylated. In the figure, two b-ion peaks are detected, 285.1557 and 386.2034, which may correspond to NK (theoretical mass, 260.15) and NKT (theoretical mass, 361.20) peptides, respectively. If K388 is acetylated, an increase in the mass of 42 should be observed, but the difference between the detected and theoretical mass is 25. I also could not understand what the peak of 126.0913 mass is, indicated with acK* in red.

      Thank you for your detailed observation. The data presented in the MS2 spectrum for ICD-K388 acetylation in Figure 2 of the previous response letter were generated using Proteome Discoverer 2.4 (PD, Thermo) to ensure accurate mass calculations. Similar to the results from MaxQuant, ICD-K388 was identified again (Author response image 14).

      Regarding the b-ion peaks you mentioned, the values 285.1557 and 386.2034 correspond to NK<sup>ac</sup> and NK<sup>ac</sup>T peptides, respectively. The theoretical masses for these peptides are as follows: NK<sup>ac</sup> (285.15 = 115.05020 + 128.095 + 42.01) and NK<sup>ac</sup>T (386.20 = NK<sup>ac</sup> + 101.04768). The differences between the theoretical and detected masses for the relevant b-ions (b2*-NK, b52+-NH3, and b3) are minimal, at 0.00 Da and 2.1 ppm, respectively, which is consistent with the incorporation of an NH3 group (Author response image 15).

      Author response image 14.

      The MS2 of ICD-K388 peptide by PD 2.4.

      Author response image 15.

      The theoretical mass of ICD-K388 peptide by PD 2.4.

      The peak at 126.0913 m/z, indicated as acK*, represents immonium ions of ε-N-acetyllysine, which are generated during the fragmentation of acetyllysine. This diagnostic ion is widely recognized as a marker for identifying acetylated peptides (Nakayasu, et al,. A method to determine lysine acetylation stoichiometries. International journal of proteomics. 2014;2014(1):730725; Trelle et al., Utility of immonium ions for assignment of ε-N-acetyllysine-containing peptides by tandem mass spectrometry. Analytical chemistry. 2008;80(9):3422-30). Additionally, it is a default parameter in MaxQuant for identifying Kac peptides (Author response image 16).

      Based on these findings, we believe the evidence supporting ICD-K388 acetylation is robust.

      Author response image 16.

      The default parameter in Kac peptide identification in Maxquant v1.6 software

      (7) As mentioned by other reviewers, some of the figures and tables are incomplete. Some panels (ex. Figure 7C and 7D) and explanations (ex. What are lanes 1, 2, and 3 in Figure S3) are still missing.

      Thank you for your suggestion. It has been added.

    1. Author Response

      The following is the authors’ response to the previous reviews

      We would like to thank you again for your thorough review of the manuscript. We have taken all comments into account in the revised version of the manuscript. Please find below our detailed responses to your comments.

      Reviewing Editor

      The manuscript has been improved, but there are some remaining issues that need to be addressed, as outlined in the reviewers' comments. In particular, please pay attention to Figures 1A and 2A as they appear to be the same. Moreover, the original gel images for Western blots should be made available given the concerns raised by Reviewer #1.

      Thank you for your recommendations. We have carefully considered all comments and made the requested revisions to improve the manuscript.

      Reviewer #1 (Public Review):

      In this manuscript, the authors aimed to compare, from testis tissues at different ages from mice in vivo and after culture, multiple aspects of Leydig cells. These aspects included mRNA levels, proliferation, apoptosis, steroid levels, protein levels, etc. A lot of work was put into this manuscript in terms of experiments, systems, and approaches. The technical aspects of this work may be of interest to labs working on the specific topics of in vitro spermatogenesis for fertility preservation.

      Second review:

      The authors should be commended for substantial improvement in their manuscript for resubmission.

      Thank you very much for this second review and your help to improve this manuscript.

      Recommendations For The Authors:

      Going forward, the authors would be well-served to put a similar amount of effort on first drafts as well, which would both increase reviewer enthusiasm and reduce reviewer workload to document all the deficiencies! Abstract is much improved, and clearly articulates the point of the study.

      We are very grateful for all your constructive comments, which have greatly contributed to the improvement of our manuscript.

      1) 54 - replace "could be" with was

      “could be” was replaced by “was”

      2) 75 - delete "being"

      “being” was deleted.

      3) 103 - would say "indirectly promotes" since Rhox5 is a transcription factor that presumably activates genes in Sertoli cells whose products then affect neighboring germ cells, either by direct action or by influencing Sertoli cell behavior changes

      “indirectly” was added in the sentence.

      4) 139, 155, elsewhere - haven't seen dpp italicized before, certainly not the norm

      In dpp (days post-partum), “pp” is italicized as it is a Latin word.

      5) 265 - delete "found"

      “found” was deleted.

      6) 263-273 - Is the CYP19 protein referred to encoded by the Cyp19a1 gene (line 263)? Should standardize nomenclature...

      The CYP19 protein (aromatase) is indeed encoded by the Cyp19a1 gene. The nomenclature was standardized: “CYP19” was replaced by “CYP19A1” in the entire manuscript.

      7) 280 - "homolog" doesn't seem like the right word, as it has a very specific meaning with regards to the evolutionary genetic relatedness of genes. Maybe analog?

      “homolog” was replaced by “analog”.

      8) 306 - would reword to something like "proportions of seminiferous tubules containing round and elongating spermatids" - the because the tubules don't reach spermatid stages

      This sentence was reworded as suggested.

      9) 310 - delete "resulted in", unnecessary

      “resulted in” was deleted.

      10) Why are the images shown in Figures 1A and 2A the same? That seems odd - was that intentional? Curious overall why the data is presented in such a way that it's done twice...

      We mistakenly presented immunofluorescence images twice. Duplicate images have been removed. In the modified version of this manuscript, Figure 1A shows 3-HSD immunofluorescence staining in cultures of fresh testicular tissues and in their in vivo counterparts while Figure 1 – figure supplement 1A (not Figure 2A) shows 3-HSD immunofluorescence staining in cultures of frozen/thawed testicular tissues.

      11) In all the western blots, the cropping is done awfully close to the bands - why is this? Can full gels be shown in a Supplement? And especially in the westerns in Fig. 5C, esp for CYP17A1, the cropping is unacceptable. This reviewer is wondering whether this is an oversight, or whether there is another band below that one that is being masked? Again, should show whole blot for transparency and to ensure Rigor and Reproducibility.

      Full gels are shown in the Supplementary File 2. For CYP17A1, we have shown that only one band of the expected molecular weight is obtained with the antibody (Please see photo below). After this verification, the nitrocellulose membranes were cut at the 55 kDa molecular weight band in order to reveal CYP17A1 expression in the upper part of the membranes and the protein used for normalization in the lower part of the membranes.

      Author response image 1.

      12) For all figures, wondering why the font sizes are so disparate? This will need to be addressed before publication so it looks more professional.

      All figures have been reworked as requested.

      Reviewer #3 (Public Review):

      Moutard, Laura, et al. investigated the gene expression and functional aspects of Leydig cells in a cryopreservation/long-term culture system. The authors found that critical genetic markers for Leydig cells were diminished when compared to the in-vivo testis. The testis also showed less androgen production and androgen responsiveness. Although they did not produce normal testosterone concentrations in basal media conditions, the cultured testis still remained highly responsive to gonadotrophin exposure, exhibiting a large increase in androgen production. Even after the hCG-dependent increase in testosterone, genetic markers of Leydig cells remained low, which means there is still a missing factor in the culture media that facilitates proper Leydig cell differentiation. Optimizing this testis culture protocol to help maintain proper Leydig cell differentiation could be useful for future human testis biopsy cultures, which will help preserve fertility and child cancer patients.

      Overall, the authors addressed most comments and questions from the previous review. The additional data regarding the necrotic area is helpful for interpreting the quality of the cultures. The authors did not conduct a multiple comparison tests although there are multiple comparisons conducted on for a single dependent variable (Fig 2J, Fig 3F, among many others), however, the addition of this multiple comparison is unlikely to change the conclusions of the paper or the figure and, thus is a minor technical detail in this case.

      Thank you very much for this second review and your help to improve this manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews

      We have thoroughly addressed all the reviewers’ comments and meticulously revised the manuscript. Key modifications include the following:

      (a) Organizing the Logic and Highlighting Key Findings: We have revised the manuscript to emphasize key findings (especially the distinctions between the SEC and WOI groups) according to the following logic: constructing a receptive endometrial organoid, comparing its molecular characteristics with those of the receptive endometrium, highlighting its main features (hormone response, enhanced energy metabolism, ciliary assembly and motility, epithelial-mesenchymal transition), and exploring the function involved in embryo interaction.

      (b) Clarity and Better Description of Bioinformatic Analyses: We have revised the sections involving bioinformatic analyses to provide a more streamlined and comprehensible explanation. Instead of overwhelming the reader with excessive details, we focused on the most important findings, and performed additional experimental validation.

      (c) Rationale for Gene Selection: We have clarified the rationale for selecting certain genes and pathways for inclusion in the analysis and manuscript. The associated gene expression data for all figures have been provided in the attached Dataset.

      (d) In the response letter, we have provided the detailed presentation of the methodological optimization for constructing this endometrial assembloids, along with optimization and comparison of endometrial organoid culture media. Furthermore, in the Limitations section, we have explicitly stated that stromal cells and immune cells gradually diminish with increasing passage numbers. Therefore, this study primarily utilized endometrial assembloids within the first three passages for all investigations.

      Below, we provide a point-by-point response to each comment, with all modifications highlighted in the revised manuscript. We respectfully hope that these revisions effectively address the concerns raised by the reviewers.

      Public Reviews:

      Reviewer #1 (Public Review):

      This study generated 3D cell constructs from endometrial cell mixtures that were seeded in the Matrigel scaffold. The cell assemblies were treated with hormones to induce a "window of implantation" (WOI) state. The authors did their best to revise their study according to the reviewers' comments. However, the study remains unconvincing and at the same time too dense and not focused enough.

      (1) The use of the term organoids is still confusing and should be avoided. Organoids are epithelial tissue-resembling structures. Hence, the multiplecell aggregates developed here are rather "coculture models" (or "assembloids"). It is still unexpected (unlikely) that these structures containing epithelial, stromal and immune cells can be robustly passaged in the epithelial growth conditions used. All other research groups developing real organoids from endometrium have shown that only the epithelial compartment remains in culture at passaging (while the stromal compartment is lost). If authors keep to their idea, they should perform scRNA-seq on both early and late (passage 6-10) "organoids". And they should provide details of culturing/passaging/plating etc that are different with other groups and might explain why they keep stromal and immune cells in their culture for such a long time. In other words, they should then in detail compare their method to the standard method of all other researchers in the field, and show the differences in survival and growth of the stromal and immune cells.

      (1) We appreciate your feedback and have revised the term 'organoids' to 'assembloids'. 2)

      I. Due to budget constraints, this study did not perform scRNA-seq on both early and late passages (P6-P10). Instead, immunofluorescence staining confirmed the persistence of stromal cells at passage 6 (as shown below).

      Author response image 1.

      Whole-mount immunofluorescence showed that Vimentin+ F-actin+ cells (stromal cells) were arranged around the glandular spheres that were only F-actin+(passage 6).

      II. Improvements in this study include the following.

      a. Optimization of endometrial tissue processing: The procedures for tissue collection, pretreatment, digestion, and culture were refined to maximize the retention of endometrial epithelial cells, stromal cells, and immune cells (detailed optimizations are provided in Response Table 1).

      b. Enhanced culture medium formulation: Based on previous protocols, WNT3A was added to promote organoid development and differentiation (PMID: 27315476), while FGF2 was supplemented to improve stromal cell survival (PMID: 35224622) (see Response Table 2 for medium comparisons). Representative culture outcomes are shown in the figure below.

      We acknowledge that the stromal and immune cells in this system still exhibit differences compared to their in vivo counterparts. In this study, we utilized the first three passages, which offer optimal cell diversity and viability, to meet experimental needs. However, replicating and maintaining the full complexity of endometrial cell types in vitro remains a major challenge in the field—one that we are actively working to address.

      Author response table 1.

      Methodological Optimization of Endometrial Organoids (Construction, Passaging, and Cryopreservation)

      Author response table 2.

      Optimization and comparison of endometrial organoid culture media

      Author response image 2.

      Bright-field microscopy captures the expansion of glands and surrounding stromal cells across passages 0 to 2 (scar bar=200μm) (Yellow arrows: stromal cells; White arrows: glands).

      (2) The paper is still much too dense, touching upon all kind of conclusions from the manifold bioinformatic analyses. The latter should be much clearer and better described, and then some interesting findings (pathways/genes) should be highlighted without mentioning every single aspect that is observed. The paper needs a lot of editing to better focus and extract take-home messages, not bombing the reader with a mass of pathways, genes etc which makes the manuscript just not readable or 'digest-able'. There is no explanation whatever and no clear rationale why certain genes are included in a list while others are not. There is the impression that mass bioinformatics is applied without enough focus.

      Thanks for your suggestions. We have made improvements and revisions in the following areas:

      (a) Clarity and Better Description of Bioinformatic Analyses: We have revised the sections involving bioinformatic analyses to provide a more streamlined and comprehensible explanation. Instead of overwhelming the reader with excessive details, we focused on the most important findings.

      (b) Organizing the Logic and Highlighting Key Findings: We have revised the manuscript to emphasize key findings according to the following logic: constructing a receptive endometrial organoid, comparing its molecular characteristics with those of the receptive endometrium, highlighting its main features (hormone response, enhanced energy metabolism, ciliary assembly and motility, epithelial-mesenchymal transition), and exploring the function involved in embryo interaction.

      (c) Rationale for Gene Selection: We have clarified the rationale for selecting certain genes and pathways for inclusion in the analysis and manuscript.

      We hope these revisions address your concerns and improve the overall quality and clarity of the manuscript. Thank you once again for your valuable input.

      (3) The study is much too descriptive and does not show functional validation or exploration (except glycogen production). Some interesting findings extracted from the bioinformatics must be functionally tested.

      Thanks for your suggestions. We have restructured the logic and revised the manuscript, incorporating functional validation. The focus is on the following points: highlighting its main features (hormone response, enhanced energy metabolism, ciliary assembly and motility, epithelial-mesenchymal transition), and exploring the functions involved in embryo interaction.

      (4) In contrast to what was found in vivo (Wang et al. 2020), no abrupt change in gene expression pattern is mentioned here from the (early-) secretory to the WoI phase. Should be discussed. Although the bioinformatic analyses point into this direction, there are major concerns which must be solved before the study can provide the needed reliability and credibility for revision.

      To further investigate the abrupt change, the Mfuzz algorithm was utilized to analyze gene expression across the three groups, focusing on gene clusters that were progressively upregulated or downregulated. It was observed that mitochondrial and cilia-related genes exhibited the highest expression levels in WOI endometrial organoids, as well as cell junction and negative regulation of cell differentiation were downregulated (Figure 4A).

      (5) All data should be benchmarked to the Wang et al 2020 and Garcia-Alonso et al. 2021 papers reporting very detailed scRNA-seq data, and not only the Stephen R. Quake 2020 paper.

      We appreciate your suggestion. By integrating data from Garcia-Alonso et al. (2021) (shown in the figure below), we observed that both WOI organoids and SEC organoids exhibit increased glandular secretory epithelium and developed ciliated epithelium, mirroring features of mid-secretory endometrium. The findings exhibit parallels when contrasting these two papers.

      Author response image 3.

      UMAP visualization of integrated scRNA-seq data (our dataset and Garcia-Alonso et al. 2021) showing: (A) cell types, (B) WOI-org, (C)CTRL-org, (D)SEC-org versus published midsecretory samples.

      (6) Fig. 2B: Vimentin staining is not at all clear. F-actin could be used to show the typical morphology of the stromal cells?

      We appreciate your suggestion. We performed additional staining for F-actin based on Vimentin, and found that Vimentin+ F-actin+ cells (stromal cells) were arranged around the glandular spheres that were only F-actin+.

      (7) Where does the term "EMT-derived stromal cells" come from? On what basis has this term been coined?

      Within endometrial biology, stromal cells in the transition from epithelial to mesenchymal phenotype are specifically referred to as 'stromal EMT transition cells' (PMID: 39775038, PMID: 39968688).

      In certain cancers or fibrotic diseases, epithelial cells can transition into a mesenchymal phenotype, contributing to the stromal environment that supports tumor growth or tissue remodeling (PMID: 20572012).

      (8) CD44 is shown in Fig. 2D but the text mentions CD45 (line 159)?

      In Fig 2D, T cells are defined as a cluster of CD45+CD3+ cells, further subdivided into CD4+ and CD8+ T cells based on their expression of CD4 and CD8. This figure does not include data on CD44.

      (9) All quantification experiments (of stainings etc) should be in detail described how this was done. It looks very difficult (almost not feasible) when looking at the provided pictures to count the stained cells.

      a. Manual Measurement:

      For TEM-observed pinopodes, glycogen particles, microvilli, and cilia, manual region-of-interest (ROI) selection was performed using ImageJ software for quantitative analysis of counts, area, and length. Twenty randomly selected images per experimental group were analyzed for each morphological parameter.

      b. Automated Measurement:

      We quantified the fluorescence images using ImageJ. Firstly, preprocess them by adjusting brightness and contrast, and removing background noise with the “Subtract Background” feature.

      Secondly, set the threshold to highlight the cells, then select the regions of interest (ROI) using selection tools. Thirdly, as for counting the cells, navigate to Analyze > Analyze Particles. AS for measuring the influence intensity and area, set the “Measurement” options as mean gray value. Adjust parameters as needed, and view results in the “Results” window. Save the data for further analysis and ensure consistency throughout your measurements for reliable results.

      For 3D fluorescence quantification, ZEN software (Carl Zeiss) was exclusively used, with 11 images analyzed per experimental group. This part has been incorporated into “Supporting Information”

      Line 94-100.

      c. Normalization Method:

      For fluorescence quantification, DAPI was used as an internal reference for normalization, where both DAPI and target fluorescence channel intensities were quantified simultaneously. The normalized target signal intensity (target/DAPI ratio) was then compared across experimental groups. A minimum of 15 images were analyzed for each parameter per group. This part has been incorporated into “Supporting Information” Line 101-104.

      (10) Fig. 3C: it is unclear how quantification can be reliably done. Moreover, OLFM4 looks positive in all cells of Ctrl, but authors still see an increase?

      (a) Fluorescence images were quantitatively analyzed using ImageJ by measuring the mean gray values. For normalization, DAPI staining served as an internal reference, with simultaneous measurement of mean gray values in both the target fluorescence channel and the DAPI channel. The relative fluorescence intensity was then calculated as the ratio of target channel to DAPI signal for inter-group quantitative comparisons.

      (b) OLFM4 is an E2-responsive gene. Its expression in endometrial organoids of the CTRL group is physiologically normal (PMID: 31666317). However, its fluorescence intensity (quantified as mean gray value) was significantly stronger in both the SEC and WOI groups compared to the CTRL group (quantitative method as described above).

      (11) Fig. 3F: Met is downregulated which is not in accordance with the mentioned activation of the PI3K-AKT pathway.

      We appreciate your careful review. Our initial description was imprecise. In the revised manuscript, this statement has been removed entirely.

      (12) Lines 222-226: transcriptome and proteome differences are not significant; so, how meaningful are the results then? Then, it is very hard to conclude an evolution from secretory phase to WoI.

      We appreciate your feedback. The manuscript has been comprehensively revised, and the aforementioned content has been removed.

      (13) WoI organoids show an increased number of cilia. However, some literature shows the opposite, i.e. less ciliated cells in the endometrial lining at WoI (to keep the embryo in place). How to reconcile?

      Thank you for raising this question. We conducted a statistical analysis of the proportion of ciliated cells across endometrial phases.

      (a) Based on the 2020 study by Stephen R. Quake and Carlos Simon’s team published in Nature Medicine (PMID: 32929266), the mid-secretory phase (Days 19–23) exhibited a higher proportion of ciliated cells compared to the early-secretory (Days 15–18) and late-secretory phases (Days 24– 28) (Fig. R13 A).

      (b) According to the 2021 study by Roser Vento-Tormo’s team in Nature Genetics, ciliated cell abundance peaked in the early-to-mid-secretory endometrium across all phases (Fig. R13 B-C).

      Data were sourced from the Reproductive Cell Atlas.

      (14) How are pinopodes distinguished from microvilli? Moreover, Fig. 3 does not show the typical EM structure of cilia.

      Thank you for this insightful question.

      (a) Pinopodes are large, bulbous protrusions with a smooth apical membrane. Under transmission electron microscopy (TEM), it can be observed that the pinopodes contain various small particles, which are typically extracellular fluid and dissolved substances.

      Microvilli are elongated, finger-like projections that typically exhibit a uniform and orderly arrangement, forming a "brush border" structure. Under transmission electron microscopy, dense components of the cytoskeleton, such as microfilaments and microtubules, can be seen at the base of the microvilli.

      (b) You may refer to the ciliated TEM structure shown in the current manuscript's Fig. 2E (originally labeled as Fig. 2H in the draft). The cilium is composed of microtubules. The cross-section shows that the periphery of the cilium is surrounded by nine pairs of microtubules arranged in a ring. The longitudinal section shows that the cilium has a long cylindrical structure, with the two central microtubules being quite prominent and located at the center of the cilium.

      (15) There is a recently published paper demonstrating another model for implantation. This paper should be referenced as well (Shibata et al. Science Advances, 2024).

      Thanks for your valuable comments. We have cited this reference in the manuscript at Line 77-78.

      (16) Line 78: two groups were the first here (Turco and Borreto) and should both be mentioned.

      Thanks for your valuable comments. We have cited this reference in the manuscript at Line 74-76.

      (17) Line 554: "as an alternative platform" - alternative to what? Authors answer reviewers' comments by just changing one word, but this makes the text odd.

      Thank you for your review. Here, we propose that this WOI organoid serves as an alternative research platform for studying endometrial receptivity and maternal-fetal interactions, compared to current secretory-phase organoids. In the revised manuscript, we have supplemented the data by co-culturing this WOI organoid with blastoid, demonstrating its robust embryo implantation potential.

      Reviewer #2 (Public Review):

      In this research, Zhang et al. have pioneered the creation of an advanced organoid culture designed to emulate the intricate characteristics of endometrial tissue during the crucial Window of Implantation (WOI) phase. Their method involves the incorporation of three distinct hormones into the organoid culture, coupled with additives that replicate the dynamics of the menstrual cycle. Through a series of assays, they underscore the striking parallels between the endometrial tissue present during the WOI and their crafted organoids. Through a comparative analysis involving historical endometrial tissue data and control organoids, they establish a system that exhibits a capacity to simulate the intricate nuances of the WOI. The authors made a commendable effort to address the majority of the statements. Developing an endometrial organoid culture methodology that mimics the window of implantation is a game-changer for studying the implantation process. However, the authors should strive to enhance the results to demonstrate how different WOI organoids are from SEC organoids, ensuring whether they are worth using in implantation studies, or a proper demonstration using implantation experiments.

      Thank you for your valuable suggestions. The WOI organoids differ from SEC organoids in the following aspects.

      (1) Structurally, WOI endometrial organoids exhibit subcellular features characteristic of the implantation window: densely packed pinopodes on the luminal side of epithelial cells, abundant glycogen granules, elongated and tightly arranged microvilli, and increased cilia (Figure 2F).

      (2) At the molecular level, WOI organoids show enlarged and functionally active mitochondria, enhanced ciliary assembly and motility, and single-cell signatures resembling mid-secretory endometrium.

      Specifically, mitochondrial- and cilia-related genes/proteins are most highly expressed in WOI organoids (Figure 4A,B). TEM analysis revealed that WOI organoids have the largest average mitochondrial area (Figure 4C). Mitochondrial-related genes display an increasing trend across the three organoid groups, and WOI organoids produce more ATP and IL-8 (Figure 4D,E).

      For cilia, WOI organoids upregulate genes/proteins involved in ciliary assembly, basal bodies, and motile cilia, while downregulating non-motile cilia markers (Figure 5A-C).

      Single-cell analysis further confirms that WOI organoids recapitulate mid-secretory endometrial traits in mitochondrial metabolism and cell adhesion (Figure 2G).

      (3) Functionally, WOI organoids demonstrate superior embryo implantation potential. Given the scarcity and ethical constraints of human embryos, we used blastoids for implantation assays (Figure 6A). These blastoids successfully grew within endometrial organoids, established interactions (Figure 6B), and exhibited normal trilineage differentiation (epiblast: OCT4; hypoblast: GATA6; trophoblast: KRT18) (Figure 6C). WOI organoids achieved significantly higher blastoid survival (66% vs. 19% in CTRL and 28% in SEC) and interaction rates (90% vs. 47% in CTRL and 53% in SEC), confirming their robust embryo-receptive capacity (Figure 6D,E).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      In conclusion, it is needed to first meet all the concerns of the reviewers and then submit an appropriately adapted and comprehensive paper (also showing the robustness of the "organoids" and functionality of the findings) instead of this still fully descriptive paper. Further comments are included in the rebuttal document of the authors and will be provided by the editor as PDF.

      Reviewer #2 (Recommendations For The Authors):

      The authors made a good effort to reply all the statements. However, there are some points that the authors need to address.

      • There is an inconsistency in the manuscript regarding the number of passages in which the organoids are used; in the response to the reviewers, it mentions 5 passages, while in the Materials and Methods section, it states 3 passages.

      We sincerely appreciate your thorough review. In this study, organoids within the first three passages were used. To address the reviewer's question comprehensively, we have now provided a detailed account of the organoid passage history in our response.

      • We agree that the difference between SEC and WOI organoids may be subtle, but in response to this, the authors should explain what they mean by "the most notable differences lie in the more comprehensive differentiation and varied cellular functions exhibited by WOI organoids..."

      In the original manuscript, this statement indicated that, at the single-cell level, WOI endometrial organoids exhibited more functionally mature and thoroughly differentiated characteristics compared to SEC endometrial organoids (See details below).

      In the revised version, we have restructured this section to focus on following aspects: hormone response, energy metabolism, ciliary assembly and motility, epithelial-mesenchymal transition, and embryo implantation potential. Consequently, the "the most notable differences lie in the more comprehensive differentiation and varied cellular functions exhibited by WOI organoids..."has been removed.

      (1) Varied cellular functions:

      a. Secretory Epithelium: Compared to SEC organoids, WOI organoids exhibit enhanced peptide metabolism and mitochondrial energy metabolism in their secretory epithelium, supporting endometrial decidualization and embryo implantation (Figure 3F).

      b. Proliferative Epithelium: Compared to SEC organoids, WOI organoids demonstrate enhanced GTPase activity, angiogenesis, cytoskeletal assembly, cell differentiation, and RAS protein signaling in their proliferative epithelium (Figure S2G).

      c. Ciliated Epithelium: The ciliated epithelium of WOI endometrial organoids is associated with the regulation of vascular development and exhibits higher transcriptional activity compared to SEC organoids (Figure 5E).

      d. Stromal Cells: Compared to SEC organoids, WOI organoids exhibit enhanced cell junctions, cell migration, and cytoskeletal regulation in EMT-derived stromal cells (Figure S4A right panel). Similarly, cell junctions are also strengthened in stromal cells (Figure S4A left panel).

      (2) comprehensive differentiation:

      a. Compared to SEC organoids, WOI organoids exhibit more complete differentiation from proliferative epithelium to secretory epithelium (Figure 3G).

      b. The WOI organoids demonstrate more robust ciliary differentiation compared to SEC organoids (Figure 5F).

      c. The proliferative epithelium progressively differentiates into EMT-derived cells. Compared to SEC organoids, WOI organoids are predominantly localized at the terminal end of the differentiation trajectory, indicating more complete differentiation (Figure S4B).

      • What do the authors mean by "average intensity" when referring to the extra reagents added to the WOI? The results that the authors show in response to Reviewer 2's Q1 must be included as part of the results and explain how it was done in the materials and methods section.

      This parameter indicates the growth status of organoids. It measures the gray value of organoids through long-term live-cell tracking. When organoids undergo apoptosis, they progressively condense into denser solid spheres, leading to an increase in gray value (average intensity). This content has been incorporated into the Results section (Line 129) and is further explained in the Supporting Information "Materials and Methods" (Lines 70-77).

      • In panel 1C, it is not possible to see the stromal cells around because they are brightfield images.

      You are partly right. Bright-field images alone indeed make it difficult to distinguish stromal cells. However, by combining whole-mount immunofluorescence staining with the characteristic elongated spindle-shaped morphology of stromal cells, we were able to roughly determine their distribution in the bright-field images.

      • Responding to Reviewer 2's question Q7, the authors indicate how they establish the cluster. However, they do not specify whether they extrapolate the data from a database or create the cluster themselves based on the literature. It should be stated from which classification list (or classification database) the extrapolation has been made.

      Within endometrial biology, stromal cells in the transition from epithelial to mesenchymal phenotype are specifically referred to as 'stromal EMT transition cells' (PMID: 39775038, PMID: 39968688).

      In certain cancers or fibrotic diseases, epithelial cells can transition into a mesenchymal phenotype, contributing to the stromal environment that supports tumor growth or tissue remodeling (PMID: 20572012).

      • Regarding Reviewer 2's question Q8, if the authors have not been able to make comparisons with, at least, SEC organoids, unfortunately, the ERT loses much of its strength and should not serve as support.

      We agree with you at this point. These results have been moved to the supplementary figures.

      • If the differences in the transcriptome and proteome between SEC and WOI organoids are not significant, the result does not support the authors' model. If there are barely any differences at the proteome and transcriptome level between SEC and WOI organoids, why would anyone choose to use their model over SEC organoids?

      We sincerely appreciate your valuable feedback. In this revised manuscript, we have further integrated transcriptomic and proteomic analyses, revealing that WOI organoids exhibit enlarged and functionally active mitochondria, along with enhanced cilia assembly and motility compared to SEC organoids. Additionally, using a blastoid model, we demonstrated that WOI organoids possess superior embryo implantation potential, significantly outperforming SEC organoids. Our research group aims to develop an embryo co-culture model. Through systematic comparisons of structural, molecular, and co-culture characteristics between SEC and WOI organoids, we ultimately confirmed the superior performance of WOI organoids.

      • SEC and WOI organoids must be different enough to establish a new model, and the authors do not demonstrate that they are.

      Thank you for your valuable feedback. In the revised manuscript, we have emphasized the distinctions between SEC and WOI organoids in terms of structure, molecular characteristics, and functionality (co-culture with blastoid), as detailed below.

      (1) Structurally, WOI endometrial organoids exhibit subcellular features characteristic of the implantation window: densely packed pinopodes on the luminal side of epithelial cells, abundant glycogen granules, elongated and tightly arranged microvilli, and increased cilia (Figure 2F).

      (2) At the molecular level, WOI organoids show enlarged and functionally active mitochondria, enhanced ciliary assembly and motility, and single-cell signatures resembling mid-secretory endometrium.

      Specifically, mitochondrial- and cilia-related genes/proteins are most highly expressed in WOI organoids (Figure 4A,B). TEM analysis revealed that WOI organoids have the largest average mitochondrial area (Figure 4C). Mitochondrial-related genes display an increasing trend across the three organoid groups, and WOI organoids produce more ATP and IL-8 (Figure 4D,E).

      For cilia, WOI organoids upregulate genes/proteins involved in ciliary assembly, basal bodies, and motile cilia, while downregulating non-motile cilia markers (Figure 5A-C).

      Single-cell analysis further confirms that WOI organoids recapitulate mid-secretory endometrial traits in mitochondrial metabolism and cell adhesion (Figure 2G).

      (3) Functionally, WOI organoids demonstrate superior embryo implantation potential. Given the scarcity and ethical constraints of human embryos, we used blastoids for implantation assays (Figure 6A). These blastoids successfully grew within endometrial organoids, established interactions (Figure 6B), and exhibited normal trilineage differentiation (epiblast: OCT4; hypoblast: GATA6; trophoblast: KRT18) (Figure 6C). WOI organoids achieved significantly higher blastoid survival (66% vs. 19% in CTRL and 28% in SEC) and interaction rates (90% vs. 47% in CTRL and 53% in SEC), confirming their robust embryo-receptive capacity (Figure 6D,E).

      • Regarding Q16, Boretto et al. 2017 and Turco et al. 2017 also manage to isolate stromal cells, but they lose them between passages. It's not a matter of isolating them from the tissue or not, but rather how they justify their maintenance in culture. In the images added by the authors, it can be seen that the majority of stromal cells are lost from P0 to P1 after thawing. I still believe that the epithelial part can be passed and maintained, but the rest cannot, and that should be mentioned in the paper as a limitation. However, the authors can demonstrate the maintenance of stromal cells by performing immunostaining with vimentin from passages 4, 5, and 6.

      Thank you for your valuable comments. We have added the statement 'Stromal cells and immune cells are difficult to pass down stably and their proportion is lower than that in the in vivo endometrium' to the Limitations section (Line 364-365). Additionally, we performed immunostaining with vimentin starting from passage 6 and confirmed the presence of Vimentin+ F-actin+ stromal cells (as shown in Author response image 1).

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents a new and valuable theoretical account of spatial representational drift in the hippocampus. The evidence supporting the claims is convincing, with a clear and accessible explanation of the phenomenon. Overall, this study will likely attract researchers exploring learning and representation in both biological and artificial neural networks.

      We would like to ask the reviewers to consider elevating the assessment due to the following arguments. As noted in the original review, the study bridges two different fields (machine learning and neuroscience), and does not only touch a single subfield (representational drift in neuroscience). In the revision, we also analysed data from four different labs, strengthening the evidence and the generality of the conclusions.

      Public Reviews:

      Reviewer #1 (Public Review):

      The authors start from the premise that neural circuits exhibit "representational drift" -- i.e., slow and spontaneous changes in neural tuning despite constant network performance. While the extent to which biological systems exhibit drift is an active area of study and debate (as the authors acknowledge), there is enough interest in this topic to justify the development of theoretical models of drift.

      The contribution of this paper is to claim that drift can reflect a mixture of "directed random motion" as well as "steady state null drift." Thus far, most work within the computational neuroscience literature has focused on the latter. That is, drift is often viewed to be a harmless byproduct of continual learning under noise. In this view, drift does not affect the performance of the circuit nor does it change the nature of the network's solution or representation of the environment. The authors aim to challenge the latter viewpoint by showing that the statistics of neural representations can change (e.g. increase in sparsity) during early stages of drift. Further, they interpret this directed form of drift as "implicit regularization" on the network.

      The evidence presented in favor of these claims is concise. Nevertheless, on balance, I find their evidence persuasive on a theoretical level -- i.e., I am convinced that implicit regularization of noisy learning rules is a feature of most artificial network models. This paper does not seem to make strong claims about real biological systems. The authors do cite circumstantial experimental evidence in line with the expectations of their model (Khatib et al. 2022), but those experimental data are not carefully and quantitatively related to the authors' model.

      We thank the reviewer for pushing us to present stronger experimental evidence. We now analysed data from four different labs. Two of those are novel analyses of existing data (Karlsson et al, Jercog et al). All datasets show the same trend - increasing sparsity and increasing information per cell. We think that the results, presented in the new figure 3, allow us to make a stronger claim on real biological systems.

      To establish the possibility of implicit regularization in artificial networks, the authors cite convincing work from the machine-learning community (Blanc et al. 2020, Li et al., 2021). Here the authors make an important contribution by translating these findings into more biologically plausible models and showing that their core assumptions remain plausible. The authors also develop helpful intuition in Figure 4 by showing a minimal model that captures the essence of their result.

      We are glad that these translation efforts are appreciated.

      In Figure 2, the authors show a convincing example of the gradual sparsification of tuning curves during the early stages of drift in a model of 1D navigation. However, the evidence presented in Figure 3 could be improved. In particular, 3A shows a histogram displaying the fraction of active units over 1117 simulations. Although there is a spike near zero, a sizeable portion of simulations have greater than 60% active units at the end of the training, and critically the authors do not characterize the time course of the active fraction for every network, so it is difficult to evaluate their claim that "all [networks] demonstrated... [a] phase of directed random motion with the low-loss space." It would be useful to revise the manuscript to unpack these results more carefully. For example, a histogram of log(tau) computed in panel B on a subset of simulations may be more informative than the current histogram in panel A.

      The previous figure 3A was indeed confusing. In particular, it lumped together many simulations without proper curation. We redid this figure (now Figure 4), and added supplementary figures (Figures S1, S2) to better explain our results. It is now clear that the simulations with a large number of active units were either due to non-convergence, slow timescale of sparsification or simulations featuring label noise in which the fraction of active units is less affected. Regarding the log(tau) calculation, while it could indeed be an informative plot, it could not be calculated in a simple manner for all simulations. This is because learning curves are not always exponential, but sometimes feature initial plateaus (see also Saxe et al 2013, Schuessler et al 2020). We added a more detailed explanation of this limitation in the methods section, and we believe the current figure exemplifies the effect in a satisfactory manner.

      Reviewer #2 (Public Review):

      Summary:

      In the manuscript "Representational drift as a result of implicit regularization" the authors study the phenomenon of representational drift (RD) in the context of an artificial network that is trained in a predictive coding framework. When trained on a task for spatial navigation on a linear track, they found that a stochastic gradient descent algorithm led to a fast initial convergence to spatially tuned units, but then to a second very slow, yet directed drift which sparsified the representation while increasing the spatial information. They finally show that this separation of timescales is a robust phenomenon and occurs for a number of distinct learning rules.

      Strengths:

      This is a very clearly written and insightful paper, and I think people in the community will benefit from understanding how RD can emerge in such artificial networks. The mechanism underlying RD in these models is clearly laid out and the explanation given is convincing.

      We thank the reviewer for the support.

      Weaknesses:

      It is unclear how this mechanism may account for the learning of multiple environments.

      There are two facets to the topic of multiple environments. First, are the results of the current paper relevant when there are multiple environments? Second, what is the interaction between brain mechanisms of dealing with multiple environments and the results of the current paper?

      We believe the answer to the first question is positive. The near-orthogonality of representations between environments implies that changes in one can happen without changes in the other. This is evident, for instance, in Khatib et al and Geva et al - in both cases, drift seems to happen independently in two environments, even though they are visited intermittently and are visually similar.

      The second question is a fascinating one, and we are planning to pursue it in future work. While the exact way in which the brain achieves this near-independence is an open question, remapping is one possible window into this process.

      We extended the discussion to make these points clear.

      The process of RD through this mechanism also appears highly non-stationary, in contrast to what is seen in familiar environments in the hippocampus, for example.

      The non-stationarity noted by the reviewer is indeed a major feature of our observations, and is indeed linked to familiarity. We divide learning into three phases (now more clearly stated in Table 1 and Figure 4C). The first, rapid phase, consists of improvement of performance - corresponding to initial familiarity with the environment. The third phase, often reported in the literature of representational drift, is indeed stationary and obtained after prolonged familiarity. Our work focuses on the second phase, which is not as immediate as the first one, and can take several days. We note in the discussion that experiments which include a long familiarization process can miss this phase (see also Table 3). Furthermore, we speculate that real life is less stationary than a lab environment, and this second phase might actually be more relevant there.

      Reviewer #3 (Public Review):

      Summary:

      Single-unit neural activity tuned to environmental or behavioral variables gradually changes over time. This phenomenon, called representational drift, occurs even when all external variables remain constant, and challenges the idea that stable neural activity supports the performance of well-learned behaviors. While a number of studies have described representational drift across multiple brain regions, our understanding of the underlying mechanism driving drift is limited. Ratzon et al. propose that implicit regularization - which occurs when machine learning networks continue to reconfigure after reaching an optimal solution - could provide insights into why and how drift occurs in neurons. To test this theory, Ratzon et al. trained a Feedforward Network trained to perform the oft-utilized linear track behavioral paradigm and compare the changes in hidden layer units to those observed in hippocampal place cells recorded in awake, behaving animals.

      Ratzon et al. clearly demonstrate that hidden layer units in their model undergo consistent changes even after the task is well-learned, mirroring representational drift observed in real hippocampal neurons. They show that the drift occurs across three separate measures: the active proportion of units (referred to as sparsification), spatial information of units, and correlation of spatial activity. They continue to address the conditions and parameters under which drift occurs in their model to assess the generalizability of their findings.

      However, the generalizability results are presented primarily in written form: additional figures are warranted to aid in reproducibility.

      We added figures, and a Github with all the code to allow full reproducibility.

      Last, they investigate the mechanism through which sparsification occurs, showing that the flatness of the manifold near the solution can influence how the network reconfigures. The authors suggest that their findings indicate a three-stage learning process: 1) fast initial learning followed by 2) directed motion along a manifold which transitions to 3) undirected motion along a manifold.

      Overall, the authors' results support the main conclusion that implicit regularization in machine learning networks mirrors representational drift observed in hippocampal place cells.

      We thank the reviewer for this summary.

      However, additional figures/analyses are needed to clearly demonstrate how different parameters used in their model qualitatively and quantitatively influence drift.

      We now provide additional figures regarding parameters (Figures S1, S2).

      Finally, the authors need to clearly identify how their data supports the three-stage learning model they suggest.

      Their findings promise to open new fields of inquiry into the connection between machine learning and representational drift and generate testable predictions for neural data.

      Strengths:

      (1) Ratzon et al. make an insightful connection between well-known phenomena in two separate fields: implicit regularization in machine learning and representational drift in the brain. They demonstrate that changes in a recurrent neural network mirror those observed in the brain, which opens a number of interesting questions for future investigation.

      (2) The authors do an admirable job of writing to a large audience and make efforts to provide examples to make machine learning ideas accessible to a neuroscience audience and vice versa. This is no small feat and aids in broadening the impact of their work.

      (3) This paper promises to generate testable hypotheses to examine in real neural data, e.g., that drift rate should plateau over long timescales (now testable with the ability to track single-unit neural activity across long time scales with calcium imaging and flexible silicon probes). Additionally, it provides another set of tools for the neuroscience community at large to use when analyzing the increasingly high-dimensional data sets collected today.

      We thank the reviewer for these comments. Regarding the hypotheses, these are partially confirmed in the new analyses we provide of data from multiple labs (new Figure 3 and Table 3) - indicating that prolonged exposure to the environment leads to more stationarity.

      Weaknesses:

      (1) Neural representational drift and directed/undirected random walks along a manifold in ML are well described. However, outside of the first section of the main text, the analysis focuses primarily on the connection between manifold exploration and sparsification without addressing the other two drift metrics: spatial information and place field correlations. It is therefore unclear if the results from Figures 3 and 4 are specific to sparseness or extend to the other two metrics. For example, are these other metrics of drift also insensitive to most of the Feedforward Network parameters as shown in Figure 3 and the related text? These concerns could be addressed with panels analogous to Figures 3a-c and 4b for the other metrics and will increase the reproducibility of this work.

      We note that the results from figures 3 and 4 (original manuscript) are based on abstract tasks, while in figure 2 there is a contextual notion of spatial position. Spatial position metrics are not applicable to the abstract tasks as they are simple random mapping of inputs, and there isn’t necessarily an underlying latent variable such as position. This transition between task types is better explained in the text now. In essence the spatial information and place field correlation changes are simply signatures of the movements in parameter space. In the abstract tasks their change becomes trivial, as the spatial information becomes strongly correlated with sparsity and place fields are simply the activity vectors of units. These are guaranteed to change as long as there are changes in the activity statistics. We present here the calculation of these metrics averaged over simulations for completeness.

      Author response image 1.

      PV correlation between training time points averaged over 362 simulations. (B) Mean SI of units normalized to first time step, averaged over 362 simulations. Red line shows the average time point of loss convergence, the shaded area represents one standard deviation.

      (2) Many caveats/exceptions to the generality of findings are mentioned only in the main text without any supporting figures, e.g., "For label noise, the dynamics were qualitatively different, the fraction of active units did not reduce, but the activity of the units did sparsify" (lines 116-117). Supporting figures are warranted to illustrate which findings are "qualitatively different" from the main model, which are not different from the main model, and which of the many parameters mentioned are important for reproducing the findings.

      We now added figures (S1, S2) that show this exactly. We also added a github to allow full reproduction.

      (3) Key details of the model used by the authors are not listed in the methods. While they are mentioned in reference 30 (Recanatesi et al., 2021), they need to be explicitly defined in the methods section to ensure future reproducibility.

      The details of the simulation are detailed in the methods sections. We also added a github to allow full reproducibility.

      (4) How different states of drift correspond to the three learning stages outlined by the authors is unclear. Specifically, it is not clear where the second stage ends, and the third stage begins, either in real neural data or in the figures. This is compounded by the fact that the third stage - of undirected, random manifold exploration - is only discussed in relation to the introductory Figure 1 and is never connected to the neural network data or actual brain data presented by the authors. Are both stages meant to represent drift? Or is only the second stage meant to mirror drift, while undirected random motion along a manifold is a prediction that could be tested in real neural data? Identifying where each stage occurs in Figures 2C and E, for example, would clearly illustrate which attributes of drift in hidden layer neurons and real hippocampal neurons correspond to each stage.

      Thanks for this comment, which urged us to better explain these concepts.

      The different processes (reduction in loss, reduction in Hessian) happen in parallel with different timescales. Thus, there are no sharp transitions between the phases. This is now explained in the text in relation to figure 4C, where the approximate boundaries are depicted.

      The term drift is often used to denote a change in representation without a change in behavior. In this sense, both the second and third phases correspond to drift. Only the third stage is stationary. This is now emphasized in the text and in the new Table 1. Regarding experimental data, apart from the new figure 3 with four datasets, we also summarize in Table 3 the relation between duration of familiarity and stationarity of the data.

      Recommendations for the authors:

      The reviewers have raised several concerns. They concur that the authors should address the specific points below to enhance the manuscript.

      (1) The three different phases of learning should be clearly delineated, along with how they are determined. It remains unclear in which exact phase the drift is observed.

      This is now clearly explained in the new Table 1 and Figure 4C. Note that the different processes (reduction in loss, reduction in Hessian) happen in parallel with different timescales. Thus, there are no sharp transitions between the phases. This is now explained in the text in relation to figure 4C, where the approximate boundaries are depicted.

      The term drift is often used to denote a change in representation without a change in behavior. In this sense, both the second and third phases correspond to drift. Only the third stage is stationary. This is now emphasized in the text and in the new Table 1. Regarding experimental data, apart from the new figure 3 with four datasets, we also summarize in Table 3 the relation between duration of familiarity and stationarity of the data.

      (2) The term "sparsification" of unit activity is not fully clear. Its meaning should be more explicitly explained, especially since, in the simulations, a significant number of units appear to remain active (Fig. 3A).

      We now define precisely the two measures we use - Active Fraction, and Fraction Active Units. There is a new section with an accompanying figure in the Methods section. As Figure S2 shows, the noise statistics (label noise vs. update noise) differentially affects these two measures.

      (3) While the study primarily focuses on one aspect of representational drift-the proportion of active units-it should also explore other features traditionally associated with representational drift, such as spatial information and the correlation between place fields.

      This absence of features is related to the abstract nature of some of the tasks simulated in our paper. In our original submission the transition between a predictive coding task to more abstract tasks was not clearly explained, creating some confusion regarding the measured metrics. We now clarified the motivation for this transition.

      Both the initial simulation and the new experimental data analysis include spatial information (Figures 2,3). The following simulations (Figure 4) with many parameter choices use more abstract tasks, for which the notion of correlation between place cells and spatial information loses its meaning as there is no spatial ordering of the inputs, and every input is encountered only once. Spatial information becomes strongly correlated with the inverse of the active fraction metric. The correlation between place cells is also directly linked to increase in sparseness for these tasks.

      (4) There should be a clearer illustration of how labeling noise influences learning dynamics and sparsification.

      This was indeed confusing in the original submission. We removed the simulations with label noise from Figure 4, and added a supplementary figure (S2) illustrating the different effects of label noise.

      (5) The representational drift observed in this study's simulations appears to be nonstationary, which differs from in vivo reports. The reasons for this discrepancy should be clarified.

      We added experimental results from three additional labs demonstrating a change in activity statistics (i.e. increase in spatial information and increase in sparseness) over a long period of time. We suggest that such a change long after the environment is already familiar is an indication for the second phase, and stress that this change seems to saturate at some point, and that most drift papers start collecting data after this saturation, hence this effect was missed in previous in vivo reports. Furthermore, these effects are become more abundant with the advent on new calcium imaging methods, as the older electrophysiological regording methods did not usually allow recording of large amounts of cells for long periods of time. The new Table 3 surveys several experimental papers, emphasizing the degree of familiarity with the environment.

      (6) A distinctive feature of the hippocampus is its ability to learn different spatial representations for various environments. The study does not test representational drift in this context, a topic of significant interest to the community. Whether the authors choose to delve into this is up to them, but it should at least be discussed more comprehensively, as it's only briefly touched upon in the current manuscript version.

      There are two facets to the topic of multiple environments. First, are the results of the current paper relevant when there are multiple environments? Second, what is the interaction between brain mechanisms of dealing with multiple environments and the results of the current paper?

      We believe the answer to the first question is positive. The near-orthogonality of representations between environments implies that changes in one can happen without changes in the other. This is evident, for instance, in Khatib et al and Geva et al - in both cases, drift seems to happen independently in two environments, even though they are visited intermittently and are visually similar.

      The second question is a fascinating one, and we are planning to pursue it in future work. While the exact way in which the brain achieves this near-independence is an open question, remapping is one possible window into this process.

      We extended the discussion to make these points clear.

      (7) The methods section should offer more details about the neural nets employed in the study. The manuscript should be explicit about the terms "hidden layer", "units", and "neurons", ensuring they are defined clearly and not used interchangeably..

      We changed the usage of these terms to be more coherent and made our code publicly available. Specifically, “units” refer to artificial networks and “neurons” to biological ones.

      In addition, each reviewer has raised both major and minor concerns. These are listed below and should be addressed where possible.

      Reviewer #1 (Recommendations For The Authors):

      I recommend that the authors edit the text to soften their claims. For example:

      In the abstract "To uncover the underlying mechanism, we..." could be changed to "To investigate, we..."

      Agree. Done

      On line 21, "Specifically, recent studies showed that..." could be changed to "Specifically, recent studies suggest that..."

      Agree. Done

      On line 100, "All cases" should probably be softened to "Most cases" or more details should be added to Figure 3 to support the claim that every simulation truly had a phase of directed random motion.

      The text was changed in accordance with the reviewer’s suggestion. In addition, the figure was changed and only includes simulations in which we expected unit sparsity to arise (without label noise). We also added explanations and supplementary figures for label noise.

      Unless I missed something obvious, there is no new experimental data analysis reported in the paper. Thus, line 159 of the discussion, "a phenomenon we also observed in experimental data" should be changed to "a phenomenon that recently reported in experimental data."

      We thank the reviewer for drawing our attention to this. We now analyzed data from three other labs, two of which are novel analyses on existing data. All four datasets show the same trends of sparseness with increasing spatial information. The new Figure 3 and text now describe this.

      On line 179 of the Discussion, "a family of network configurations that have identical performance..." could be softened to "nearly identical performance." It would be possible for networks to have minuscule differences in performance that are not detected due to stochastic batch effects or limits on machine precision.

      The text was changed in accordance with the reviewer’s suggestion.

      Other minor comments:

      Citation 44 is missing the conference venue, please check all citations are formatted properly.

      Corrected.

      In the discussion on line 184, the connection to remapping was confusing to me, particularly because the cited reference (Sanders et al. 2020) is more of a conceptual model than an artificial network model that could be adapted to the setting of noisy learning considered in this paper. How would an RNN model of remapping (e.g. Low et al. 2023; Remapping in a recurrent neural network model of navigation and context inference) be expected to behave during the sparsifying portion of drift?

      We now clarified this section. The conceptual model of Sanders et al includes a specific prediction (Figure 7 there) which is very similar to ours - a systematic change in robustness depending on duration of training. Regarding the Low et al model, using such mechanistic models is an exciting avenue for future research.

      Reviewer #2 (Recommendations For The Authors):

      I only have two major questions.

      (1) Learning multiple representations: Memory systems in the brain typically must store many distinct memories. Certainly, the hippocampus, where RD is prominent, is involved in the ongoing storage of episodic memories. But even in the idealized case of just two spatial memories, for example, two distinct linear tracks, how would this learning process look? Would there be any interference between the two learning processes or would they be largely independent? Is the separation of time scales robust to the number of representations stored? I understand that to answer this question fully probably requires a research effort that goes well beyond the current study, but perhaps an example could be shown with two environments. At the very least the authors could express their thoughts on the matter.

      There are two facets to the topic of multiple environments. First, are the results of the current paper relevant when there are multiple environments? Second, what is the interaction between brain mechanisms of dealing with multiple environments and the results of the current paper?

      We believe the answer to the first question is positive. The near-orthogonality of representations between environments implies that changes in one can happen without changes in the other. This is evident, for instance, in Khatib et al and Geva et al - in both cases, drift seems to happen independently in two environments, even though they are visited intermittently and are visually similar.

      The second question is a fascinating one, and we are planning to pursue it in future work. While the exact way in which the brain achieves this near-independence is an open question, remapping is one possible window into this process.

      We extended the discussion to make these points clear.

      (2) Directed drift versus stationarity: I could not help but notice that the RD illustrated in Fig.2D is not stationary in nature, i.e. the upper right and lower left panels are quite different. This appears to contrast with findings in the hippocampus, for example, Fig.3e-g in (Ziv et al, 2013). Perhaps it is obvious that a directed process will not be stationary, but the authors note that there is a third phase of steady-state null drift. Is the RD seen there stationary? Basically, I wonder if the process the authors are studying is relevant only as a novel environment becomes familiar, or if it is also applicable to RD in an already familiar environment. Please discuss the issue of stationarity in this context.

      The non-stationarity noted by the reviewer is indeed a major feature of our observations, and is indeed linked to familiarity. We divide learning into three phases (now more clearly stated in Table 1 and Figure 4C). The first, rapid, phase consists of improvement of performance - corresponding to initial familiarity with the environment. The third phase, often reported in the literature of representational drift, is indeed stationary and obtained after prolonged familiarity. Our work focuses on the second phase, which is not as immediate as the first one, and can take several days. We note in the discussion that experiments which include a long familiarization process can miss this phase (see also Table 3). Furthermore, we speculate that real life is less stationary than a lab environment, and this second phase might actually be more relevant there.

      Reviewer #3 (Recommendations For The Authors):

      Most of my general recommendations are outlined in the public review. A large portion of my comments regards increasing clarity and explicitly defining many of the terms used which may require generating more figures (to better illustrate the generality of findings) or modifying existing figures (e.g., to show how/where the three stages of learning map onto the authors' data).

      Sparsification is not clearly defined in the main text. As I read it, sparsification is meant to refer to the activity of neurons, but this needs to be clearly defined. For example, lines 262-263 in the methods define "sparseness" by the number of active units, but lines 116-117 state: "For label noise, the dynamics were qualitatively different, the fraction of active units did not reduce, but the activity of the units did sparsify." If the fraction of active units (defined as "sparseness") did not change, what does it mean that the activity of the units "sparsified"? If the authors mean that the spatial activity patterns of hidden units became more sharply tuned, this should be clearly stated.

      We now defined precisely the two measures we use - Active Fraction, and Fraction Active Units. There is a new section with an accompanying figure in the Methods section. As Figure S2 shows, the noise statistics (label noise vs. update noise) differentially affects these two measures.

      Likewise, it is unclear which of the features the authors outlined - spatial information, active proportion of units, and spatial correlation - are meant to represent drift. The authors should clearly delineate which of these three metrics they mean to delineate drift in the main text rather than leave it to the reader to infer. While all three are mentioned early on in the text (Figure 2), the authors focus more on sparseness in the last half of the text, making it unclear if it is just sparseness that the authors mean to represent drift or the other metrics as well.

      The main focus of our paper is on the non-stationarity of drift. Namely that features (such as these three) systematically change in a directed manner as part of the drift process. This is in The new analyses of experimental data show sparseness and spatial information.

      The focus on sparseness in the second half of the paper is because we move to more abstract These are also easy to study in the more abstract tasks in the second part of the paper. In our original submission the transition between a predictive coding task to more abstract tasks was not clearly explained, creating some confusion regarding the measured metrics. We now clarified the motivation for this transition.

      It is not clear if a change in the number of active units alone constitutes "drift", especially since Geva et al. (2023) recently showed that both changes in firing rate AND place field location drive drift, and that the passage of time drives changes in activity rate (or # cells active).

      Our work did not deal with purely time-dependent drift, but rather focused on experience-dependence. Furthermore, Geva et al study the stationary phase of drift, where we do not expect a systematic change in the total number of cells active. They report changes in the average firing rate of active cells in this phase, as a function of time - which does not contradict our findings.

      "hidden layer", "units", and "neurons" seem to be used interchangeably in the text (e.g., line 81-85). However, this is confusing in several places, in particular in lines 83-85 where "neurons" is used twice. The first usage appears to refer to the rate maps of the hidden layer units simulated by the authors, while the second "neurons" appears to refer to real data from Ziv 2013 (ref 5). The authors should make it explicit whether they are referring to hidden layer units or actual neurons to avoid reader confusion.

      We changed the usage of these terms to be more coherent. Specifically, “units” refer to artificial networks and “neurons” to biological ones.

      The authors should clearly illustrate which parts of their findings support their three-phase learning theory. For example, does 2E illustrate these phases, with the first tenth of training time points illustrating the early phase, time 0.1-0.4 illustrating the intermediate phase, and 0.4-1 illustrating the last phase? Additionally, they should clarify whether the second and third stages are meant to represent drift, or is it only the second stage of directed manifold exploration that is considered to represent drift? This is unclear from the main text.

      The different processes (reduction in loss, reduction in Hessian) happen in parallel with different timescales. Thus, there are no sharp transitions between the phases. This is now explained in the text in relation to figure 4C, where the approximate boundaries are depicted.

      The term drift is often used to denote a change in representation without a change in behavior. In this sense, both the second and third phases correspond to drift. Only the third stage is stationary. This is now emphasized in the text and in the new Table 1. Regarding experimental data, apart from the new figure 3 with four datasets, we also summarize in Table 3 the relation between duration of familiarity and stationarity of the data.

      Line 45 - It appears that the acronym ML is not defined above here anywhere.

      Added.

      Line 71: the ReLU function should be defined in the text, e.g., sigma(x) = x if x > 0 else 0.

      Added.

      106-107: Figures (or supplemental figures) to demonstrate how most parameters do not influence sparsification dynamics are warranted. As written, it is unclear what "most parameters" mean - all but noise scale. What about the learning rule? Are there any interactions between parameters?

      We now removed the label noise from Figure 4, and added two supplementary figures to clearly explain the effect of parameters. Figure 4 itself was also redone to clarify this issue.

      2F middle: should "change" be omitted for SI?

      The panel was replaced by a new one in Figure 3.

      116-119: A figure showing how results differ for label noise is warranted.

      This is now done in Figure S1, S2.

      124: typo, The -> the

      Corrected.

      127-129: This conclusion statement is the first place in the text where the three stages are explicitly outlined. There does not appear to be any support or further explanation of these stages in the text above.

      We now explain this earlier at the end of the Introduction section, along with the new Table 1 and marking on Figure 4C.

      132-133 seems to be more of a statement and less of a prediction or conclusion - do the authors mean "the flatness of the loss landscape in the vicinity of the solution predicts the rate of sparsification?"

      We thank the reviewer for this observation. The sentence was rephrased:

      Old: As illustrated in Fig. 1, different solutions in the zero-loss manifold might vary in some of their properties. The specific property suggested from theory is the flatness of the loss landscape in the vicinity of the solution.

      New: As illustrated in Fig. 1, solutions in the zero-loss manifold have identical loss, but might vary in some of their properties. The authors of [26] suggest that noisy learning will slowly increase the flatness of the loss landscape in the vicinity of the solution.

      135: typo, it's -> its

      Corrected.

      Line 135-136 "Crucially, the loss on the 136 entire manifold is exactly zero..." This appears to contradict the Figure 4A legend - the loss appears to be very high near the top and bottom edges of the manifold in 4A. Do the authors mean that the loss along the horizontal axis of the manifold is zero?

      The reviewer is correct. The manifold mentioned in the sentence is indeed the horizontal axis. We changed the text and the figure to make it clearer.

      Equation 6: This does not appear to agree with equation 2 - should there be an E_t term for an expectation function?

      Corrected.

      Line 262-263: "Sparseness means that a unit has become inactive for all inputs." This should also be stated explicitly as the definition of sparseness/sparsification in the main text.

      We now define precisely the two measures we use - Active Fraction, and Fraction Active Units. There is a new section with an accompanying figure in the Methods section. As Figure S2 shows, the noise statistics (label noise vs. update noise) differentially affects these two measures.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #2 (Public Review):

      Weaknesses:

      The comparison of affinity predictions derived from AlphaFold2 and H3-opt models, based on molecular dynamics simulations, should have been discussed in depth. In some cases, there are huge differences between the estimations from H3-opt models and those from experimental structures. It seems that the authors obtained average differences of the real delta, instead of average differences of the absolute value of the delta. This can be misleading, because high negative differences might be compensated by high positive differences when computing the mean value. Moreover, it would have been good for the authors to disclose the trajectories from the MD simulations.

      Thanks for your careful checks. We fully understand your concerns about the large differences when calculating affinity. To understand the source of these huge differences, we carefully analyzed the trajectories of the input structures during MD simulations. We found that the antigen-antibody complex shifted as it transited from NVT to NPT during pre-equilibrium, even when restraints are used to determine the protein structure. To address this issue, we consulted the solution provided on Amber's mailing list (http://archive.ambermd.org/202102/0298.html) and modified the top file ATOMS_MOLECULE item of the simulation system to merge the antigen-antibody complexes into one molecule. As a result, the number of SOLVENT_POINTERS was also adjusted. Finally, we performed all MD simulations and calculated affinities of all complexes.

      We have corrected the “Afterwards, a 25000-step NVT simulation with a time step of 1 fs was performed to gradually heat the system from 0 K to 100 K. A 250000-step NPT simulation with a time step of 2 fs was carried out to further heat the system from 100 K to 298 K.” into “Afterwards, a 400-ps NVT simulation with a time step of 2 fs was performed to gradually heat the system from 0 K to 298 K (0–100 K: 100 ps; 100-298 K: 200 ps; hold 298 K: 100 ps), and a 100-ps NPT simulation with a time step of 2 fs was performed to equilibrate the density of the system. During heating and density equilibration, we constrained the antigen-antibody structure with a restraint value of 10 kcal×mol-1×Å-2.” and added the following sentence in the Method section of our revised manuscript: “The first 50 ns restrains the non-hydrogen atoms of the antigen-antibody complex, and the last 50 ns restrains the non-hydrogen atoms of the antigen, with a constraint value of 10 kcal×mol-1×Å-2”

      In addition, we have corrected the calculation of mean deltas using absolute values and have demonstrated that the average affinities of structures predicted by H3-OPT were closer to those of experimentally determined structures than values obtained through AF2. These results have been updated in the revised manuscript. However, significant differences still exist between the estimations of H3-OPT models and those derived from experimental structures in few cases. We found that antibodies moved away from antigens both in AF2 and H3-OPT predicted complexes during simulations, resulting in RMSDbackbone (RMSD of antibody backbone) exceeding 20 Å. These deviations led to significant structural changes in the complexes and consequently resulted in notable differences in affinity calculations. Thus, we removed three samples (PDBID: 4qhu, 6flc, 6plk) from benchmark because these predicted structures moved away from the antigen structure during MD simulations, resulting in huge energy differences from the native structures.

      Author response table 1.

      We also appreciate your reminder, and we have calculated all RMSDbackbone during production runs (SI Fig. 5).

      Author response image 1.

      Reviewer #3 (Public Review):

      Weaknesses:

      The proposed method lacks of a confidence score or a warning to help guiding the users in moderate to challenging cases.

      We were sorry for our mistakes. We have updated our GitHub code and added following sentences to clarify how we train this confidence score module in Method Section: “Confidence score prediction module

      We apply an MSE loss for confidence prediction, label error was calculated as the Cα deviation of each residue after alignment. The inputs of this module are the same as those used for H3-OPT, and it generates a confidence score ranging from 0 to 100. The dropout rates of H3-OPT were set to 0.25. The learning rate and weight decay of Adam optimizer are set to 1 × 10−5 and 1 × 10−4, respectively.”

      Reviewer #2 (Recommendations For The Authors):

      I would strongly suggest that the authors deepen their discussion on the affinity prediction based on Molecular Dynamics. In particular, why do the authors think that some structures exhibit huge differences between the predictions from the experimental structure and the predicted by H3-opt? Also, please compute the mean deltas using the absolute value and not the real value; the letter can be extremely misleading and hidden very high differences in different directions that are compensating when averaging.

      I would also advice to include graphical results of the MD trajectories, at least as Supp. Material.

      We gratefully thank you for your feedback and fully understand your concerns. We found the source of these huge differences and solved this problem by changing method of MD simulations. Then, we calculated all affinities and corrected the mean deltas calculation using the absolute value. The RMSDbackbone values were also measured to enable accurate affinity predictions during production runs (SI Fig. 5). There are still big differences between the estimations of H3-OPT models and those from experimental structures in some cases. We found that antibodies moved away from antigens both in AF2 and H3-OPT predicted complexes during simulations, resulting in RMSDbackbone exceeding 20 Å. These deviations led to significant structural changes in the complexes and consequently resulted in notable differences in affinity calculations. Thus, we removed three samples (PDBID: 4qhu, 6flc, 6plk) from benchmark.

      Thanks again for your professional advice.

      Reviewer #3 (Recommendations For The Authors):

      (1) I am pleased with the most of the answers provided by the authors to the first review. In my humble opinion, the new manuscript has greatly improved. However, I think some answers to the reviewers are worth to be included in the main text or supporting information for the benefit of general readers. In particular, the requested statistics (i.e. p-values for Cα-RMSD values across the modeling approaches, p-values and error bars in Fig 5a and 5b, etc.) should be introduced in the manuscript.

      We sincerely appreciate your advice. We have added the statistics values to Fig. 4 and Fig. 5 to our manuscript.

      Author response image 2.

      Author response image 3.

      (2) Similarly, authors state in the answers that "we have trained a separate module to predict the confidence score of the optimized CDR-H3 loops". That sounds a great improvement to H3-OPT! However, I couldn't find any reference of that new module in the reviewed version of the manuscript, nor in the available GitHub code. That is the reason for me to hold the weakness "The proposed method lacks of a confidence score".

      We were really sorry for our careless mistakes. Thank you for your reminding. We have updated our GitHub code and added following sentences to clarify how we train this confidence score module in Method Section:

      “Confidence score prediction module

      We apply an MSE loss for confidence prediction, label error was calculated as the Cα deviation of each residue after alignment. The inputs of this module are the same as those used for H3-OPT, and it generates a confidence score ranging from 0 to 100. The dropout rates of H3-OPT were set to 0.25. The learning rate and weight decay of Adam optimizer are set to 1 × 10−5 and 1 × 10−4, respectively.”

      (3) I acknowledge all the efforts made for solving new mutant/designed nanobody structures. Judging from the solved structures, mutants Y95F and Q118N seems critical to either crystallographic or dimerization contacts stabilizing the CDR-H3 loop, hence preventing the formation of crystals. Clearly, solving a molecular structure is a challenge, hence including the following comment in the manuscript is relevant for readers to correctly asset the magnitude of the validation: "The sequence identities of the VH domain and H3 loop are 0.816 and 0.647, respectively, comparing with the best template. The CDR-H3 lengths of these nanobodies are both 17. According to our classification strategy, these nanobodies belong to Sub1. The confidence scores of these AlphaFold2 predicted loops were all higher than 0.8, and these loops were accepted as the outputs of H3-OPT by CBM."

      We appreciate your kind recommendations and have revised “Although Mut1 (E45A) and Mut2 (Q14N) shared the same CDR-H3 sequences as WT, only minor variations were observed in the CDR-H3. H3-OPT generated accurate predictions with Cα-RMSDs of 1.510 Å, 1.541 Å and 1.411 Å for the WT, Mut1, and Mut2, respectively.” into “Although Mut1 (E45A) and Mut2 (Q14N) shared the same CDR-H3 sequences as WT (LengthCDR-H3 = 17), only minor variations were observed in the CDR-H3. H3-OPT generated accurate predictions with Cα-RMSDs of 1.510 Å, 1.541 Å and 1.411 Å for the WT, Mut1, and Mut2, respectively (The confidence scores of these AlphaFold2 predicted loops were all higher than 0.8, and these loops were accepted as the outputs of H3-OPT by CBM). ”. In addition, we have added following sentence in the legend of Figure 4 to ensure that readers can appropriately evaluate the significance and reliability of our validations: “The sequence identities of the VH domain and H3 loop are 0.816 and 0.647, respectively, comparing with the best template.”.

      (4) As pointed out in the first review, I think the work https://doi.org/10.1021/acs.jctc.1c00341 is worth acknowledging in section "2.2 Molecular dynamics (MD) simulations could not provide accurate CDR-H3 loop conformations" of supplementary material, as it constitutes a clear reference (and probably one of the few) to the MD simulations that authors pretend to perform. Similarly, the work https://doi.org/10.3390/molecules28103991 introduces a former benchmark on AI algorithms for predicting antibody and nanobody structures that readers may find interest to contrast with the present work. Indeed, this later reference is used by authors to answer a reviewer comment.

      Thanks a lot for your valuable comments. We have added these references in the proper positions in our manuscript.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Response to Reviewer 1 Comments (PublicReview)

      Point 1: First, the authors should provide more convincing data showing that tor and tapA genes are indeed duplicated genes in A. flavus. The authors appeared to use the A. flavus PTS strain as a parental strain for constructing the tor and tapA mutants. If so, the A. flavus CA14 strain (Hua et al., 2007) should be the parental wild-type strain for the A. flavus PTS strain. I did a BLAST search in NCBI for the torA (AFLA_044350) and tapA (AFLA_092770) genes using the most recent CA14 genome assembly sequence (GCA_014784225.2) and only found one allele for each gene: torA on chromosome 7 and tapA on chromosome 3. I could not find any other parts with similar sequences. Even in another popular A. flavus wild-type strain, NRRL3357, both torA and tapA exist as a single allele. Based on the published genome assembly data for A. flavus, there is no evidence to support the idea that tor and tapA exist as copies of each other. Therefore, the authors could perform a Southern blot analysis to further verify their claim. If torA and tapA indeed exist as duplicate copies in different chromosomal locations, Southern blot data could provide supporting results.

      Response 1: We thank the reviewer for their insightful observation. Based on the southern blot analysis results presented in Figure 1, we have determined that torA and tapA are single-copy genes. Additionally, we conducted protoplast transformation experiments repeated several times. which revealed that both torA and tapA transformants exhibited ectopic mutations. It is plausible that the deletion of torA and tapA genes may lead to the demise of A. flavus, this phenomenon is consistent with previous studies conducted on the fungus Fusarium graminearum[1].To ensure the rigor of the study, we have retracted the previously incorrect conclusion. We once again express our heartfelt appreciation to the experts for their valuable suggestions.

      Author response image 1.

      Fig.1 Southern blot hybridization analyses of WT, torA, and tapA transformants. (A) The structure diagram of the torA gene. (B) The structure diagram of the tapA gene. (C) Southern blot hybridization analyses of torA gene. (D) Southern blot hybridization analyses of tapA gene.

      Point 2: Second, the authors should consider the possibility of aneuploidy for their constructed mutants. When an essential gene is targeted for deletion, aneuploidy often occurs even in a fungal strain without the "ku" mutation, which results in seemingly dual copies of the gene. As the authors appear to use the A. flavus PTS strain having the "ku" mutation, the parental strain has increased genome instability, which may result in enhanced chromosomal rearrangements. So, it will be necessary to Illumina-sequence their tor and tapA mutants to make sure that they are not aneuploidy.

      Response 2: Thank you for your comment. Based on the sequencing results of the torA and tapA mutants, it was determined that the torA and tapA genes were still present in both mutants. In this case, it suggests that the torA and tapA genes may have undergone a genetic rearrangement or insertion at a different site in the mutant strains.

      Point 3: Furthermore, the genetic nomenclature +/- and -/- should be reserved for heterozygous and homozygous mutants in a diploid strain. As A. flavus is not a diploid strain, this type of description could cause confusion for the readers.

      Response 3: Thank you for your suggestion. We acknowledge your concerns about potential confusion caused by using this type of description, and we agree that it is best to avoid any misunderstandings for readers. Therefore, we have decided to remove this part of the content from the manuscript.

      Response to Reviewer 2 Comments (PublicReview)

      Point 1: However, findings have not been deeply explored and conclusions are mostly are based on parallel phenotypic observations. In addition, there are some concerns for the conclusions.

      Response 1: We are grateful for the suggestion. We conduct additional experiments and analyses to provide a more comprehensive understanding and address concerns raised.

      Response to Reviewer 3 Comments (PublicReview)

      Point 1: The paper by Li et al. describes the role of the TOR pathway in Aspergillus flavus. The authors tested the effect of rapamycin in WT and different deletion strains. This paper is based on a lot of experiments and work but remains rather descriptive and confirms the results obtained in other fungi. It shows that the TOR pathway is involved in conidiation, aflatoxin production, pathogenicity, and hyphal growth. This is inferred from rapamycin treatment and TOR1/2 deletions. Rapamycin treatment also causes lipid accumulation in hyphae. The phenotypes are not surprising as they have been shown already for several fungi. In addition, one caveat is in my opinion that the strains grow very slowly and this could cause many downstream effects. Several kinases and phosphatases are involved in the TOR pathway. They were known from S. cerevisiae or filamentous fungi. The authors characterized them as well with knock-out approaches.

      Response 1: Thank you for your comment. The role of the target of rapamycin (TOR) signaling pathway is of fundamental importance in the physiological processes of diverse eukaryotic organisms. Nevertheless, its precise involvement in regulating the developmental and virulent characteristics of opportunistic pathogenic fungi, such as A. flavus, has yet to be fully elucidated. Furthermore, the mechanistic underpinnings of TOR pathway activity specifically in A. flavus remain largely unresolved. Consequently, our study represents a significant contribution as the first comprehensive exploration of the conserved TOR signaling pathway encompassing a majority of its constituent genes in A. flavus.

      Response to Reviewer 1 Comments (Recommendations For The Authors)

      Point 1: In Table S3, the authors indicated that the Δku70 ΔniaD ΔpyrG::pyrG strain is A. flavus wild-type strain. However, this strain is not a wild-type strain because it seems like a control strain after introducing the pyrG gene into the A. flavus PTS strain (Δku70 ΔniaD ΔpyrG). So please indicate the real wild-type A. flavus strain name to help readers find out its original genome sequence data. Also, the reference for this Δku70 ΔniaD ΔpyrG::pyrG strain is "saved in our lab". This is not an eligible reference. If you use this control strain for the first time in this study, it should be described as "In this study". Otherwise, please indicate the proper reference for which the strain was first used.

      Response 1: Thank you for your valuable feedback on our manuscript. We appreciate your attention to detail and the opportunity to clarify the information regarding the strain in Table S3. The A. flavus CA14 strain which produces aflatoxins and large sclerotia was isolated from a pistachio bud in the Wolfskill Grant Experimental Farm (University of Davis, Winters, California, USA)[2]. The A. flavus CA14 strain is the parental wild-type strain for the A. flavus CA14 PTs (Δku70, ΔniaD, ΔpyrG) strain. The recipient strain CA14 PTs has been used satisfactorily in gene knockout and subsequent genetic complementation experiments[3]. In this study, the A. flavus CA14 PTs strain was used as the transformation recipient strain, and the control strain (Δku70, ΔniaD, ΔpyrG::pyrG) created by introducing the pyrG gene into the A. flavus CA14 PTs strain. Refer to previously published literature[4],this control strain (Δku70, ΔniaD, ΔpyrG::pyrG) was named wild-type strain. Therefore, this control strain was also named wild-type strain in this study. As this control strain is indeed used in this study, we will revise the reference to "In this study" Once again, we appreciate your keen attention to detail and thank you for bringing these issues to our attention.

      Response to Reviewer 2 Comments (Recommendations For The Authors)

      Point 1: As in response: However, the tor gene in A. flavus exhibited varying copy numbers, as was confirmed by absolute quantification PCR at the genome level (Table S1). However, it is hard to understand Table S1: Estimation of copy number of tor gene in A. flavus toro and sumoo stand for the initial copy number, and the data are figured as the mean {plus minus} 95% confidence limit. CN is copy number. As indicated in the section of Method, using sumo gene as reference, the tor and tapA gene copy number was calculated by standard curve. In Table S1 of WT, for tor gene, CN value is 1412537 compared to 1698243 in tor+/-, for the reference gene sumo,794328 compared to1584893, how these data could support copy gene numbers of tor?

      Response 1: Thank you for your suggestion. We understand the confusion with the data presented in Table S1 regarding the copy number estimation of the tor gene in A. flavus. We apologize for not providing a clear explanation for the data in the table. Quantitative real-time PCR (qPCR) is widely used to determine the copy number of a specific gene. It involves amplifying the gene of interest and a reference gene simultaneously using specific primers and probes. By comparing the amplification curves of the gene of interest and the reference gene, you can estimate the relative copy number of the gene.

      To address your concern and provide more accurate information, we have re-performed the copy number analysis using southern blot. Southern blot analysis allows for the direct estimation of gene copy number by hybridizing genomic DNA with a specific probe for the gene. This method provides more reliable and accurate results in determining gene copy numbers. The southern blot analysis results are presented in Figure 1.

      We appreciate your input and apologize for any confusion caused by the earlier presentation of the data.

      Point 2: In response: For the knockout of the FRB domain, we used the homologous recombination method, but because tor genes are double-copy genes, there are also double copies in the FRB domain. Despite our efforts, we encountered challenges in precisely determining the location of the other copy of the tor gene. I could not understand these consistent data, why not for using sequencing?

      Response 2: Thank you for your comment. We observed that the torA gene is a single copy. We removed this part of the results to avoid any ambiguity or potential misinterpretation.

      Point 3: Response in Due to the large number of genes involved, we did not perform a complementation experiment. If there were no complementation data, how to demonstrate data are solid?

      Response 3: Thank you for your important suggestion. We understand that complementation experiments are commonly used to validate gene deletions. Therefore, to ensure the reliability of our data, we have conducted supplementary experiments on specific gene deletions, such as ΔsitA-C and Δppg1-C. Thank you again for your positive comments and valuable suggestions to improve the quality of our manuscript.

      References:

      (1) Yu F, Gu Q, Yun Y, et al. The TOR signaling pathway regulates vegetative development and virulence in Fusarium graminearum. New Phytol. 2014; 203(1): 219-32.

      (2) Hua SS, Tarun AS, Pandey SN, Chang L, Chang PK. Characterization of AFLAV, a Tf1/Sushi retrotransposon from Aspergillus flavus. Mycopathologia. 2007 Feb;163(2):97-104.

      (3) Chang PK, Scharfenstein LL, Mack B, Hua SST. Genome sequence of an Aspergillus flavus CA14 strain that is widely used in gene function studies. Microbiol Resour Announc. 2019 Aug 15;8(33):e00837-19.

      (4) Zhu Z, Yang M, Yang G, Zhang B, Cao X, Yuan J, Ge F, Wang S. PP2C phosphatases Ptc1 and Ptc2 dephosphorylate PGK1 to regulate autophagy and aflatoxin synthesis in the pathogenic fungus Aspergillus flavus. mBio. 2023 Oct 31;14(5):e0097723.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review)

      Summary:

      Huang and colleagues present a method for approximation of linkage disequilibrium (LD) matrices. The problem of computing LD matrices is the problem of computing a correlation matrix. In the cases considered by the authors, the number of rows (n), corresponding to individuals, is small compared to the number of columns (m), corresponding to the number of variants. Computing the correlation matrix has cubic time complexity , which is prohibitive for large samples. The authors approach this using three main strategies:

      1. they compute a coarsened approximation of the LD matrix by dividing the genome into variant-wise blocks which statistics are effectively averaged over;

      2. they use a trick to get the coarsened LD matrix from a coarsened genomic relatedness matrix (GRM), which, with time complexity, is faster when n << m;

      3. they use the Mailman algorithm to improve the speed of basic linear algebra operations by a factor of log(max(m,n)). The authors apply this approach to several datasets.

      Strengths:

      The authors demonstrate that their proposed method performs in line with theoretical explanations.

      The coarsened LD matrix is useful for describing global patterns of LD, which do not necessarily require variant-level resolution.

      They provide an open-source implementation of their software.

      Weaknesses:

      The coarsened LD matrix is of limited utility outside of analyzing macroscale LD characteristics. The method still essentially has cubic complexity--albeit the factors are smaller and Mailman reduces this appreciably. It would be interesting if the authors were able to apply randomized or iterative approaches to achieve more fundamental gains. The algorithm remains slow when n is large and/or the grid resolution is increased.

      Thanks for your positive and accurate evaluation! We acknowledge the weakness and include some sentences in Discussion.

      “The weakness of the proposed method is obvious that the algorithm remains slow when the sample size is large or the grid resolution is increased. With the availability of such as UK Biobank data (Bycroft et al., 2018), the proposed method may not be adequate, and much advanced methods, such as randomized implementation for the proposed methods, are needed.”  

      Reviewer #2 (Public Review)

      Summary:

      In this paper, the authors point out that the standard approach of estimating LD is inefficient for datasets with large numbers of SNPs, with a computational cost of , where n is the number of individuals and m is the number of SNPs. Using the known relationship between the LD matrix and the genomic- relatedness matrix, they can calculate the mean level of LD within the genome or across genomic segments with a computational cost of . Since in most datasets, n<<m, this can lead to major computational improvements. They have produced software written in C++ to implement this algorithm, which they call X-LD. Using the output of their method, they estimate the LD decay and the mean extended LD for various subpopulations from the 1000 Genomes Project data.

      Strengths:

      Generally, for computational papers like this, the proof is in the pudding, and the authors appear to have been successful at their aim of producing an efficient computational tool. The most compelling evidence of this in the paper is Figure 2 and Supplementary Figure S2. In Figure 2, they report how well their X- LD estimates of LD compare to estimates based on the standard approach using PLINK. They appear to have very good agreement. In Figure S2, they report the computational runtime of X-LD vs PLINK, and as expected X-LD is faster than PLINK as long as it is evaluating LD for more than 8000 SNPs.

      Weakness:

      While the X-LD software appears to work well, I had a hard time following the manuscript enough to make a very good assessment of the work. This is partly because many parameters used are not defined clearly or at all in some cases. My best effort to intuit what the parameters meant often led me to find what appeared to be errors in their derivation. As a result, I am left worrying if the performance of X-LD is due to errors cancelling out in the particular setting they consider, making it potentially prone to errors when taken to different contexts.

      Thanks for you critical reading and evaluation. We do feel apologize for typos, which have been corrected and clearly defined now (see Eq 1 and Table 1). In addition, we include more detailed mathematical steps, which explain how LD decay regression is constructed and consequently finds its interpretation (see the detailed derivation steps between Eq 3 and Eq 4).

      Impact:

      I feel like there is value in the work that has been done here if there were more clarity in the writing. Currently, LD calculations are a costly step in tools like LD score regression and Bayesian prediction algorithms, so a more efficient way to conduct these calculations would be useful broadly. However, given the difficulty I had following the manuscript, I was not able to assess when the authors’ approach would be appropriate for an extension such as that.

      See our replies below in responding to your more detailed questions.

      Reviewer #1 (Recommendations For The Authors)

      There are numerous linguistic errors throughout, making it challenging to read.

      It is unclear how the intercepts were chosen in Figure S2. Since theory only gives you the slopes, it seems like it would make more sense to choose the intercept such that it aligns with the empirical results in some way.

      Thanks for your critical evaluation. We do feel apologize some typos, and we have read it through and clarify the text as much as possible. In addition, we included Table 1, which introduces mathematical symbols of the paper.

      In Figure S2, the two algorithms being compared have different software implementations, PLINK vs X-LD. Their real performance not only depended on the time complexity of the algorithms (right-side y-axis), but also how the software was coded. PLINK is known for its excellent programming. If we could have programmed as well as Chris Chang, the performance of X-LD should have been even better and approach the ratio m/n. However, even under less skilled programming, X-LD outperformed plink.

      Reviewer #2 (Recommendations For The Authors):

      Thank you for the chance to review your manuscript. It looks like compelling work that could be improved by greater detail. Providing the level of detail necessary may require creating a Supplementary Note that does a lot of hand-holding for readers like me who are mathematically literate but who don’t have the background that you do. Then you can refer readers to the Supplement if they can’t follow your work.

      We fix the problems and style issues as possible as we can.

      Regarding the weakness section in the public review, here are a few examples of where I got confused, though this list is not exhaustive.

      1) Consider Equation 1 (line 100), which I believe must be incorrect. Imagine that g consists of two SNPs on different chromosomes with correlation rho. Then ell_g (which is defined as the average squared elements of the correlation matrix) would be

      ell_g = 1/4 (1 + 1 + rho^2 + rho^2) = (1+rho^2)/2.

      But ell_1=1 and ell_2=1 and ell_12=rho^2 (The average squared elements of the chromosome-specific correlation matrices and the cross-chromosome correlation matrix, respectively). So

      sum(ell_i)+sum(ell_ij) = 1 + 1 + rho^2 + rho^2 = (1+rho^2)*2.

      I believe your formulas would hold if you defined your LD values as the sum of squared correlations instead of the mean, but then I don’t know if the math in the subsequent sections holds. I think this problem also holds for Eq 2 and therefore makes Eqs 3 and 4 difficult to interpret.

      Thanks for your attentive review and invaluable suggestions. We acknowledge the typo in calculating the mean in Eq 1, resulting in difficulties in understanding the equations. We sincerely apologize for this oversight. To address this issue and ensure clarity in the interpretation of Eq 3 and Eq 4, we have provided more detailed explanations (see the derivation between Eq 3 and Eq 4).

      2) I didn’t know what the parameters are in Equation 3. The vector ell needs to be defined. Is it the vector of ell_i for each chromosomal segment i? I’m also confused by the definition of m_i, which is defined on line 113 as the “SNP number of the i-th chromosome.” Do the authors mean the number of SNPs on the i-th chromosomal segment? If so, it wasn’t clear to me how Eq 2 and Eq 3 imply Eq 4. Further, it wasn’t clear to me why E(b1) quantifies the average LD decay of the genome. I’m used to seeing plots of average LD as a function of distance between SNPs to calculate this, though I’m admittedly not a population geneticist, so maybe this is standard. Standard or not, readers deserve to have their hands held a bit more through this either in the text or in a Supplementary Note.

      Thanks for your insightful feedback. When we were writing this paper, our actually focus was Eq 3 and to establish the relationship between chromosomal LD and the reciprocal of the length of chromosome (Fig 6A) – which was surrogated by the number of SNPs, the correlation between ell_i and 1/m_i.

      We asked around our friends who are population geneticists, who anticipated the correlation between chromosomal LD (ell) and 1/m. The rationale simple if one knows the very basis of population genetics. A long chromosome experiences more recombination, which weakens LD for a pair of loci. In particular, for a pair of loci D_t=D_0 (1-c)^t. D_t the LD at the t generation, D_0 at the 0 generation, and c the recombination fraction. As recombination hotspots are nearly even distributed along the genome, such as reported by Science 2019;363:eaau8861, the chromosome will be broken into the shape in Author response image 1 (Fig 1C, newly added). Along the diagonal you see tight LD block, which will be vanished in the further as predicted by D_t equation, and any loci far away from each other will not be in LD otherwise raised by such as population structure. Ideally, we assume the diagonal block of aveage size of m×m and average LD of a SNP with other SNPs inside the diagonal block (red) is l_u; and, in contrast, off-diagonal average LD (light red) to be l_uv. This logic is hidden but employed in such as ld score regression and prs refinement using LD structure.

      Author response image 1.

      But, how to estimate chromosomal LD (ell), which is overwhelming as our friends said! So, the Figure 6A is logically anticipated by a seasoned population geneticist, but has never been realized because of is nightmare. Often, those signature patterns should have been employed as showcases in releasing new reference data, such as HapMap. However, to our knowledge, this signature linear relationship has never been illustrated in those reference data.

      If you further test a population geneticist, if any chromosome will deviate from this line (Fig 6A)? The answer most likely will be chromosome 6 because of the LD tight HLA region. However, it is chromosome 11 because of its most completed sequenced centromere. Chr 11 is a surprise! With T2T sequenced population, Chr 11 will not deviate much. We predict!

      However, we suspect whether people appreciate this point, we shift our focus to efficient computation of LD—which is more likely understood. We acknowledge the lack of clarity in notation definitions and the absence of the derivation for the interpretation of b1 and b0 for LD decay regression. So, we have added a table to provide an explanation of the notation (see the Table 1) and provided additional derivations, which explained how LD decay regression was derived (see the derivation between Eq 3 and Eq 4). Figure 1C provides illustration for the underlying assumption under LD.

      The technique to bridge Eq 2~3 to Eq 4 is called “building interpretation”. It once was one of the kernel tasks for population genetics or statistical genetics, and a classical example is Haseman-Elston regression (Behavior Genetics, 1972, 2:3-19). When it is moving towards a data-driven style, the culture becomes “shut up, calculate”. Finding interpretation for a regression is a vanishing craftmanship, and people often end up with unclear results!

      3) In line 135, it’s not clear to me what is meant by . If it is , then wouldn’t the resulting matrix be a matrix of zeros since is zero everywhere except the lower off-diagonal? So maybe it is ? But then later in that line, you say that the square of this matrix is the sum of several terms of the form . Are these the scalar elements of the G matrix? But then the sum is a scalar, which can’t be true since is a matrix.

      Thanks for your attentive review. We indeed confused the definition of matrices and their elements, and should refer to the stacked off-diagonal elements of matrix . So, is a vector for variable – the relationship between sample i and j. We assume the reviewer use R software, then corresponds to mean .

      See the text between Eq 5 and Eq 6.

      “We extract two vectors , which stacks the off-diagonal elements of , and , which takes the diagonal elements of .”

      In addition, , so the ground truth is that , but not zero.

      To clarify these math symbols, we replace G with K, so as to be consistent with our other works (see Table 1).

      To derive the means and the sampling variances for and , the Eq 7 can be established by some modifications on the Delta method as exampled in Appendix I of Lynch and Walsh’s book (Lynch and Walsh, 1998). We added this sentence near Eq 7 in the main text.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Recommendations for the authors:

      Please make corrections as suggested by reviewer 1 to improve the manuscript. Specifically, reviewer 1 suggests making changes to p values in Figure 5, and the importance of citing original scholarly works related to effects of increase in excitability of sympathetic neurons by M1 receptors, and the terminology for M currents and KCNQ currents. These changes will improve the manuscript and are strongly recommended.

      The section dealing with Aging Reduces KCNQ currents seems to contain a lot of extraneous information especially in the last part of the long paragraph and this section should be rewritten for improved clarity and - the implications or lack thereof - of the correlation of KCNQ with AP firing rates. The apparent lack of correlation between KCNQ current and KCNQ2 protein needs to be better explained. This is a central part of the study and this result undercuts the premise of the paper. Additionally, the poor specificity of Linordipine for KCNQ should be pointed out in the limitations.

      Finally, the editor notes that the author response should not contain ambiguities in what was addressed in the revision. In the original summary of consolidated revisions that were requested, one clearly and separately stated point (point 4) was that experiments in slice cultures should be strongly considered to extend the significance of the work to an intact brain preparation. The author response letter seems to imply that this was done, but this is not the case. The author response seems to have combined this point with another separate point (point 3) about using KCNQ drugs, and imply that all concerns were addressed. Authors should be clear about what revisions were in fact addressed.

      Summary of recommendations from the three reviewers:

      Please make corrections as suggested by reviewer 1 to improve the manuscript.

      Specifically, reviewer 1 suggests making changes to p values in Figure 5,

      As a team, we have decided to keep p values. Here is our rationale:

      Our lab favors reporting p-values for all statistical comparisons to help readers identify what we consider statistically significant. We color-coded the p-values, with red for p-value < 0.05 and black for p-value > 0.05. As a reader, seeing a p-value=0.7 allows me to know that the authors performed an analysis comparing these conditions and found the mean not to be different. Not presenting the p-value makes me wonder whether the authors even analyzed those groups. We value the ability to analyze the data by seeing all p-values than not being distracted by non-significant p-values.

      and the importance of citing original scholarly works related to effects of increase in excitability of sympathetic neurons by M1 receptors, and the terminology for M currents and KCNQ currents. These changes will improve the manuscript and are strongly recommended.

      We cited original papers on that area and changed the terminology for M current. I kept KCNQ when referring to the channel protein or abundance.

      The section dealing with Aging Reduces KCNQ currents seems to contain a lot of extraneous information especially in the last part of the long paragraph and this section should be rewritten for improved clarity… and - the implications or lack thereof - of the correlation of KCNQ with AP firing rates.

      I separated the long paragraph in two. I also removed extraneous information in that section. It now reads:

      Previous work by our group and others demonstrated that cholinergic stimulation leads to a decrease in M current and increases the excitability of sympathetic motor neurons at young ages.67-71 The molecular determinants of the M current are channels formed by KCNQ2 and KCNQ3 in these neurons.70, 76, 77 Thus, Figure 6A shows a voltage response (measured in current-clamp mode) and a consecutive M current recording (measured in voltage-clamp mode) in the same neuron upon stimulation of cholinergic type 1 muscarinic receptors. It illustrates the temporal correlation between the decrease of M current with the increase in excitability and firing of APs. This strong dependence led us to hypothesize that aging decreases M current, leading to a depolarized RMP and hyperexcitability (Figure 6B). For these experiments, we measured the RMP and evoked activity using perforated patch, followed by the amplitude of M current using a whole-cell voltage clamp in the same cell. We also measured the membrane capacitance as a proxy for cell size. Interestingly, M current density was smaller by 29% in middle age (7.5 ± 0.7 pA/pF) and by 55% in old (4.8 ± 0.7 pA/pF) compared to young (10.6 ± 1.5 pA/pF) neurons (Figure 6C-D). The average capacitance was similar in young (30.8 ± 2.2 pF), middle-aged (27.4 ± 1.2 pF), and old (28.8 ± 2.3 pF) neurons (Figure 6E), suggesting that aging is not associated with changes in cell size of sympathetic motor neurons, and supporting the hypothesis that aging alters the levels of M current. Next, we tested the effect on the abundance of the channels mediating M current. Contrary to our expectation, we observed that KCNQ2 protein levels were 1.5 ± 0.1 -fold higher in old compared to young neurons (Figure 6F-G). Unfortunately, we did not find an antibody to detect consistently KCNQ3 channels. We concluded that the decrease in M current is not caused by a decrease in the abundance of KCNQ2 protein.

      B. and - the implications or lack thereof - of the correlation of KCNQ with AP firing rates.

      I am not sure to understand the request in the section on the correlation of KCNQ with AP firing rate. I divided the long paragraph.

      The apparent lack of correlation between KCNQ current and KCNQ2 protein needs to be better explained. This is a central part of the study and this result undercuts the premise of the paper.

      Indeed, total KCNQ2 protein abundance increases while M current decreases. We do not claim in our work that changes in excitability are caused by a reduction in the expression or density of KCNQ2 channels. On the contrary, our current working hypothesis is that the reduction in M current is caused by changes in traffic, degradation, posttranslational modifications, or cofactors for KCNQ2 or KCNQ3 channels. I have modified the description in the results section and discussion to clarify this concept. We also note that the discussion section contains a paragraph discussing this discrepancy.

      Additionally, the poor specificity of Linordipine for KCNQ should be pointed out in the limitations.

      Thank you for the suggestion. I have added the following sentences to the Limitations section. It reads: “We want to point out that linopirdine has been reported to affect other ionic currents besides M current (Neacsu and Babes, 2010; Lamas et al., 1997). Despite this limitation, the application of linopirdine to young sympathetic motor neurons led to depolarization and firing of action potentials.”

      Finally, the editor notes that the author response should not contain ambiguities in what was addressed in the revision. In the original summary of consolidated revisions that were requested, one clearly and separately stated point (point 4) was that experiments in slice cultures should be strongly considered to extend the significance of the work to an intact brain preparation. The author response letter seems to imply that this was done, but this is not the case. The author response seems to have combined this point with another separate point (point 3) about using KCNQ drugs, and imply that all concerns were addressed. Authors should be clear about what revisions were in fact addressed.

      We apologize for this omission. After reviewing this comment, I realized I did not respond to the Major points in the section of the Recommendations for the authors from Reviewer 3. We missed that entire section. Our previous responses addressed the Public review of Reviewer 3. When doing so, we did not separate the sentences, omitting the request to perform the experiment in slices.

      The proposed experiments will require an upward microscope coupled to an electrophysiology rig; unfortunately, we do not have the equipment to do these experiments. We agree that our findings need to be tested in intact preparations to understand how the hyperactivity of sympathetic motor neurons affects systemic responses and the function of controlling organ function. This is a crucial step to move the field forward. Our laboratory is trying to find the appropriate experimental design to address this problem. We believe we must go beyond redoing these experiments in slices.

      Reviewer #1 (Recommendations For The Authors):

      (1) The significance values greater than p < 0.05 do not add anything and distract focus from the results that are meaningful. Fig. 5 is a good example. What does p = 0.7 mean? Or p = 0.6? Does this help the reader with useful information?

      We thank Reviewer 1 for raising this question. We have attempted different versions of how we report p values, as we want to make sure to address rigor and transparency in reporting data.

      Our lab favors reporting p-values for all statistical comparisons to help readers identify what we consider statistically significant. We color-coded the p-values, with red for p-value < 0.05 and black for p-value > 0.05. As a reader, seeing a p-value=0.7 allows me to know that the authors performed an analysis comparing these conditions and found the mean not to be different. Not presenting the p-value makes me wonder whether the authors even analyzed those groups. We value the ability to analyze the data by seeing all p-values than not being distracted by non-significant p-values.

      (2) Fig. 1 is not informative and should be removed.

      Although we agree with the reviewer that this figure is not informative, it was created to guide the reader in identifying the problem addressed in our manuscript in the physiological context. Our colleagues who read the first drafts of the manuscript recommended this, so we prefer to keep the figure.

      (3) The emphasis on a particular muscarinic agonist favored by many ion channel physiologists, oxotremorine, is not meaningful (lines 192, 198). The important point is stimulation of muscarinic AChRs, which physiologically are stimulated by acetylcholine. The particular muscarinic agonist used is unimportant. Unless mandated by eLife, "cholinergic type 1 muscarinic receptors" are usually referred to as M1 mAChRs, or even better is "Gq-coupled M1 mAChRs." I don't think that Kruse and Whitten, 2021 were the first to demonstrate the increase in excitability of sympathetic neurons from stimulation of M1 mAChRs. Please try and cite in a more scholarly fashion.

      A) We have modified lines 192 and 198, removing the mention of oxotremorine.

      B) We have modified the nomenclature used to refer to cholinergic type 1 muscarinic receptors.

      C) We cited references on the role of M current on sympathetic motor neuron excitability.

      (4) The authors may want to use the term "M current" (after defining it) as the current produced by KCNQ2&3-containing channels in sympathetic neurons, and reserve "KCNQ" or "Kv7" currents as those made by cloned KCNQ/Kv7 channels in heterologous systems. A reason for this is to exclude currents KCNQ1-containing channels, which most definitely do not contribute to the "KCNQ" current in these cells. I am not mandating this, but rather suggesting it to conform with the literature.

      Thank you for the suggestion. I have modified the text to use the term M current. I maintained the use of KCNQ only when referring to KCNQ channel, such as in the section describing the abundance of KCNQ2.

      (5) The section in the text on "Aging reduces KCNQ current" is confusing. Can the authors describe their results and their interpretation more directly?

      (6) Please explain the meaning of the increase in KCNQ2 abundance with age in Fig. 6G. How is this increase in KCNQ2 expression consistent with an increase in excitability? The explanation of "The decrease in KCNQ current and the increase in the abundance of KCNQ2 protein suggest a potential compensatory mechanism that occurs during aging, which we are actively investigating in an independent study." is rather odd, considering that the entire thesis of this paper is that changes in excitability and firing properties are underlied by changes in KCNQ2/3 channel expression/density. Suddenly, is this not the case?? What about KCNQ3? It would be very enlightening if the authors would just quantify the ratio of KCNQ2:KCNQ3 subunits in M-type channels in young and old mice using simple TEA dose/response curves (see Shapiro et al., JNS, 2000; Selyanko et al., J. Physiol., Hadley et al., Br. J. Pharm., 2001 and a great many more). It is also surprising that the authors did not assess or probe for differences in mAChR-induced suppression of M current between SCG neurons of young and old mice. This would seem to be a fundamental experiment in this line of inquiry.

      We have divided this paragraph in sections.

      A. Please explain the meaning of the increase in KCNQ2 abundance with age in Fig. 6G. How is this increase in KCNQ2 expression consistent with an increase in excitability? The explanation of "The decrease in KCNQ current and the increase in the abundance of KCNQ2 protein suggest a potential compensatory mechanism that occurs during aging, which we are actively investigating in an independent study." is rather odd, considering that the entire thesis of this paper is that changes in excitability and firing properties are underlied by changes in KCNQ2/3 channel expression/density. Suddenly, is this not the case??

      Our interpretation is that the decrease in M current is not caused by a decrease in the abundance of KCNQ (2) channels. We do not claim that changes in excitability are caused by a reduction in the expression or density of KCNQ2 channels. On the contrary, our working hypothesis is that the reduction in M current is caused by changes in traffic, degradation, posttranslational modifications, or cofactors for KCNQ2 or KCNQ3 channels. We have modified the description in the results section to clarify this concept. “We concluded that the decrease in M current is not caused by a decrease in the abundance of KCNQ2 protein.”

      B. What about KCNQ3?

      Unfortunately, we did not find an antibody to detect KCNQ3 channels. I have added a sentence to state this.

      C. KCNQ2: KCNQ3 subunits in M-type channels in young and old mice using simple TEA dose/response curves.

      Our laboratory is working to deeply understand the mechanism behind the changes in M current and its regulation by mAChRs in young and old ages. However, it is part of different research to attend to the complexity of the question. We think pharmacology experiments are insufficient to understand the question's complexity as we described in the next answer.

      D. It is also surprising that the authors did not assess or probe for differences in mAChR-induced suppression of M current between SCG neurons of young and old mice. This would seem to be a fundamental experiment in this line of inquiry.

      As mentioned, our laboratory is working to understand the mechanism behind M current and its regulation in young and old ages deeply. Our preliminary data show that M currents recorded in old neurons show two behaviors with the activation of mAChR: 1) they do not respond (blue line), or 2) they show a smaller and slower current inhibition than young neurons (red line). This data shows the complexity of the mechanism behind the M current in old neurons where changes in basal levels of PIP2, phospholipids metabolism, KCNQ2/3 changes in traffic/degradation, and M current pharmacology need to be addressed together for a proper interpretation. Showing only one part of this set of experiments in this article may lead to misinterpretation of results.

      Author response image 1.

      (7) Why do the authors use linopirdine instead of XE-991? Both are dirty drugs hardly specific to KCNQ channels at 25 uM concentrations, but linopirdine less so. The Methods section lists the source of XE991 used in the study, not linopirdine. Is there an error?

      A. Why do the authors use linopirdine instead of XE-991?

      We use linopiridine with the experimental goal of observing the recovery phase during the washout. The main difference between the effects of XE991 and linopiridine on Kv7.2/3 is associated with the recovery phase. Currents under XE991 treatment recover 30% after 10 min compared to 93.4% with linopiridine in expression systems at -30 mV (Greene DL et al., 2017, J Pharmacol Exp Ther). After validation of KCNQ2/3 inhibition by linopirdine (IC50 value of 2.4 µM), we found linopirdine the most appropriate drug for our experiments.

      Unfortunately, we were not able to observe a recovery in our experiments. The limited recovery after washout may be associated with the membrane potential of our conditions (-60 to -50 mV).

      B. Both are dirty drugs hardly specific to KCNQ channels at 25 uM concentrations, but linopirdine less so.

      We understand the concern of the reviewer. The specificity of XE-991 and linopiridine is not absolute. Linopiridine has been reported to activate TRPV1 channels (EC50 =115 µM, Neacsu and Babes, 2010, J Pharmacol Sci) or nicotinic acetylcholine receptors and GABA-induced Cl- currents (EC50 =7.6 µM and 8.1 µM respectively; Lamas et al, 1997, Eur J Neurosci).

      To clarify this limitation in the article, we have added the following sentence in the section Limitations and Conclusions. “We want to point out that linopirdine has been reported to affect other ionic currents besides M current (Neacsu and Babes, 2010; Lamas et al., 1997). Despite this limitation, the application of linopirdine to young sympathetic motor neurons led to depolarization and firing of action potentials.”

      C. The Methods section lists the source of XE991 used in the study, not linopirdine. Is there an error?

      Thank you for pointing out this. We have added information for both retigabine and linopirdine in the Methods section; both were missing.

      (8) Can the authors use a more scientific explanation of RTG action than "activating KCNQ channels?" For instance, RTG induces both a negative-shift in the voltage-dependance of activation and a voltage-independent increase in the open probability, both of which differing in detail between KCNQ2 and KCNQ3 subunits. The authors are free to use these exact words. Thus, the degree of "activation" is very dependent upon voltage at any voltages negative to the saturating voltages for channel activation.

      We have modified the text to reflect your suggestion. Thank you.

      (9) Methods: did the authors really use "poly-l-lysine-coated coverslips?" Almost all investigators use poly-D-lysine as a coating for mammalian tissue-culture cells and more substantial coatings such as poly-D-lysine + laminin or rat-tail collagen for peripheral neurons, to allow firm attachment to the coverslip.

      That is correct. We used poly-L-lysine-coated coverslips. Sympathetic motor neurons do not adhere to poly-D-Lysine.

      (10) As a suggestion, sampling M-type/KCNQ/Kv7 current at 2 kHz is not advised, as this is far faster than the gating kinetics of the channels. Were the signals filtered?

      Signals were not filtered. Currents were sampled at 2KHz. Our conditions are not far from what is reported by others. Some sample at 10KHz and even 50 KHz. Others do not report the sample frequency.

      Reviewer #2:

      Weaknesses:

      None, the revised version of the manuscript has addressed all my concerns.

      We are very appreciative and glad that our responses satisfied your previous concerns.

      Reviewer #3:

      The main weakness is that this study is a descriptive tabulation of changes in the electrophysiology of neurons in culture, and the effects shown are correlative rather than establishing causality.

      In the previous revision, Reviewer 3 wrote: “It is difficult to know from the data presented whether the changes in KCNQ channels are in fact directly responsible for the observed changes in membrane excitability.” And suggested the “use of blockers and activators to provide greater relevance.”

      Attending this recommendation, we performed experiments in Fig. 8. Young neurons exposed to linopirdine depolarize membrane potential and promote action potential firing. In contrast, the old neurons treated with retigabine repolarize membrane potential and stop firing action potentials. This new set of experiments suggests age-related electrophysiological changes in old neurons are associated with changes in M current. The main finding of our article.

      If Reviewer 3 refers to establishing causality between aging and a reduction in M current, I would like to emphasize that our laboratory is working toward a better understanding of the molecular mechanism of how M current is affected by aging; however, it will be part of a different article.  One of our attempts was to reverse aging with rapamycin, but the previous recommendation was to remove those experiments.

      … but the specifics of the effects and relevance to intact preparations are unclear.

      Additional experiments in slice cultures would provide greater significance on the potential relevance of the findings for intact preparations.

      I apologize for missing this point in the previous revision. The proposed experiments will require an upward microscope coupled to an electrophysiology rig. Unfortunately, I do not

      have the equipment to do these experiments.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1:

      Section 4.3 ("expert baseline model"): the authors need to explain how the probabilities defined as baselines were exactly used to predict individual patient susceptible profiles.

      We have added a more detailed and mathematically formal explanation of the “simulated expert’s best guess” in Section 4.3.

      This section now reads:

      “More formally, considering all training spectra as Strain, all training labels corresponding to one drug j and species t are gathered:

      The "simulated expert's best guess" predicted probability for any spectrum si and drug dj, then, corresponds to, the fraction of positive labels in their corresponding training label set :

      Authors should explain in more detail how a ROC curve is generated from a single spectrum (i.e., per patient) and then average across spectra. I have an idea of how it's done but I am not completely sure.

      We have added a more detailed explanation in Section 3.2. It reads:

      To compute the (per-patient average) ROC-AUC, for any spectrum/patient, all observed drug resistance labels and their corresponding predictions are gathered. Then, the patient-specific ROC-AUC is computed on that subset of labels and predictions. Finally, all ROC-AUCs per patient are averaged to a "spectrum-macro" ROC-AUC.

      In addition, our description under Supplementary Figure 8 (showing the ROC curve) provides additional clarification:

      Note that this ROC curve is not a traditional ROC curve constructed from one single label set and one corresponding prediction set. Rather, it is constructed from spectrum-macro metrics as follows: for any possible threshold value, binarize all predictions. Then, for every spectrum/patient independently, compute the sensitivity and specificity for the subset of labels corresponding to that spectrum/patient. Finally, those sensititivies and specificities are averaged across patients to obtain one point on above ROC curve.

      Section 3.2 & reply # 1: can the authors compute and apply the Youden cutoff that gives max precision-sensitivity for each ROC curve? In that way the authors could report those values.

      We have computed this cut-off on the curve shown in Supplementary Figure 8. The Figure now shows the sensitivity and specificity at the Youden cutoff in addition to the ROC. We have chosen only to report these values for this model as we did not want to inflate our manuscript with additional metrics (especially since the ROC-AUC already captures sensitivities and specificities). We do, however, see the value of adding this once, so that biologists have an indication of what kind of values to expect for these metrics.

      Related to reply #5: assuming that different classifiers are trained in the same data, with the same number of replicates, could authors use the DeLong test compare ROC curves? If not, please explain why.

      We thank the reviewer for bringing our attention to the DeLong’s test. It does indeed seem true that this test is appropriate for comparing two ROC-AUCs using the same ground truth values.

      We have chosen not to use this test for one conceptual and one practical reason:

      (1) Our point still stands that in machine learning one chooses the test set, and hence one can artificially increase statistical power by simply allocating a larger fraction of the data to test.

      (2) DeLong’s test is defined for single AUCs (i.e. to compare two lists of predictions against one list of ground truths), but here we report the spectrum/patient-macro ROC-AUC. It is not clear how to adjust the test to macro-evaluated AUCs. One option may be to apply the test per patient ROC curve, and perform multiple testing correction, but then we are not comparing models, but models per patient. In addition, the number of labels/predictions per patient is prohibitively small for statistical power.

      Reviewer #2 (Recommendations For The Authors):

      After revision, all issues were been resolved.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Response to reviewer’s comments

      Reviewer #2 (Public Review):

      Summary: 

      The manuscript focuses on comparison of two PLP-dependent enzyme classes that perform amino acyl decarboxylations. The goal of the work is to understand the substrate specificity and factors that influence catalytic rate in an enzyme linked to theanine production in tea plants.

      Strengths: 

      The work includes x-ray crystal structures of modest resolution of the enzymes of interest. These structures provide the basis for design of mutagenesis experiments to test hypotheses about substrate specificity and the factors that control catalytic rate. These ideas are tested via mutagenesis and activity assays, in some cases both in vitro and in plants. 

      Weaknesses:

      Although improved in a revision, the manuscript could be more clear in explaining the contents of the x-ray structures and how the complexes studied relate to the reactant and product complexes. The manuscript could also be more concise, with a discussion section that is largely redundant with the results and lacking in providing scholarly context from the literature to help the reader understand how the current findings fit in with work to characterize other PLP-dependent enzymes or protein engineering efforts. Some of the figures lack sufficient clarity and description. Some of the claims about the health benefits of tea are not well supported by literature citations.

      Thank you for your insightful comments on our manuscript and your recognition of the strengths of our study. We understand your concerns about the weaknesses mentioned, and we have addressed them appropriately in the revised manuscript. We acknowledge that the discussion section needs to be improved for conciseness and context. We have revised this part by removing the redundant content. We also acknowledge your comments concerning the clarity and description of some figures. We have revisited these figures and revised them, ensuring they are clear and adequately described. Lastly, concerning the claims about the health benefits of tea, we understand your concern about the lack of supporting citations. We ensure to back such claims with valid literature or, if necessary, omit these statements.

      Reviewer #2 (Recommendations For The Authors):

      (1) Line 21: Alanine Decarboxylase should not be capitalized.

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (2) Line 31: Grammatical error. Also not clear what "evolution analysis" means here. Revise to "Structural comparisons led us to..."

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (3) Line 34: Revise to "Combining a double mutant of CsAlaDC"

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (4) Line 35: Change word order to "increased theanine production 672%"

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (5) Line 37: meaning unclear. Revise to "provides a route to more efficient biosynthesis of theanine."

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (6) Line 44: I'm not sure that the "health effects" of tea have been proven in placebo controlled studies. And the references provided (2-4 and 5) do not describe original research articles supporting these claims. I would suggest removing these statements from the introduction and at later points in the manuscript.

      Thank you for your thoughtful feedback and suggestions. Based on your suggestion, we have removed these statements: "The popularity of tea is determined by its favorable flavor and numerous health benefits (2-4). The flavor and health-beneficial effects of tea are conferred by the abundant secondary metabolites, including catechins, caffeine, theanine, volatiles, etc (5). " As for the subsequent statement: " It has also many health-promoting functions, including neuroprotective effects, enhancement of immune functions, and potential anti-obesity capabilities, among others. " the referenced literature cited can substantiate this conclusion.

      (7) Line 58: insert "the" between provided and basis

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (8) Line 100: Not clear what this phrase means, "As expected, CsSerDC was closer to AtSerDC" Please clarify - closer to what?

      We apologize for any confusion caused by the unclear phrasing. When referring to "CsSerDC was closer to AtSerDC," we intended to convey that CsSerDC exhibits a higher degree of sequence homology with AtSerDC than it does with the other enzymes evaluated in our investigation. However, a 1.29% difference between 86.21% and 84.92% in amino acid similarity is not statistically significant (Figure 1B and Supplementary table 1 in the original manuscript), we have deleted the relevant descriptions in the revised manuscript.

      (9) Line 112: "were constructed into" makes no sense. It would be better to say the genes for the proteins of interest were inserted into the overexpression plasmid.

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (10) Line 115: missing the word "the" between generated and recombinant

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (11) Line 121: catalyze not catalyzed

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (12) Lines 129 and 130: The reported Km values are really large - in the mM range. Do these values make sense in terms of the available concentrations of the substrates inside the cell?

      The content of alanine in tea plant roots ranges from 0.28 to 4.18 mg/g DW (Yu et al., 2021; Cheng et al., 2017). Correspondingly, the physiological concentration of alanine is 3.14 mM to 46.92 mM, in tea plant roots. The content of serine in plants ranges from 0.014 to 17.6 mg/g DW (Kumar et al., 2017). Correspondingly, the physiological concentration of serine is 0.13 mM to 167.48 mM in plants. Therefore, in this study, the Km values are within the range of available substrate concentrations inside the cell.

      Yu, Y. et al. (2021) Glutamine synthetases play a vital role in high accumulation of theanine in tender shoots of albino tea germplasm "Huabai 1". J. Agric. Food Chem. 69 (46),13904-13915.

      Cheng, S. et al. (2017) Studies on the biochemical formation pathway of the amino acid L-theanine in tea (Camellia sinensis) and other plants. J. Agric. Food Chem. 65 (33), 7210-7216.

      Kumar, V. et al. (2017) Differential distribution of amino acids in plants. Amino Acids. 49(5), 821-869.

      (13) Line 211: it is unclear what the phrase "as opposed to wild-type" means. Please clarify.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We intend to communicate that the wild-type CsAlaDC and AtSerDC demonstrate decarboxylase activity, while the mutated proteins have experienced a loss of decarboxylation activity. We have already modified this concern in the revised version of the manuscript.

      (14) Line 222: residues not residue

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (15) Line 227 and Figure 4B: It is not clear what the different sequence logos mean in this part of the figure. The caption is too brief and not helpful. And the sentences describing this figure panel are also not sufficiently clear.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have provided a more detailed explanation of this section in the revised manuscript and added additional annotations in the figure caption to provide further clarity.

      (16) Lines 233 and 234: "in the substrate specificity" is awkwardly worded. I would revise to "in selective binding of the appropriate substrate."

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have meticulously revised the description of this section.

      (17) Line 243: a word is missing in this sentence - but I can't figure out the intended meaning or what the missing word is. Rephrase to improve clarity.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have revised this sentence to: " These findings indicate the essential role of Phe106 in the selective binding of alanine for CsAlaDC. "

      (18) Line 255: The "expression system...was carried out" is not correct. I would say the expression system was used - but you probably also want to rearrange the sentences to more directly say what it was used for. Later, the word "the" is also missing.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have revised this sentence to: "To further verify that Phe106 of CsAlaDC and Tyr111 of AtSerDC were key amino acid residues determining its substrate recognition in planta, we employed the Nicotiana benthamiana transient expression system. "

      (19) Line 273: use "understand" instead of "elucidate" and instead of "we proposed a prediction test:" say "we designed a test of the prediction that..."

      Thank you very much for your careful reading of the manuscript. We have revised this sentence to: “In light of this observation, we postulated a hypothesis:”

      (20) Line 301: I don't think "effectuate" is a word. Replace with something else.

      Thank you very much for your careful reading of the manuscript. We have revised the sentence as: " The biosynthetic pathway of theanine in tea plants comprises two consecutive enzymatic steps: alanine decarboxylase facilitates the decarboxylation of alanine to generate EA, while theanine synthetase catalyzes the condensation reaction between EA and Glu to synthesize theanine. "

      (21) Line 307: replace "activity" with "ability"

      Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (22) Line 322: I didn't find the discussion very useful. Much of it is simply a recap of the results - which is not necessary. The structural comparisons are overly descriptive without providing appropriate rationale or topic sentence structure so that the reader understands why certain details are emphasized. I think the manuscript would be much stronger if this section were not included or integreted more concisely into the results section where appropriate.

      Thank you for your constructive comments. We understand your concerns about the discussion section of our manuscript. We acknowledge that the discussion section has redundancies with the result. In response to this, we have revised this section to eliminate unnecessary repetition of the results.

      (23) Line 369: "an amino acid devoid of the hydroxyl moiety present in Lys" - what does this mean? Lys does not have a hydroxyl functional group. Please correct so that the sentence makes sense.

      Thank you very much for your careful reading of the manuscript. This sentence states that the amino acid occupying the corresponding position in CsAlaDC is Phe, which lacks one hydroxyl functional group as compared to Lys. We have made modifications to the sentence as follows: "In contrast, the equivalent position in CsAlaDC is occupied by Phe, an amino acid lacking the hydroxyl group. This substitution enhances the hydrophobic nature of the substrate-binding pocket. "

      (24) Line 370: "This structural nuance portends a predisposition for CsAlaDC to select the comparatively hydrophobic amino acid alanine as its suitable substrate." This sentence also makes no sense - please revise to use simpler language so the meaning is more clear.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have revised the sentence as follows: " Consequently, CsAlaDC demonstrates a unique predilection, selectively binding Ala (an amino acid with comparatively hydrophobic properties) as its preferred substrate."

      (25) Lines 376-384: This section makes several references to "catalytic rings." I have no idea what this term means? If the authors mean a loop structure in the enzyme - please use the term "loop"

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have corrected it in the revised manuscript.

      (26) Line 396-397: The authors reference data that is not shown in the manuscript. Either show the data in the results section or do not mention.

      Thank you for your insightful comment regarding the unshown data referenced in the manuscript. We have included Supplementary figure 9 in the revised manuscript to display this data.

      (27) Line 445-446: what is "mutation technology" - if the authors mean site-directed mutagenesis - please use the simpler and more recognizable terminology.

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have revised the sentence as follows: "Based on the findings of this study, site-directed mutagenesis can be employed to modify enzymes involved in theanine synthesis. This modification enhances the capacity of bacteria, yeast, model plants, and other organisms to synthesize theanine, thereby facilitating its application in industrial theanine production."

      Reviewer #3 (Public Review):

      In the manuscript titled "Structure and Evolution of Alanine/Serine Decarboxylases and the Engineering of Theanine Production," Wang et al. solved and compared the crystal structures of Alanine Decarboxylase (AlaDC) from Camellia sinensis and Serine Decarboxylase (SerDC) from Arabidopsis thaliana. Based on this structural information, the authors conducted both in vitro and in vivo functional studies to compare enzyme activities using site-directed mutagenesis and subsequent evolutionary analyses. This research has the potential to enhance our understanding of amino acid decarboxylase evolution and the biosynthetic pathway of the plant specialized metabolite theanine, as well as to further its potential applications in the tea industry.

      Thank you very much for taking the time to review this manuscript. We appreciate all your insightful comments.

      Reviewer #3 (Recommendations For The Authors):

      The additional material added by the authors addresses some of the previously raised questions and enhances the manuscript's quality. However, certain critical issues we pointed out earlier remain unaddressed. Some of the new data also raises new questions. To provide readers with more comprehensive data, the authors should include additional quantitative data and convert the data presented in the reviewer's comments into supplemental figure format.

      Thank you for acknowledging the improvements in the revised manuscript and providing further valuable feedback. We understand your concern about the critical issues that have not been fully addressed and the new questions raised by some of the newly added data. We have strived to address these issues with additional analysis and clarification in our subsequent revision. Regarding your suggestion for more quantitative data and converting the data mentioned in the reviewer's comments into a supplemental figure format, we agree that this would provide a more comprehensive view of the results. We have reformatted the relevant data into supplemental figures to enhance the clarity and accessibility of information. We are grateful for the time and effort you have dedicated to improving our manuscript.

      * Page 5 & Figure 1B

      "As expected, CsSerDC was most closed to AtSerDC, which implies that they shared similar functions. However, CsAlaDC is relatively distant from CsSerDC."

      : In Figure 1B, CsSerDC and AtSerDC are in different clades, and this figure does not show that the two enzymes are closest. To provide another quantitative comparison, please provide a matrix table showing amino acid sequence similarities as a supplemental table. 

      Comment: I don't believe that a 1.29% difference between 86.21% and 84.92% in amino acid similarity is statistically significant. Although the authors have rephrased the original sentence, it's improbable that this small 1.29% difference can explain the observed distinction.

      Many thanks. We have carefully considered your comments. Indeed, the 1.29% difference in amino acid similarity cannot reflect the functional difference between the AlaDC and SerDC proteins. We have deleted the relevant descriptions in the revised manuscript.

      * Page 6, Figure 2, Page 23 (Methods)

      "The supernatants were purified with a Ni-Agarose resin column followed by size-exclusion chromatography."

      : What kind of SEC column did the authors use? Can the authors provide the SEC elution profile comparison results and size standard curve?

      Comment: The authors should include the SEC elution profiles as a supplemental figure or incorporate them as a panel in Figure 2. Furthermore, they should provide a description of the oligomeric state of each protein in this experiment. Additionally, there is a significant difference between CsSerDC (65.38 mL) and CsAlaDC (74.37 mL) elution volumes. Can this difference be explained structurally? In comparison to the standard curve of molecular weight provided by the authors, it appears that these proteins are at least homo-tetramers, which contradicts the description in the text. This should be re-evaluated and clarified.  

      Thank you very much for your careful reading of the manuscript and valuable suggestions. We have included the SEC elution profile in Supplemental figure 1A and added descriptions of the oligomeric states of proteins in the revised manuscript. CsSerDC was eluted at 65.38 mL, corresponding to a molecular weight of 292 kDa, which is five times the monomeric protein (54.7 kDa). However, due to the absence of CsSerDC crystal structure, it remains uncertain whether the protein forms a pentamer. AtSerDC was eluted at 72.25 mL, with a corresponding molecular weight of 155 kDa, which is 3.3 times the monomer (47.3 kDa). CsAlaDC was eluted at 74.37 mL, with a corresponding molecular weight of 127 kDa, which is 2.7 times the monomer (47.3 kDa). The elution profiles suggest that AtSerDC and CsAlaDC potentially exist in homotrimeric form. This observation stands in contradiction to our subsequent findings where the protein manifests in a dimeric structure. A plausible explanation could be the non-ideal spherical shape of the protein. Under such circumstances, the hydrodynamic radius of the protein could supersede its actual size, potentially leading to an overestimation of the molecular weight on the size-exclusion chromatography [ref].

      References:

      Burgess, R. R. (2018) A brief practical review of size exclusion chromatography: Rules of thumb, limitations, and troubleshooting. Protein Expression and Purification. 150, 81-85.

      Erdner J. M., et al. (2006) Size-Exclusion Chromatography Using Deuterated Mobile Phases. Journal of Chromatography A. 1129(1):41–46.

      * Page 6 & Page 24 (Methods)

      "The 100 μL reaction mixture, containing 20 mM substrate (Ala or Ser), 100 mM potassium phosphate, 0.1 mM PLP, and 0.025 mM purified enzyme, was prepared and incubated at standard conditions (45 {degree sign}C and pH 8.0 for CsAlaDC, 40 {degree sign}C and pH 8.0 for AtSerDC for 30 min)."

      (1) The enzymatic activities of CsAldDC and AtSerDC were measured at two different temperatures (45 and 40 {degree sign}C), but their activities were directly compared. Is there a reason for experimenting at different temperatures?

      (2) Enzyme activities were measured at temperatures above 40{degree sign}C, which is not a physiologically relevant temperature and may affect the stability or activity of the proteins. At the very least, the authors should provide temperature-dependent protein stability data (e.g., CD spectra analysis) or, if possible, temperature-dependent enzyme activities, to show that their experimental conditions are suitable for studying the activities of these enzymes.

      Comment: I appreciate the authors for including temperature-dependent enzyme activity data in their study. However, it remains puzzling that plant enzymes were tested at a physiologically irrelevant temperature of 40 and 45 degrees Celsius. Additionally, it may not be appropriate to directly compare enzyme activity measurements at different temperatures. Furthermore, the data at 45 degrees in panel A appears to be an outlier, which contrasts with the overall trend observed in the graph.

      We appreciate your point regarding the testing temperatures for plant enzymes. We fully appreciate the importance of conducting experiments under physiologically relevant conditions. But the intent behind operating at these elevated temperatures was to assess the thermal stability of the enzymes, which can be a valuable characteristic in certain applications, such as industrial production processes, and does not necessarily reflect their physiological conditions. Our findings indicate that CsAlaDC exhibits its peak activity at 45 °C. This result aligns with previously reported data in the literature [Bai, P. et al. (2021) figure 4e], thus bolstering our confidence in the reliability of our experimental outcomes.

      Author response image 1.

      Relative activity of CsAlaDC at different temperatures.

      * Pages 6-7 & Table 1

      (1) Use the correct notation for Km and Vmax. Also, the authors show kinetic parameters and use multiple units (e.g., mmol/L or mM for Km).

      (2) When comparing the catalytic efficiency of enzymes, kcat/Km (or Vmax/Km) is generally used. The authors present a comparison of catalytic activity from results to conclusion. A clarification of what results are being compared is needed.

      Comment: The authors are still comparing catalytic efficiency solely based on the Vmax values. As previously suggested, it would be advisable to calculate kcat/Km and employ it for comparing catalytic efficiencies. Furthermore, based on the data provided by the authors, I conducted a rough calculation of these catalytic efficiencies and did not observe a significant difference, which contrasts with the authors' statement, "These findings indicated that the catalytic efficiency of CsAlaDC is considerably lower than that of both CsSerDC and AtSerDC." This discrepancy requires clarification.  

      We want to express our sincere appreciation for your meticulous review and constructive suggestions. We understand the importance of accurately comparing catalytic efficiencies using Kcat/Km values, rather than solely relying on Vmax values. Following your suggestion, we recalculated Kcat/Km to reanalyze our results. The computed Kcat/Km for CsSerDC and AtSerDC are 152.7 s-1 M-1 and 184.6 s-1 M-1, respectively. For CsAlaDC, the calculated Kcat/Km is 55.7 s-1 M-1. Therefore, the catalytic efficiency of CsSerDC and AtSerDC is approximately three times that of CsAlaDC.  What we intended to convey was that the Vmax of CsAlaDC is lower than that of CsSerDC and AtSerDC.  Our description in the manuscript was not accurate, and we have addressed this in the revised version.

      * Pages 9 & 10

      "This result suggested this Tyr is required for the catalytic activity of CsAlaDC and AtSerDC."

      : The author's results are interesting, but it is recommended to perform the experiments in a specific order. First, experiments should determine whether mutagenesis affects the protein's stability (e.g., CD, as discussed earlier), and second, whether mutagenesis affects ligand binding (e.g., ITC, SPR, etc.), before describing how site-directed mutagenesis alters enzyme activity. In particular, the authors' hypothesis would be much more convincing if they could show that the ligand binding affinity is similar between WT and mutants.

      Comments: While it is appreciated that you have included CD and UV-vis absorption spectra data, it would be more beneficial to provide quantitative data to address the previously proposed binding affinity. I also recommend presenting the data mentioned in the reviewer's comments as a supplementary figure for better clarity and reference.  

      Thank you for your valuable feedback and suggestions. I agree that providing quantitative data would lend more support to our findings and better address the proposed binding affinity.

      It is generally acknowledged that proteins complexed with PLP exhibit a yellow hue, and the ligand PLP forms a Schiff base structure with the ε-amino group of a lysine residue in the protein, with maximum absorbance around 420 nm. However, during our protein purification process, we observed that the purified protein retained its yellow coloration, even when PLP wasn't introduced into the purification buffer. Subsequent absorbance measurements revealed that the protein exhibited absorbance within the aforementioned wavelength (420 nm) (the experimental results are shown in the following figures), implying an inherent presence of the PLP ligand within the protein. This could have resulted from binding with PLP during the protein's expression in E. coli. Consequently, due to this inseparability between the protein and the ligand, obtaining quantitative data through experimental means becomes unfeasible.

      Author response image 2.

      (A) Absorption Spectra of CsAlaDC (WT) and CsAlaDC (Y336F). (B) Absorption Spectra of AtSerDC (WT) and AtSerDC (Y341F).

      Regarding your suggestion about presenting the data mentioned in the reviewer's comments as a supplementary figure, we agree that it is an excellent idea. We have prepared supplementary figure 7 and supplementary figure 8 accordingly, ensuring that they present the required data.

      * Page 10

      "The results showed that 5 mM L-DTT reduced the relative activity of CsAlaDC and AtSerDC to 22.0% and 35.2%, respectively"

      : The authors primarily use relative activity to compare WT and mutants. Can the authors specify the exact experiments, units, and experimental conditions? Is it Vmax or catalytic efficiency? If so, under what specific experimental conditions?

      Response: "However, due to the unknown mechanism of DTT inhibition on protein activity, we have removed this part of the content in the revised manuscript."

      Comment: I believe this requires a more comprehensive explanation rather than simply removing it from the text.  

      Although we have observed that DTT is capable of inhibiting enzyme activity, at present, we are unable to offer a comprehensive explanation for the inhibitory effect of DTT on enzyme activity in terms of its structural and catalytic mechanisms. Further research is required to elucidate the mechanism of action of DTT. It is worth noting, however, that our study does not emphasize investigating the specific inhibitory mechanisms of DTT on enzyme activity. Furthermore, the existing findings do not provide an adequate explanation for the observed phenomenon, leading us to exclude this particular aspect from the content.

      * Pages 10-12

      : The identification of 'Phe106 in CsAlaDC' and 'Tyr111 in AtSerDC,' along with the subsequent mutagenesis and enzymatic activity assays, is intriguing. However, the current manuscript lacks an explanation and discussion of the underlying reasons for these results. As previously mentioned, it would be helpful to gain insights and analysis from WT-ligand and mutant-ligand binding studies (e.g., ITC, SPR, etc.). Furthermore, the authors' analysis would be more convincing with accompanying structural analysis, such as steric hindrance analysis.

      Comment: While it is appreciated that you have included UV-vis absorption spectra data, it would be more beneficial to provide quantitative data to address the previously proposed binding affinity. I also recommend presenting the data mentioned in the reviewer's comments as a supplementary figure for better clarity and reference.  

      Response: Thank you for your valuable feedback and suggestions. Given that the protein forms a complex with PLP during its expression in E. coli and cannot be dissociated from it, obtaining quantitative data via experimental protocols is rendered impracticable.

      Author response image 3.

      (A) Absorption Spectra of CsAlaDC (WT) and CsAlaDC (F106Y). (B) Absorption Spectra of AtSerDC (WT) and AtSerDC (Y111F).

      Mutant proteins and wild-type proteins exhibited absorption bands at 420 nm, suggesting the formation of a Schiff base between PLP and the active-site lysine residue.

      Regarding your suggestion about presenting the data mentioned in the reviewer's comments as a supplementary figure, we have prepared supplementary figure 7 and supplementary figure 8 accordingly, ensuring that they present the required data.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      (1) It is a nice study but lacks some functional data required to determine how useful these alleles will be in practice, especially in comparison with the figure line that stimulated their creation.

      We are grateful for this comment. For the usefulness of these alleles, Figure 3 shows that specific and efficient genetic manipulation of one cell subpopulation can be achieved by mating across the DreER mouse strain to the rox-Cre mouse strain. In addition, Figure 7 shows that R26-loxCre-tdT can effectively ensure Cre-loxP recombination on some gene alleles and for genetic manipulation. The expression of the tdT protein is aligned with the expression of the Cre protein (Alb roxCre-tdT and R26-loxCre-tdT, Figure 2 and Figure 5), which ensures the accuracy of the tracing experiments. We believe more functional data can be shown in future articles that use mice lines mentioned in this manuscript.

      (2) The data in Figure 5 show strong activity at the Confetti locus, but the design of the newly reported R26-loxCre line lacks a WPRE sequence that was included in the iSure-Cre line to drive very robust protein expression.

      Thank you for bringing up this point in the manuscript. In the R26-loxCre-tdT mice knock-in strategy, the WPRE sequence is added behind the loxCre-P2A-tdT sequence, as shown in Supplementary Figure 9.

      (3) The most valuable experiment for such a new tool would be a head-to-head comparison with iSure (or the latest iSure version from the Benedito lab) using the same CreER and target foxed allele. At the very least a comparison of Cre protein expression between the two lines using identical CreER activators is needed.

      Thank you for your valuable and insightful comment. The comparison results of R26-loxCre-tdT with iSuRe-Cre using Alb-CreER and targeting R26-Confetti can be found in Figure 6, according to the reviewer’s suggestion.

      (4) Why did the authors not use the same driver to compare mCre 1, 4, 7, and 10? The study in Figure 2 uses Alb-roxCre for 1 and 7 and Cdh5-roxCre for 4 and 10, with clearly different levels of activity driven by the two alleles in vivo. Thus whether mCre1 is really better than mCre4 or 10 is not clear.

      Response: or two mCre versions that work efficiently. For example, if Alb-mCre1 was competitive with Cdh5-mCre10, we can use them for targeting genes in different cell types, broadening the potential utility of these mice.

      (5) Technical details are lacking. The authors provide little specific information regarding the precise way that the new alleles were generated, i.e. exactly what nucleotide sites were used and what the sequence of the introduced transgenes is. Such valuable information must be gleaned from schematic diagrams that are insufficient to fully explain the approach.

      Response: We appreciate your thoughtful suggestions. The schematic figures, along with the nucleotide sequences for the generation of mice, can be found in the revised Supplementary Figure 9.

      Reviewer #2 (Public Review):

      (1) The scenario where the lines would demonstrate their full potential compared to existing models has not been tested.

      Thank you for your thoughtful and constructive comment. The comparative analysis of R26-loxCre-tdT with iSuRe-Cre, employing Alb-CreER to target R26-Confetti, is provided in Figure 6.

      (2) The challenge lies in performing such experiments, as low doses of tamoxifen needed for inducing mosaic gene deletion may not be sufficient to efficiently recombine multiple alleles in individual cells while at the same time accurately reporting gene deletion. Therefore, a demonstration of the efficient deletion of multiple floxed alleles in a mosaic fashion would be a valuable addition.

      Thank you for your constructive comments. Mosaic analysis using sparse labeling and efficient gene deletion would be our future direction using roxCre and loxCre strategies.

      3) When combined with the confetti line, the reporter cassette will continue flipping, potentially leading to misleading lineage tracing results.

      Thank you for your professional comments. Indeed, the confetti used in this study can continue flipping, which would lead to potentially misleading lineage tracing results. Our use of R26-Confetti is to demonstrate the robustness of mCre for recombination. Some multiple-color mice lines that don’t flip have been published, for example, R26-Confetti2(PMID: 30778223) and Rainbow (PMID: 32794408). These reporters could be used for tracing Cre-expressing cells, without concerns of flipping of reporter cassettes.

      (4) Constitutive expression of Cre is also associated with toxicity, as discussed by the authors in the introduction.

      Thank you for your professional comments. The toxicity of constitutive expression of Cre and the toxicity associated with tamoxifen treatment in CreER mice line (PMID: 37692772) are known to the field. This study can’t solve the toxicity of the constitutive expression of Cre in this work. Many mouse lines with constitutive Cre driven by different promoters are present across various fields, representing similar toxicity. To solve this issue, it would be possible to construct a new strategy that enables the removal of Cre after its expression.

      Reviewer #3 (Public Review):

      (1) Although leakiness is rather minor according to the original publication and the senior author of the study wrote in a review a few years ago that there is no leakiness(https://doi.org/10.1016/j.jbc.2021.100509).

      Thank you so much for your careful check. In this review (PMID: 33676891), the writer’s comments on iSuRe-Cre are on the reader's side, and all summary words are based on the original published paper (PMID: 31118412). Currently, we have tested iSuRe-Cre in our hands. We did detect some leakiness in the heart and muscle, but hardly in other tissues as shown in Author response image 1.

      Author response image 1.

      Leakiness in Alb CreER;iSuRe-Cre mouse line. Pictures are representative results for 5 mice. Scale bars, white 100 µm.

      (2) I would have preferred to see a study, which uses the wonderful new tools to address a major biological question, rather than a primarily technical report, which describes the ongoing efforts to further improve Cre and Dre recombinase-mediated recombination.

      Response: We gratefully appreciate your valuable comment. The roxCre and loxCre mice mentioned in this study provide more effective methods for inducible genetic manipulation in studying gene function. We hope that the application of our new genetic tools could help address some major biological questions in different biomedical fields in the future.

      (3) Very high levels of Cre expression may cause toxic effects as previously reported for the hearts of Myh6-Cre mice. Thus, it seems sensible to test for unspecific toxic effects, which may be done by bulk RNA-seq analysis, cell viability, and cell proliferation assays. It should also be analyzed whether the combination of R26-roxCre-tdT with the Tnni3-Dre allele causes cardiac dysfunction, although such dysfunctions should be apparent from potential changes in gene expression.

      We are sorry that we mistakenly spelled R26-loxCre-tdT into R26-roxCre-tdT in our manuscript. We have not generated the R26-roxCre-tdT mouse line. We also thank the reviewer for concerns about the toxicity of high Cre expression. The toxicity of constitutive expression of Cre and the toxicity of tamoxifen treatment of CreER mice line (PMID: 37692772) are known to the field. This study can’t solve the toxicity of the constitutive expression of Cre in this work. Many mouse lines with constitutive Cre driven by different promoters are present across various fields, representing similar toxicity. To solve this issue, it would be possible to construct a new strategy that enables the removal of Cre after its expression.

      (4) Is there any leakiness when the inducible DreER allele is introduced but no tamoxifen treatment is applied? This should be documented. The same also applies to loxCre mice.

      In this study, we come up with new mice tool lines, including Alb roxCre1-tdT, Cdh5 roxCre4-tdT, Alb roxCre7-GFP, Cdh5 roxCre10-GFP and R26-loxCre-tdT. As the data shown in Supplementary Figure 1, Supplementary Figure 2, and Figure 4D, Alb roxCre1-tdT, Cdh5 roxCre4-tdT, Alb roxCre7-GFP, Cdh5 roxCre10-GFP and R26-loxCre-tdT are not leaky. Therefore, if there is any leakiness driven by the inducible DreER or CreER allele, the leakiness is derived from the DreER or CreER. Additional pertinent experimental data can be referenced in Figure S4C, Figure S7A-B, and Figure S8A.

      (5) It would be very helpful to include a dose-response curve for determining the minimum dosage required in Alb-CreER; R26-loxCre-tdT; Ctnnb1flox/flox mice for efficient recombination.

      Thank you for your suggestion. We value your feedback and have incorporated your suggestion to strengthen our study. Relevant experimental data can be referenced in Figure S8E-G.

      (6) In the liver panel of Figure 4F, tdT signals do not seem to colocalize with the VE-cad signals, which is odd. Is there any compelling explanation?

      The staining in Figure 4F in the revision is intended to deliver optimized and high-resolution images.

      (7) The authors claim that "virtually all tdT+ endothelial cells simultaneously expressed YFP/mCFP" (right panel of Figure 5D). Well, it seems that the abundance of tdT is much lower compared to YFP/mCFP. If the recombination of R26-Confetti was mainly triggered by R26-loxCre-tdT, the expression of tdT and YFP/mCFP should be comparable. This should be clarified.

      Thank you so much for your careful check. We checked these signals carefully and didn't find the “much lower” tdT signal. As the file-loading website has a file size limitation, the compressed image results in some signal unclear. We attached clear high-resolution images here. Author response image 2 shows how we split the tdT signal and compared it with YFP/mCFP.

      Author response image 2.

      (8) In several cases, the authors seem to have mixed up "R26-roxCre-tdT" with "R26-loxCre-tdT". There are errors in #251 and #256.Furthermore, in the passage from line #278 to #301. In the lines #297 and #300 it should probably read "Alb-CreER; R26-loxCretdT; Ctnnb1flox/flox" rather than "Alb-CreER;R26-tdT2;Ctnnb1flox/flox".

      We are grateful for these careful observations. We have corrected these typos accordingly.

      Recommendations for the authors:

      Reviewer #1:

      (1) However, for it to be useful to investigators a more direct comparison with the Benedito iSure line (or the latest version) is required as that is the crux of the study.

      Thank you for emphasizing this point, which we have now addressed in the revised manuscript and Figure 6.

      (2) I would like to know how the authors will make these new lines available to outside investigators.

      Please contact the lead author by email to consult about the availability of new mouse lines developed in this study.

      (3) The discussion is overly long and fails to address potential weaknesses. Much of it reiterates what was already said in the results section.

      We are thankful for your critical evaluation, which has helped us improve our discussion.

      Reviewer #2:

      (1) Assessing the efficiency and accuracy of the lines in mosaic deletions of multiple alleles and reporting them in single cells after low-dose tamoxifen exposure would be highly beneficial to demonstrate the full potential of the models.

      We appreciate your careful consideration of this issue. Our future endeavors will focus on mosaic analysis utilizing sparse labeling and efficient gene deletion, employing both roxCre and loxCre strategies.

      (2) Performing FACS analysis to confirm that all targeted (Cre reporter-positive) cells are also tdT-positive would provide more precise data and avoid vague statements like 'virtually all' or 'almost complete' in the results section:

      Line 166: Although mCre efficiently labeled virtually all targeted cells (Figure S3A-E)...

      Line 293: ... and not a single tdT+ hepatocyte 293 expressed Cyp2e1 (Figure 6D)... However, the authors do not provide any quantification. FACS would be ideal here.

      Line 244: ...expression of beta-catenin and GS almost disappeared in the 4W mutant sample... The resolution in the provided PDF is not adequate for assessment.

      Line 296: ... revealed almost complete deletion of Ctnnb1 in the Alb-CreER;R26-tdT2;Ctnnb1flox/flox mice...

      Thank you for suggesting these improvements, which have strengthened the robustness of our conclusions. In the revised version, we have incorporated FACS results that correspond to related sections. Additionally, a quantification statement has been included in the statistical analysis section. We appreciate your meticulous review and comments, which have significantly improved the clarity of our manuscript.

      (3) In the beginning of the results section, it is not clear which results are from this study and which are known background information (like Figure 1A). For example, it is not clear if Figure 1C presents data from R26-iSuRe-Cre. Please revise the text to more clearly present the experimental details and new findings.

      Thank you for this observation. Figure 1C belongs to this study, and the revised version has been modified to the related statement for improved clarity.

      (4) Experimental details regarding the genetic constructs and genotyping of the new knock-in lines are missing. Are R26 constructs driven by the endogenous R26 promoter or were additional enhancers used?

      Thank you for emphasizing this point. The schematic figures and nucleotide sequences for the generation of mice can be found in the revised Supplementary Figure 9, which can help to address this issue.

      (5) The method used to quantify mCre activity in terms of reporter+ target cells is not specified. From images or by FACS?

      Additionally, if images were used for quantification, it would be important to provide details on the number of images analyzed, the number of cells counted per image, and how individual cells were identified.

      Thank you for your comment. We have included the quantification statement in the statistical analysis section. Analyzing R26-Confetti+ target cells using FACS is challenging due to the limitations of the sorting instrument. Consequently, we quantified the related data by images. Each dot on the chart represents one sample, and the quantification for each mouse was conducted by averaging the data from five 10x fields taken from different sections.

      (6) Line 160: These data demonstrate that roxCre was functionally efficient yet non-leaky. Functional efficiency in vivo was not shown in the preceding experiments.

      Functional efficiency in vivo can be referred to in Figures S1-S2 and S4C.

      (7) It would be useful to provide a reference for easy vs low-efficiency recombination of different reporter alleles (lines 56-58).

      We are grateful for this comment, as it has allowed us to improve the clarity of our explanation. Consequently, we have made the necessary modifications.

      (8) Discussion on the potential drawbacks and limitations of the lines would be useful.

      We are thankful for your evaluation, which has significantly contributed to the enhancement of our discourse.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This paper investigates host and viral factors influencing transmission of alpha and delta SARS-CoV-2 variants in the Syrian hamster model and fundamentally increases knowledge regarding transmission of the virus via the aerosol route. The strength of evidence is solid and could be improved with a clearer presentation of the data.

      We thank the editors for their assessment. We are excited to present a revised version of the manuscript with improved data presentation and an improved discussion addressing the reviewer’s concerns.

      Public Reviews:

      Reviewer #1 (Public Review):

      In the submitted manuscript, Port et al. investigated the host and viral factors influencing the airborne transmission of SARS-CoV-2 Alpha and Delta variants of concern (VOC) using a Syrian hamster model. The authors analyzed the viral load profiles of the animal respiratory tracts and air samples from cages by quantifying gRNA, sgRNA, and infectious virus titers. They also assessed the breathing patterns, exhaled aerosol aerodynamic profile, and size distribution of airborne particles after SARS-CoV-2 Alpha and Delta infections. The data showed that male sex was associated with increased viral replication and virus shedding in the air. The relationship between co-infection with VOCs and the exposure pattern/timeframe was also tested. This study appears to be an expansion of a previous report (Port et al., 2022, Nature Microbiology). The experimental designs were rigorous, and the data were solid. These results will contribute to the understanding of the roles of host and virus factors in the airborne transmission of SARS-CoV-2 VOCs.

      Reviewer #2 (Public Review):

      This manuscript by Port and colleagues describes rigorous experiments that provide a wealth of virologic, respiratory physiology, and particle aerodynamic data pertaining to aerosol transmission of SARS-CoV-2 between infected Syrian hamsters. The data is particularly significant because infection is compared between alpha and delta variants, and because viral load is assessed via numerous assays (gRNA, sgRNA, TCID) and in tissues as well as the ambient environment of the cage. The paper will be of interest to a broad range of scientists including infectious diseases physicians, virologists, immunologists and potentially epidemiologists. The strength of evidence is relatively high but limited by unclear presentation in certain parts of the paper.

      Important conclusions are that infectious virus is only detectable in air samples during a narrow window of time relative to tissue samples, that airway constriction increases dynamically over time during infection limiting production of fine aerosol droplets, that variants do not appear to exclude one another during simultaneous exposures and that exposures to virus via the aerosol route lead to lower viral loads relative to direct inoculation suggesting an exposure dose response relationship.

      While the paper is valuable, I found certain elements of the data presentation to be unclear and overly complex.

      Reviewer #1 (Recommendations For The Authors):

      We thank the reviewer for their comments and their attention to detail. We have taken the following steps to address their suggestions and concerns.

      However, the following concerns need to be issued.

      1. Summary seems to be too simple, and some results are not clearly described in the summary.

      We have edited the summary and hope to have addressed the concerns raised by providing more information. We think that the summary includes all relevant findings.

      “It remains poorly understood how SARS-CoV-2 infection influences the physiological host factors important for aerosol transmission. We assessed breathing pattern, exhaled droplets, and infectious virus after infection with Alpha and Delta variants of concern (VOC) in the Syrian hamster. Both VOCs displayed a confined window of detectable airborne virus (24-48 h), shorter than compared to oropharyngeal swabs. The loss of airborne shedding was linked to airway constriction resulting in a decrease of fine aerosols (1-10µm) produced, which are suspected to be the major driver of airborne transmission. Male sex was associated with increased viral replication and virus shedding in the air. Next, we compared the transmission efficiency of both variants and found no significant differences. Transmission efficiency varied mostly among donors, 0-100% (including a superspreading event), and aerosol transmission over multiple chain links was representative of natural heterogeneity of exposure dose and downstream viral kinetics. Co-infection with VOCs only occurred when both viruses were shed by the same donor during an increased exposure timeframe (24-48 h). This highlights that assessment of host and virus factors resulting in a differential exhaled particle profile is critical for understanding airborne transmission.”

      1. Aerosol transmission experiment should be described in Materials and Methods although it is cited as Reference 21#;

      We have modified Line 433:

      “Aerosol caging

      Aerosol cages as described by Port et al. [2] were used for transmission experiments and air sampling as indicated. The aerosol transmission system consisted of plastic hamster boxes (Lab Products) connected by a plastic tube. The boxes were modified to accept a 7.62 cm (3') plastic sanitary fitting (McMaster-Carr), which enabled the length between the boxes to be changed. Airflow was generated with a vacuum pump (Vacuubrand) attached to the box housing the naïve animals and was controlled with a float-type meter/valve (McMaster-Carr).”

      And Line 458: “During the first 5 days, hamsters were housed in modified aerosol cages (only one hamster box) hooked up to an air pump.”.

      Especially, one superspreading event of Alpha VOC (donor animal) was observed in iteration A (Figure 4). What causes that event, experiment system?

      Based on the observed variation in airborne shedding (of the cages from which this was directly measured), we believe that one plausible explanation for the super-spreading event was that the Alpha-infected donor shed considerably more virus during the exposure than other donors, and thus more readily infected the sentinels. That said, it is also conceivable that other factors such as hamster behavior (e.g., closeness to the cage outlet, sleeping) or variable sentinel susceptibility could affect the distribution of transmissions.

      1. Same reference is repeatedly listed as Refs 2 and 21#.

      Addressed. We thank the reviewer for their attention to detail. We have also removed reference 53, which was the same as 54.

      1. Two forms of described time (hour and h) are used in the manuscript. Single form should be chosen.

      This has been addressed.

      5) Virus designation located in line 371 and line 583 is inconsistent, and it needs to be revised.

      For consistency we have chosen this nomenclature for the viruses used: SARS-CoV-2 variant Alpha (B.1.1.7) (hCoV320 19/England/204820464/2020, EPI_ISL_683466) and variant Delta (B.1.617.2/) (hCoV-19/USA/KY-CDC-2-4242084/2021, EPI_ISL_1823618).

      1. In Figure 5F, what time were lung and nasal turbinate tissues collected after virus infection?

      This has been added to the legend. Day 5. Line 904.

      1. Line 562-563, what is the coating antigen (spike protein, generated in-house)? purified or recombinant protein?

      It is in-house purified recombinant protein. This has been added to the methods.

      1. Line 575 and line 578: 10,000x is not standard description, and it should be revised.

      Done.

      Reviewer #2 (Recommendations For The Authors):

      We thank the reviewer for their comments and suggestions to improve the manuscript, and hope we have addressed all concerns adequately.

      • Direct interpretation of the linear regression slope in Figure 3 is challenging. Is the most relevant parameter for transmission known? Intuitively, it would be the absolute number of small droplets at a given timepoint rather than the slope and it would be easier to interpret if the data were reported in this fashion.

      We decided to show a percentage of counts to normalize the data among animals, as we observed large inter-individual variation in counts. The reviewer is correct that it is most likely the number of particles that would be most relevant to transmission, though much (including the role of particle size) remains to be determined. We have added a sentence to the results which explains this in L157.

      Therefore, we decided in this first analysis to utilize the slope measurement and not raw counts. The focus was on the slopes and how particle profiles were changing post inoculation. Because we have focused on percentages, it seems not appropriate to present particle counts within each diameter range because the analysis, model, and results are based on these percentages of particles.

      Use of regression to compute slope is a useful measure because it uses data from all timepoints to estimate the regression line and, therefore, the % of particles on each day. We decided on these methods because efficiency is especially important in a study with a relatively small number of animals and slopes are also a good surrogate for how animal particle profiles are changing post-inoculation.

      To assist with the interpretation: 1) We removed Figure 3C and D and replaced Figure 3B with individual line plots for all conditions to visualize the slopes. The figure legend was corrected to reflect these changes.

      2) We replaced L169 onwards to read: (Figure 3B). Females had a steeper decline at an average rate of 2.2 per day after inoculation in the percent of 1-10 μm particles (and a steeper incline for <0.53 μm) when compared to males, while holding variant group constant. When we compared variant group while holding sex constant, we found that the Delta group had a steeper decline at an average rate of 5.6 per day in the percent of 1-10 μm particles (and a steeper incline for <0.53 μm); a similar trend, but not as steep, was observed for the Alpha group.

      The estimated difference in slopes for Delta vs. controls and Alpha vs. controls in the percent of <0.53 μm particles was 5.4 (two-sided adjusted p= 0.0001) and 2.4 (two-sided adjusted p = 0.0874), respectively. The estimated difference in slopes for percent of 1-10 μm particles was not as pronounced, but similar trends were observed for Delta and Alpha. Additionally, a linear mixed model was considered and produced virtually the same results as the simpler analysis described above; the corresponding linear mixed model estimates were the same and standard errors were similar.

      • Fig 4: what is "limit of quality" mentioned in the legend? Are these samples undetectable?

      We have clarified this in the legend: “3.3 = limit of detection for RNA (<10 copies/rxn)”. If samples have below 10 copy numbers per reaction, they are determined to be below the limit of detection. The limit of detection is 10 copy number/rxn. All samples below 10 copies/rxn are taken to be negative and set = 10 copies/rxn, which equals 3.3. Log10 copies/mL oral swab.

      • Fig 4C would be easier to process in graphical rather than tabular form. The meaning of the colors is unclear.

      We agree with the reviewer that this is difficult to interpret, but we are uncertain if the same data in a tabular format would be easier to digest. We realized that the legend was misplaced and have added this back into the figure, which we hope clarifies the colors and the limit of detection.

      • Figure 4D & E are uninterpretable. What do the pie charts represent?

      We have remodeled this part of the figure to a schematic representation of the majority variant which transmitted for each individual sentinel, and have added a table (Table S1) which summarizes the exact sequencing results for the oral swabs. The reviewer is correct that it was difficult to interpret the pie charts, considering most values are either 0 or close to 100%. We hope this addresses the question. The legend states:

      Author response image 1.

      Airborne attack rate of Alpha and Delta SARS-CoV-2 variants. Donor animals (N = 7) were inoculated with either the Alpha or Delta variant with 103 TCID50 via the intranasal route and paired together randomly (1:1 ratio) in 7 attack rate scenarios (A-G). To each pair of donors, one day after inoculation, 4-5 sentinels were exposed for a duration of 4 h (i.e., h 24-28 post inoculation) in an aerosol transmission set-up at 200 cm distance. A. Schematic figure of the transmission set-up. B. Day 1 sgRNA detected in oral swabs taken from each donor after exposure ended. Individuals are depicted. Wilcoxon test, N = 7. Grey = Alpha, teal = Delta inoculated donors. C. Respiratory shedding measured by viral load in oropharyngeal swabs; measured by sgRNA on day 2, 3, and 5 for each sentinel. Animals are grouped by scenario. Colors refer to legend below. 3.3 = limit of detection of RNA (<10 copies/rxn). D. Schematic representation of majority variant for each sentinel as assessed by percentage of Alpha and Delta detected in oropharyngeal swabs taken at day 2 and day 5 post exposure by deep sequencing. Grey = Alpha, teal = Delta, white = no transmission.

      • Fig S2G is uninterpretable. Please label and explain.

      We have now included an explanations of the figure S2F. The figure is a graphic representation of the neutralization data depicted in Figure S2F. The spacing between grid lines is 1 unit of antigenic distance, corresponding to a twofold dilution of serum in the neutralization assay. The resulting antigenic distance depicted between Alpha and Delta is roughly a 4-fold difference in neutralization between homologous (e.g., Alpha sera with the Alpha virus vs. heterologous, Alpha sera with the Delta virus).

      • I would consider emphasizing lines 220-225 in the summary and abstract. The important implication is that aerosol transmission is more representative of natural heterogeneity of exposure dose and downstream viral kinetics. This is an often-overlooked point.

      We agree with the reviewer and have added this in Line 43.

      • Fig 5: A cartoon similar to Fig 4A showing timing of sentinel exposure with number of animals would be helpful.

      We have added this as a new panel A for Figure 5. See the redrafted Figure 5 below.

      • For Fig 5E & F It would be helpful to use a statistical test to more formally assess whether proportion at exposure predicts proportion of variants in downstream sentinel infection.

      This has been added as a new Figure 5 panel H and I, which we hope addresses the reviewer’s comment.

      Author response image 2.

      Airborne competitiveness of Alpha and Delta SARS-CoV-2 variants. A. Schematic. Donor animals (N = 8) were inoculated with Alpha and Delta variant with 5 x 102 TCID50, respectively, via the intranasal route (1:1 ratio), and three groups of sentinels (Sentinels 1, 2, and 3) were exposed subsequently at a 16.5 cm distance. Animals were exposed at a 1:1 ratio; exposure occurred on day 1 (Donors  Sentinels 1) and day 2 (Sentinels  Sentinels). B. Respiratory shedding measured by viral load in oropharyngeal swabs; measured by gRNA, sgRNA, and infectious titers on days 2 and day 5 post exposure. Bar-chart depicting median, 96% CI and individuals, N = 8, ordinary two-way ANOVA followed by Šídák's multiple comparisons test. C/D/E. Corresponding gRNA, sgRNA, and infectious virus in lungs and nasal turbinates sampled five days post exposure. Bar-chart depicting median, 96% CI and individuals, N = 8, ordinary two-way ANOVA, followed by Šídák's multiple comparisons test. Dark orange = Donors, light orange = Sentinels 1, grey = Sentinels 2, dark grey = Sentinels 3, p-values indicated where significant. Dotted line = limit of quality. F. Percentage of Alpha and Delta detected in oropharyngeal swabs taken at days 2 and day 5 post exposure for each individual donor and sentinel, determined by deep sequencing. Pie-charts depict individual animals. Grey = Alpha, teal = Delta. G. Lung and nasal turbinate samples collected on day 5 post inoculation/exposure. H. Summary of data of variant composition, violin plots depicting median and quantiles for each chain link (left) and for each set of samples collected (right). Shading indicates majority of variant (grey = Alpha, teal = Delta). I. Correlation plot depicting Spearman r for each chain link (right, day 2 swab) and for each set of samples collected across all animals (left). Colors refer to legend on right. Abbreviations: TCID, Tissue Culture Infectious Dose.”

      We have additionally added to the results section: L284: “Combined a trend, while not significant, was observed for increased replication of Delta after the first transmission event, but not after the second, and in the oropharyngeal cavity (swabs) as opposed to lungs (Figure 5H) (Donors compared to Sentinels 1: p = 0.0559; Donors compared to Sentinels 2: p = >0.9999; Kruskal Wallis test, followed by Dunn’s test). Swabs taken at 2 DPI/DPE did significantly predict variant patterns in swabs on 5 DPI/DPE (Spearman’s r = 0.623, p = 0.00436) and virus competition in the lower respiratory tract (Spearman’s r = 0.60, p = 0.00848). Oral swab samples taken on day 5 strongly correlate with both upper (Spearman’s r = 0.816, p = 0.00001) and lower respiratory tract tissue samples (Spearman’s r = 0.832, p = 0.00002) taken on the same day (Figure 5I).”

      • Fig 1A: how are pfu/hour inferred? This is somewhat explained in the supplement, but I found the inclusion of model output as the first panel confusing and am still not 100% clear how this was done. Consider, explaining this in the body of the paper.

      We have added a more detailed explanation of the PFU/h inference to the main text: The motivation for the model was to link more readily measurable quantities such as RNA measured in oral swabs to the quantity of greatest interest for transmission (infectious virus per unit time in the air). To do this, we jointly infer the kinetics of shed airborne virus and parameters relating observable quantities (infected sentinels, plaques from purified air sample filters) to the actual longitudinal shedding. The inferential model uses mechanistic descriptions of deposition of infectious virus into the air, uptake from the air, and loss of infectious virus in the environment to extract estimates of the key kinetic parameters, as well as the resultant airborne shedding, for each animal.

      We have added this information to L106 in the results and hope this clarifies the rationale and execution of the model.

      More minor points:

      • Line 292: "poor proxy" seems too strong as peak levels of viral RNA correlate with positive airway cultures. It might be more accurate to say that high levels of viral RNA during early infection only somewhat correlate with positive airway cultures.

      We have rephrased this to clarify that while peak RNA viral loads are predictive of positive cultures, measuring RNA, especially early during infection and only once, may not be sufficient to infer the magnitude or time-dependence of infectious virus shedding into the air. See Line 308: “We found that swab viral load measurements are a valuable but imperfect proxy for the magnitude and timing of airborne shedding. Crucially, there is a period early in infection (around 24 h post-infection in inoculated hamsters) when oral swabs show high infectious virus titers, but air samples show low or undetectable levels of virus. Viral shedding should not be treated as a single quantity that rises and falls synchronously throughout the host; spatial models of infection may be required to identify the best correlates of airborne infectiousness [32]. Attempts to quantify an individual’s airborne infectiousness from swab measurements should thus be interpreted with caution, and these spatiotemporal factors should be considered carefully.”

      • Line 352: Re is dependent on time of an outbreak (population immunity) and cannot be specified for a given variant as it depends on multiple other variables

      We agree that the current phrasing here could be interpreted to suggest, incorrectly, that Re is an intrinsic property of a variant. We have deleted that language and reworded the section to emphasize that the critical question is heterogeneity in transmission, not mean reproduction number. Line 348: “Moreover, at the time of emergence of Delta, a large part of the human population was either previously exposed to and/or vaccinated against SARS-CoV-2; that underlying host immune landscape also affects the relative fitness of variants. Our naïve animal model does not capture the high prevalence of pre-existing immunity present in the human population and may therefore be less relevant for studying overall variant fitness in the current epidemiological context. Analyses of the cross-neutralization between Alpha and Delta suggest subtly different antigenic profiles [35], and Delta’s faster kinetics in humans may have also helped it cause more reinfections and “breakthrough” infections [36].

      Our two transmission experiments yielded different outcomes. When sentinel hamsters were sequentially exposed, first to Alpha and then to Delta, generally no dual infections—both variants detectable—were observed. In contrast, when we exposed hamsters simultaneously to one donor infected with Alpha and another infected with Delta, we were able to detect mixed-variant virus populations in sentinels in one of the cages (Cage F, see Appendix figures S1, S2). The fact that we saw both single-lineage and multi-lineage transmission events suggests that virus population bottlenecks at the point of transmission do indeed depend on exposure mode and duration, as well as donor host shedding. Notably, our analysis suggests that the Alpha-Delta co-infections observed in the Cage F sentinels could be due to that being the one cage in which both the Alpha and the Delta donor shed substantially over the course of the exposure (Appendix figures S2, S3). Mixed variant infections were not retained equally, and the relative variant frequencies differed between investigated compartments of the respiratory tract, suggesting roles for randomness or host-and-tissue specific differences in virus fitness.

      A combination of host, environmental and virus parameters, many of which vary through time, play a role in virus transmission. These include virus phenotype, shedding in air, individual variability and sex differences, changes in breathing patterns, and droplet size distributions. Alongside recognized social and environmental factors, these host and viral parameters might help explain why the epidemiology of SARS-CoV-2 exhibits classic features of over-dispersed transmission [37]. Namely, SARS-CoV-2 circulates continuously in the human population, but many transmission chains are self-limiting, while rarer superspreading events account for a substantial fraction of the virus’s total transmission. Heterogeneity in the respiratory viral loads is high and some infected humans release tens to thousands of SARS-CoV-2 virions/min [38, 39]. Our findings recapitulate this in an animal model and provide further insights into mechanisms underlying successful transmission events. Quantitative assessment of virus and host parameters responsible for the size, duration and infectivity of exhaled aerosols may be critical to advance our understanding of factors governing the efficiency and heterogeneity of transmission for SARS-CoV-2, and potentially other respiratory viruses. In turn, these insights may lay the foundation for interventions targeting individuals and settings with high risk of superspreading, to achieve efficient control of virus transmission [40].”

      • The limitation section should mention that this animal model does not capture the large prevalence of pre-existing immunity at present in the population and may therefore be less relevant in the current epidemiologic context.

      We agree and have added this more clearly, see response above.

      • Limitation: it is unclear if airway and droplet dynamics in the hamster model are representative of humans.

      We have added the following sentence: Line 331: “It remains to be determined how well airway and particle size distribution dynamics in Syrian hamsters model those in humans.”

      • The mathematical model is termed semi-mechanistic but I think this is not accurate as the model appears to have no mechanistic assumptions.

      We describe the model as semi-mechanistic because it uses mechanistic descriptions of the shedding and uptake process (as described above), incorporating factors including respiration rate and environmental loss, and makes the mechanistic assumption that measurable swab and airborne shedding all stem from a shared within-host infection process that produces exponential growth of virus up to a peak, followed by exponential decay. The model is only semi-mechanistic, however, as we do not attempt a full model of within-host viral replication and shedding (e.g. a target-cell limited virus kinetics model).

    1. Author Response

      The following is the authors’ response to the previous reviews.

      We thank the reviewers for their reading of the manuscript, and their suggestions. We have extensively addressed all these concerns in the text, and also included several new data and figures in the revised version of the manuscript. We hope that our response and the new experimental data fully address the concerns raised by the reviewers. We include a detailed, pointby-point response to each of the reviewer concerns, pointing to new data and specific changes made in the main manuscript.

      Note: Do note that these new data have resulted in a new figure-figure 6, a new supplementary figure -figure 2-figure supplement 2, and an increase in the number of panels in each figure, as well as supplementary figures.

      General response comments, highlighting a few aspects missed by the reviewers

      This manuscript has an enormous amount of data in it. This is understandable, since in part we are proposing an entirely new hypothesis, and way to think about mitochondrial repression, built around substantial circumstantial evidences from diverse literature sources. But to keep the narrative readable and the main idea understandable, a lot of information had to be only very briefly mentioned in the text, and is therefore included as supplemental information. Due to that, it may not always be apparent that this study has set several technical benchmarks. These experiments are extremely challenging to perform, took many iterations to standardize, and in themselves are a first in the field. Yeast cells have the highest known rate of glycolytic flux for any organism. Measuring this glycolytic rate using the formation of intermediates is hard, and all current estimates have been in vitro, and using a stop-flow type set up. In this study, we optimized and directly measured the glycolytic flux using isotope labelled glucose (13C-glucose), which has never been reported before in highly glycolytic cells such as yeast. This is due to the very rapid label saturation (within seconds) after 13C glucose pulse (as is now shown in the figure 2-figure supplement 1). For brevity, this is summarized in this study with sufficient information to reproduce the method, but we will put out a more detailed, associated methodology paper describing several challenges, infrastructure requirements, and resources to be able to carry out these types of experiments using yeast. An added highlight of these experiments with WT and Ubp3 deletion strains is the most direct till date experimental demonstration that glycolytic flux in yeast in high glucose follows zero-order kinetics, and depends entirely on the amounts of the glycolytic enzymes (presumably operating at maximal activity). This nicely complements the recent study by Grigatis 2022 (cited in the discussion), that suggests this possibility.

      Separately, this study required the estimation of total inorganic phosphates, as well as mitochondrial pools of phosphates. Till date, there are no studies that have estimated mitochondrial pools of phosphate (for a variety of reasons). In this study, we also experimentally determined the changes in mitochondrial phosphate pools. For this, we had to establish and standardize a rapid mitochondrial isolation method in yeast. Thus, this study provides the first quantitative estimates of mitochondrial Pi amounts (in the context of measured mitochondrial outputs), as shown now in Figure 4. This component on mitochondrial isolation in yeast to assess metabolites may also be explored in future as a methods paper.

      Specific responses to the Reviews:

      Reviewer #1 (Public Review):

      The study by Vengayil et al. presented a role for Ubp3 for mediating inorganic phosphate (Pi) compartmentalization in cytosol and mitochondria, which regulates metabolic flux between cytosolic glycolysis and mitochondrial processes. Although the exact function of increased Pi in mitochondria is not investigated, findings have valuable implications for understanding the metabolic interplay between glycolysis and respiration under glucose-rich conditions. They showed that UBP3 KO cells regulated decreased glycolytic flux by reducing the key Pidependent-glycolytic enzyme abundances, consequently increasing Pi compartmentalization to mitochondria. Increased mitochondria Pi increases oxygen consumption and mitochondrial membrane potential, indicative of increased oxidative phosphorylation. In conclusion, the authors reported that the Pi utilization by cytosolic glycolytic enzymes is a key process for mitochondrial repression under glucose conditions.

      (1) However, the main claims are only partially supported by the low number of repeats and utilizing only one strain background, which decreased the overall rigor of the study. The fullpower yeast model could be utilized with testing findings in different backgrounds with increased biological repeats in many assays described in this study. In the yeast model, it has been well established that many phenotypes are genotype/strain dependent (Liti 2019, Gallone 2016, Boekout 2021, etc...). with some strains utilizing mitochondrial respiration even under high glucose conditions (Kaya 2021). It would be conclusive to test whether wild strains with increased respiration under high glucose conditions would also be characterized by increased mitochondrial Pi.

      “However, the main claims are only partially supported by the low number of repeats and utilizing only one strain background, which decreased the overall rigor of the study. The full-power yeast model could be utilized with testing findings in different backgrounds with increased biological repeats in many assays described in this study.”

      Thank you for the suggestion. We agree that a larger, universal statement cannot be made with data from a single strain, since yeasts do have substantial diversity. In this study, we had originally used a robust, prototrophic industrial strain (CEN.PK background). We have now utilized multiple, diverse strains of S. cerevisiae to test our findings. This includes strains from the common laboratory backgrounds – W303 and BY4742 – which have different auxotrophies, as well as another robust, highly flocculent strain from a prototrophic Σ1278 background. Using all these strains, we now comprehensively find that the role of altered Pi budgeting as a constraint for mitochondrial respiration, and the role of Ubp3 as a regulator of mitochondrial repression is very well conserved. In all tested strains of S. cerevisiae the loss of Ubp3 increases mitochondrial activity (as shown by increased mitochondrial membrane potential and increased Cox2 levels in Figure 6A, B). These data now expand the generality of our findings, and strengthen the manuscript. These results are included in the revised manuscript as a new figure- Figure 6 and the associated text.

      Some of the included data in the revised manuscript are shown below:

      Author response image 1.

      Mitochondrial activity and Cox2 levels in ubp3Δ in different genetic backgrounds

      We also used the W303 strain to assess Pi levels, and its role in increasing mitochondrial respiration. We find that the loss of Ubp3 in this genetic background also increases Pi levels and that the increased Pi is necessary for increasing mitochondrial respiration (Figure 6C, D).

      Author response image 2.

      Basal OCR in WT vs ubp3Δ (W303 strain background) in normal vs low Pi

      These experiments collectively have strengthened our findings on the critical role of intracellular Pi budgeting as a general constraint for mitochondrial respiration in high glucose.

      “It would be conclusive to test whether wild strains with increased respiration under high glucose conditions would also be characterized by increased mitochondrial Pi.”

      Addressed partially above. Right now the relative basal respiration in glucose across different strains is not well known. We measured mitotracker activity in high glucose in multiple WT strains of S. cerevisiae (W303, Σ1278, S288C and BY4742, compared to the CEN.PK strain). These strains all largely had similar mitotracker potential, except for a slight increase in mitochondrial membrane potential in Σ1278 strain, but not in other strains. We further characterized this using Cox2 protein levels as well as basal OCR, and found that these do not increase. These data is shown below, and is not included in the main text since it does not add any new component to the study.

      Author response image 3.

      Mitochondrial respiration in different WT strains

      We did find this suggestion very interesting though, and are exploring directions for future research based on this suggestion. Since we have now identified a role for intracellular Pi allocation in regulating the Crabtree effect, an interesting direction can be to understand the glucose dependent mitochondrial Pi transport in Crabtree negative yeast strains. We will have to bring in a range of new tools and strains for this, so these experiments are beyond the focus of this current study.

      We hope that these new experiments in different genetic backgrounds increases the breadth and generality of our findings, and stimulates new lines of thinking to address how important the role of Pi budgeting as a constraint for mitochondrial repression in high glucose might be.

      (2) It is not described whether the drop in glycolytic flux also affects TCA cycle flux. Are there any changes in the pyruvate level? If the TCA cycle is also impaired, what drives increased mitochondrial respiration?

      Thank you for pointing this out, and we agree this should be included. We have addressed these concerns in the revised version of the manuscript

      Since glucose derived pyruvate must enter the mitochondrial TCA cycle, one possibility is that a decrease in glycolytic rate could decrease the TCA flux. An alternate possibility is that the cells coincidently increase the pyruvate transport to mitochondria, to thereby maintain the TCA cycle flux comparable to that of WT cells. To test both these possibilities, we first measured the steady state levels of pyruvate and TCA cycle intermediates in WT vs ubp3Δ cells. We do not observe any significant change in the levels of pyruvate, or TCA cycle intermediates (except malate, which showed a significant decrease in ubp3Δ cells). This data is now included in the revised manuscript as Figure 2 – figure supplement 1D and figure supplement 2 A, along with associated text.

      Author response image 4.

      Pyruvate levels in WT vs ubp3Δ

      Author response image 5.

      Steady state TCA cycle intermediate levels

      Next, in order to address if the TCA cycle flux is impaired in ubp3Δ cells, we also measured the TCA cycle flux in WT vs ubp3Δ cells by pulsing the cells with 13C glucose and tracking 13C label incorporation from glucose into TCA cycle intermediates. This experiment first required substantial standardization, for the time of cell collection and quenching post 13C glucose addition, by measuring the kinetics of 13C incorporation into TCA cycle intermediates at different time points after 13C glucose addition. The standardization of this method is now included in the revised manuscript as Figure 2 – figure supplement 2 C, along with associated text, and is shown below for reference.

      Author response image 6.

      Kinetics of 13C labelling in TCA cycle intermediates

      Actual TCA cycle flux results: For measuring the TCA cycle flux, cells were treated with 1% 13C glucose, quenched and samples were collected at 7 mins post glucose addition which is in the linear range of 13C label incorporation (Figure 2- Figure 2 – figure supplement 2 C).

      Result: We did not observe any significant changes in the relative 13C label incorporation in TCA cycle intermediates. This data is included in the revised manuscript as Figure 2 – figure supplement 2 D, along with associated text, and is below for your reference.

      Author response image 7.

      TCA cycle flux

      What these data show is that the TCA cycle flux itself is not altered in ubp3Δ. A likely interpretation of this data is that this is due to the increase in the pyruvate transport to mitochondria in ubp3Δ cells, as indicated by the ~10-fold increase in Mpc3 (mitochondrial pyruvate transporter) protein levels (shown in Figure 5-figure supplement 5H), allowing the net same amount of pyruvate into the mitochondria. This increased mitochondrial pyruvate transport could support maintaining the TCA flux in ubp3Δ cells, and supporting the increased respiration. Putting a hierarchy together, the increased respiration in ubp3Δ cells could therefore be primarily due to increased Pi transport, followed by a consequent increase in ETC proteins. We leave it to the readers of this study to make this conclusion.

      We hope that we have addressed all concerns that the reviewer has with respect to TCA cycle flux in ubp3Δ cells.

      (3) In addition, some of the important literature was also missed in citation and discussion. For example, in a recent study (Ouyang et al., 2022), it was reported that phosphate starvation increases mitochondrial membrane potential independent of respiration in yeast and mammalian cells, and some of the conflicting results were presented in this study.

      We are very aware of the recent study by Ouyang et al, which reports that Pi starvation increases mitochondrial membrane potential independent of respiration. However, this study is distinct from the context of our case due to the reasons listed below.

      (a) The reviewer may have misinterpreted our low Pi condition as Pi starvation. There is no Pi ‘starvation’ in this study. Here, we cultured ubp3Δ and tdh2Δtdh3Δ cells in a low Pi medium with 1 mM Pi concentration in order to bring down the intracellular free Pi to that of WT levels. These cells are therefore not Pi-starved, but have been manipulated to have the same intracellular Pi levels as that of WT cells, as shown in Figure 4-figure supplement 1D. The Pi concentration in the medium is still in the millimolar range, and the cells are grown in this medium for a short time (~4 hrs) till they reach OD600 ~ 0.8. This is entirely different from the conditions used in Ouyang et al., 2022, where the cells were grown in a Pi-starvation condition with 1-100 micromolar Pi in the medium for a time duration of 6-8 hrs. Since cells respond differentially to changes in Pi concentrations over time (Vardi et al., 2014), the response to low Pi vs Pi starvation will be completely different.

      (b) In our study, mitochondrial membrane potential is used as only one of the readouts for mitochondrial activity. Our estimations of mitochondrial respiration are established by including other measurements such as Cox2 protein levels (as an indicator of the ETC) and basal OCR measurements (measuring respiration), all of which provide distinct information. The mitochondrial membrane potential can be regulated independent of mitochondrial respiration state (Liu et al., 2021), using membrane potential alone as a readout to estimate mitochondrial respiration can therefore be limiting in the information it provides. As indicated earlier, mitochondrial membrane potential can change, independent of mitochondrial respiration (Ouyang et al., 2022) and ATP synthesis (Liu et al., 2021). Since the focus of our study is mitochondrial respiration, and not just the change in membrane potential, making conclusions based on potential alone are ambiguous. Most studies in the field have in fact not used the comprehensive array of distinct estimates that we use in this study, and we believe the standards set in this study should become a norm for the field.

      (c) The only mutant that is similar to the Ouyang et al study is the Mir1 deletion mutant, which results in acute Pi starvation in mitochondria. In this strain, we find an increase in mitochondrial membrane potential. The data is not included in the manuscript but is shown below.

      Author response image 8.

      Mitochondrial potential in WT vs mir1Δ

      As clear from this data, mitochondrial membrane potential is significantly high in mir1Δ cells. However, the basal OCR and Cox2 protein levels clearly show decreased mitochondrial respiration which is expected in this mutant (Figure 5 A,B). This in fact highlights the limitations of solely relying on mitochondrial membrane potential measurements to draw conclusions, as doing so will lead to a misinterpretation of the actual mitochondrial activity in these cells. We do not wish to highlight limitations in other studies, but hope we make our point clear.

      (4) An additional experiment with strains lacking mitochondrial DNA under phosphate-rich and restricted conditions would further strengthen the result.

      Strains lacking mitochondrial DNA (Rho0 cells) cannot express the mitochondrially encoded ETC subunit proteins. These strains are therefore incapable of performing mitochondrial respiration. Since Rho0 cells are known to utilize alternate mechanisms to maintain their mitochondrial membrane potential (Liu et al., 2021), using mitotracker fluorescence as a readout of mitochondrial respiration in these strains under different Pi conditions is inconclusive and misleading due to the reasons mentioned in point number 3(b and c). However, since this was a concern raised by the reviewer, we now measured basal OCR in WT and Rho0 strains with Ubp3 deletion under normal vs low Pi medium. As expected, Rho0 cells show extremely low basal OCR values, an entire order of magnitude lower than WT cells. At these very low (barely detectable) levels the deletion of Ubp3 or change in Pi concentration in the medium does not change basal OCR, since these strains are not capable of respiration. We have included this data as Figure 4-figure supplement 1G.

      Author response image 9.

      Basal OCR in Rho0 cells

      (5) Western blot control panels should include entire membrane exposure, and non-cut western blots should be submitted as supplementary.

      The non-cut western blot images and the loading controls are now included in the revised manuscript as a supplementary file 2.

      (6) In Figure 4, it is shown that Pi addition decreases basal OCR to the WT level. However, the Cox2 level remains significantly higher. This data is confusing as to whether mitochondrial Pi directly regulates respiration or not.

      As described in the previous point, the Cox2 levels and the OCR provide distinct pieces of information. In figure 4, we show that culturing ubp3Δ in low Pi significantly decreases both Cox2 protein levels and basal OCR. Since Cox2 protein levels and basal OCR are different readouts for mitochondrial activity, there could be differences in the extent by which Pi availability controls each of these factors. Basal OCR is a direct readout for mitochondrial respiration, and is regulated by multiple factors including ETC protein levels, rate of ATP synthesis, rate of Pi transport etc. In figure 4, we find that culturing ubp3Δ in low Pi decreases basal OCR to WT level. This strongly suggests that high Pi levels are necessary to increase basal OCR in ubp3Δ.

      (7) Representative images of Ubx3 KO and wild-type strains stained with CMXRos are missing.

      Thank you for noticing this. This data is now included in the revised manuscript as Figure 1figure supplement 1C.

      Author response image 10.

      (8) Overall, mitochondrial copy number and mtDNA copy number should be analyzed in WT and Ubo3 KO cells as well as Pi-treated and non-treated cells, and basal OCR data should be normalized accordingly. The reported normalization against OD is not appropriate.

      This is a valid concern raised by the reviewer, and something we had extensively considered during the study. To normalize the total mitochondrial amounts in each strain, we always measure the protein levels of the mitochondrial outer membrane protein Tom70. While we had described this in the methods, it may not have been obvious in the text. But this information is included in Figure 1-figure supplement 1G. We did not observe any significant change in Tom70 levels, suggesting that the total mitochondrial amount does not change in ubp3Δ, and we have noted this in the manuscript (results section relevant to Figure 1). As an additional control, to directly measure the mitochondrial amount in these conditions, we have now measured the mitochondrial volume in ubp3Δ cells and WT cells treated with Pi. For this, we used a strain which encodes mitochondria targeted with mNeon green protein (described in Dua et al., JCB, 2023), and which can therefore independently assess total mitochondrial amount. We do not observe any changes in mitochondrial volume or amounts in ubp3Δ cells or WT+Pi, compared to that of WT cells. Therefore, the change in mitochondrial respiration in Ubp3 deletion and Pi addition are not due to changes in total amounts of mitochondria in these conditions. Given all these, the normalization of basal OCR using total cell number is therefore the most appropriate way for normalization. This is also conventionally used for basal OCR normalization in multiple studies.

      We have now included these additional data on mitochondrial volumes and amounts in the revised manuscript as Figure1-figure supplement 1F and Figure5-figure supplement 1D, and associated text, and is shown below.

      Author response image 11.

      Mitochondrial volume in WT vs ubp3Δ cells

      Author response image 12.

      Mitochondrial volume in WT and WT+Pi

      These data collectively address the reviewer’s concerns regarding changes in mitochondrial amounts in all the conditions and strains used in this study.

      Reviewer #2 (Public Review):

      Summary:

      Cells cultured in high glucose tend to repress mitochondrial biogenesis and activity, a prevailing phenotype type called Crabree effect that is observed in different cell types and cancer. Many signaling pathways have been put forward to explain this effect. Vengayil et al proposed a new mechanism involved in Ubp3/Ubp10 and phosphate that controls the glucose repression of mitochondria. The central hypothesis is that ∆ubp3 shifts the glycolysis to trehalose synthesis, therefore leading to the increase of Pi availability in the cytosol, then mitochondria receive more Pi, and therefore the glucose repression is reduced.

      Strengths:

      The strength is that the authors used an array of different assays to test their hypothesis. Most assays were well-designed and controlled.

      Weaknesses:

      I think the main conclusions are not strongly supported by the current dataset.

      (1) Although the authors discovered ∆ubp3 cells have higher Pi and mitochondrial activity than WT in high glucose, it is not known if WT cultured in different glucose concentration also change Pi that correlate with the mitochondrial activity. The focus of the research on ∆ubp3 is somewhat artificial because ∆ubp3 not only affects glycolysis and mitochondria, but many other cellular pathways are also changed. There is no idea whether culturing cells in low glucose, which derepress the mitochondrial activity, involves Ubp3 or not. Similarly, the shift of glycolysis to trehalose synthesis is also not relevant to the WT cells cultured in a low-glucose situation. “The focus of the research on ∆ubp3 is somewhat artificial because ∆ubp3 not only affects glycolysis and mitochondria, but many other cellular pathways are also changed. There is no idea whether culturing cells in low glucose, which de-repress the mitochondrial activity, involves Ubp3 or not.”

      We would like to clarify that the focus of this research is not on Ubp3, or to address mechanistic aspects of how Ubp3 regulates mitochondrial activity, or to identify the targets of Ubp3. That would be an entirely distinct study, with a very different approach.

      In this study, while carrying out a screen, we serendipitously found that ubp3Δ cells showed an increase in mitochondrial activity in high glucose. Subsequently, we used this observation, bolstered by diverse orthogonal approaches, to identify a general, systems-level principle that governs mitochondrial repression in high glucose. Through this, we identify a role of phosphate budgeting as a controller of mitochondrial repression in high glucose. In this study, our entire focus has been to use orthogonal approaches, as well as parsimonious interpretations, to establish this new hypothesis as a possibility. We hope this idea, supported by these data, will now enable researchers to pursue other experiments to establish the generality of this phenomenon.

      We have not focused our effort in identifying the role of Ubp3, or its regulation upon changes in glucose concentration in this context. That is a very specific, and separate effort, and misses the general point we address here. It is entirely possible that Ubp3 might also regulate mitochondrial activity by additional mechanisms other than mitochondrial Pi availability (such as via the reduction of key glycolytic enzymes at nodes of glycolysis, resulting in reduced glycolytic flux and rerouted glucose metabolism). Had the goal been to identify Ubp3 substrates, it is very likely that we would not have found the role of Pi homeostasis in controlling mitochondrial respiration. This is particularly because the loss of Ubp3 does not result in an acute disruption of glycolysis, unlike say a glycolytic enzyme mutant, which would have resulted in severe effects on growth and overall metabolic state. This would have made it difficult to dissect out finer details of metabolic principles that regulate mitochondrial respiration.

      In order to further corroborate our findings, we used the glycolysis defective mutant tdh2Δtdh3Δ cells, where we find a similar change in Pi balance. This complements the key observations made using ubp3Δ cells. Distinctly, we utilized the glycolytic inhibitor 2DG to independently assess the role of mitochondrial Pi transport in regulating respiration. Together, in this study we do not just relying on genetic mutants, but combine the Ubp3 deletion strain with a reduced GAPDH activity strain, and pharmacologic inhibition of glycolysis. Distinctly, we find that mitochondrial Pi transporter levels are repressed under high glucose (Figure 5C, Figure 5-figure supplement 1B). Further, we find that mitochondrial Pi transport is important in increasing mitochondrial respiration upon shift to low glucose and glycolytic inhibition by 2-DG. Therefore, we collectively unravel a more systems level principle that regulates glucose mediated mitochondrial repression, as opposed to a mechanistic study of Ubp3 targets.

      Of course, given the conservation of Ubp3, we are very excited to pursue a mechanistic study of Ubp3 targets in future. This is a general challenge for deubiquitinase enzymes, and till date there are very few bona fide substrates known for any deubiquitinase enzyme, from any cellular system (due to challenges in the field that we discuss separately, and have included in the discussion section of this text).

      “Similarly, the shift of glycolysis to trehalose synthesis is also not relevant to the WT cells cultured in a low-glucose situation”

      The reviewer is correct in pointing out that in low-glucose, the shift to trehalose synthesis might not be as relevant. We observe that the glycolysis defective mutant tdh2Δtdh3Δ cells does not show an increase in trehalose synthesis (Figure 3-figure supplement 1E). However, in this context, the decrease in the rate of GAPDH catalysed reaction alone appears to be sufficient to increase the Pi levels (Figure 3F) even without an increase in trehalose. Therefore, there might be differences in the relative contributions of these two arms towards Pi balance, based on whether it is low glucose in the environment, or a mutant such as ubp3 that modulates glycolytic flux. In ubp3Δ cells, the combination of low rate of GAPDH catalyzed reaction and high trehalose will happen (based on how glycolytic flux is modulated), vs only the low rate of GAPDH catalyzed reaction in tdh2Δtdh3Δ cells. As an end point the increase in Pi happens in both cases, but with slightly differing outcomes. It is also to be noted that in terms of free Pi sources a low-glucose condition (with low glycolytic rate) is very different from a no-glucose, respiratory condition (where cells perform very high gluconeogenesis). In high respiration conditions such as ethanol, cells switch to high gluconeogenesis, where there is a huge increase trehalose synthesis as a default (eg see Varahan et al 2019). In this condition, trehalose synthesis could be a major source for Pi (eg see Gupta 2021), and could support the increased mitochondrial respiration. In an ethanol medium, the directionality of GAPDH reaction is reversed. Therefore, this reaction will also now become an added source of Pi, instead of a consumer of Pi (see illustration in Figure 3G). Therefore, a reasonable interpretation is that a combination of increased trehalose and increased 1,3 BPG to G3P conversion can be a major Pi source to increasing mitochondrial respiration in a non-glucose, respiratory medium.

      “it is not known if WT cultured in different glucose concentration also change Pi that correlate with the mitochondrial activity”

      This is valid point raised by the reviewer. We have already found that the protein levels of mitochondrial Pi transporter is increased in a non-glucose respiratory (ethanol) medium and a low (0.1%) glucose medium (see Figure 5C, Figure5-figure supplement 1B). In addition, we have tried measuring mitochondrial Pi levels in cells grown in a high glucose medium vs a respiratory, ethanol medium. The results are shown below for the reviewer’s reference. Reviewer response image 3 – Mitochondrial Pi levels in ethanol vs glucose

      Author response image 13.

      We observe a clear trend where mitochondrial Pi levels are high in cells grown in ethanol medium compared to that of cells grown in high glucose. However, the estimation of Pi, and normalising the Pi levels in isolated mitochondria is extremely difficult in this condition (note that this has never been done before). This is likely due to a rapid rate of conversion of ADP and Pi to ATP (in ethanol) which increases the variation in the estimation of steady state Pi levels, and the high amounts of mitochondria in ethanol grown cells. Since the date shows high variation, we have not included this data in the manuscript, but we are happy to include it here in the response.

      Indeed, this study opens up the exciting question of addressing how intracellular Pi allocation is regulated in different conditions of glucose. This can be further extended to Crabtree negative strains such as K. lactis which do not show mitochondrial repression in high glucose. All of these are rich future research programs.

      (2) The central hypothesis that Pi is the key constraint behind the glucose repression of mitochondrial biogenesis/activity is supported by the data that limiting Pi will suppress mitochondrial activity increase in these conditions (e.g., ∆ubp3). However, increasing the Pi supply failed to increase mitochondrial activity. The explanation put forward by the authors is that increased Pi supply will increase glycolysis activity, and somehow even reduce the mitochondrial Pi. I cannot understand why only the increased Pi supply in ∆ubp3, but not the increased Pi by medium supplement, can increase mitochondrial activity. The authors said "...that ubp3Δ do not increase mitochondrial Pi by merely increasing the Pi transporters, but rather by increasing available Pi pools". They showed that ∆ubp3 mitochondria had higher Pi but WT cells with medium Pi supplement showed lower Pi, it is hard to understand why the same Pi increase in the cytosol had a different outcome in mitochondrial Pi. Later on, they showed that the isolated mito exposed to higher Pi showed increased activity, so why can't increased Pi in intact cells increase mito activity? Moreover, they first showed that ∆ubp3 had a Mir1 increase in Fig3A, then showed no changes in FigS4G. It is very confusing.

      “I cannot understand why only the increased Pi supply in ∆ubp3, but not the increased Pi by medium supplement, can increase mitochondrial activity.”

      This is an interesting point, that requires a nuanced explanation, which we try to provide below.

      For mitochondrial respiration to increase in the presence of high Pi, the cytosolic Pi has to be transported to the mitochondria sufficiently. In ubp3Δ the increased free Pi (as a consequence of rewired glycolysis) is transported to the mitochondria (Figure 4). This increased mitochondrial Pi can therefore increase mitochondrial respiration in ubp3Δ.

      In case of WT+Pi, the externally supplemented Pi cannot further enter mitochondria (as shown in Figure 5-Figure supplement 1C) and is most likely restricted to the cytosol. Because of this inability of the Pi to access mitochondria, the mitochondrial respiration does not increase in WT+Pi (Figure 5-Figure supplement 1E).

      The likely reason for this difference in mitochondrial Pi transport in ubp3Δ vs WT+Pi is the relative difference in their glycolytic rate. The glycolytic rate is inherently decreased in ubp3Δ, but not in WT+Pi. To dissect this possibility of glycolytic rate itself contributing to the Pi availability in the mitochondria, we inhibited glycolysis in WT cells (using 2DG), and then supplemented Pi. Compared to cells in the same glucose condition (with 2DG, but without supplementing excess Pi), now the WT+Pi (+2DG) has higher mitochondrial respiration (Figure 5-Figure supplement 1F). This suggests that a combination of low glycolysis and high Pi is required for increasing mitochondrial respiration (as elaborated in the discussion section of the manuscript).

      An obvious question that arises out of this observation is how does the change in glycolytic rate regulate mitochondrial Pi transport. One consequence of altering the glycolytic rate is a change in cytosolic pH. This itself will bear on the extent of Pi transport into mitochondria, as discussed in detail below.

      In mitochondria, Pi is co-transported along with protons. Therefore, changes in cytosolic pH (which changes the proton gradient) can control the mitochondrial Pi transport (Hamel et al., 2004). Glycolytic rate is a major factor that controls cytosolic pH. The cytosolic pH in highly glycolytic cells is ~7, and decreasing glycolysis results in cytosolic acidification (Orij et al., 2011). Therefore, under conditions of decreased glycolysis (such as loss of Ubp3), cytosolic pH becomes acidic. Since mitochondrial Pi transport is dependent on the proton gradient, a low cytosolic pH would favour mitochondrial Pi transport. Therefore, under conditions of decreased glycolysis (2DG treatment, or loss of Ubp3), where cytosolic pH would be acidic, increasing cytosolic Pi might indirectly increase mitochondria Pi transport, thereby leading to increased respiration.

      To explain this and integrate all these points, we have extended a discussion section in this manuscript. We include this section below:

      “Supplementing Pi under conditions of low glycolysis (where mitochondrial Pi transport is enhanced), as well as directly supplementing Pi to isolated mitochondria, increases respiration (Figure 5, Figure 5-figure supplement 1). Therefore, in order to derepress mitochondria, a combination of increased Pi along with decreased glycolysis is required. An additional systems-level phenomenon that might regulate Pi transport to the mitochondria is the decrease in cytosolic pH upon decreased glycolysis (60, 61). The cytosolic pH in highly glycolytic cells is ~7, and decreasing glycolysis results in cytosolic acidification (60, 61). Therefore, under conditions of decreased glycolysis (2DG treatment, deletion of Ubp3, and decreased GAPDH activity), cytosolic pH becomes acidic. Since mitochondrial Pi transport itself is dependent on the proton gradient, a low cytosolic pH would favour mitochondrial Pi transport (62). Therefore, under conditions of decreased glycolysis (2DG treatment, or loss of Ubp3, or decreased GAPDH activity), where cytosolic pH would be acidic, increasing cytosolic Pi might indirectly increase mitochondria Pi transport, thereby leading to increased respiration. Alternately, increasing mitochondrial Pi transporter amounts can achieve the same result, as seen by overexpressing Mir1 (Figure 5).”

      This possibility of changes in cytosolic pH regulating mitochondrial Pi transport and thereby respiration is a really interesting future research question, and an idea that has not yet been explored till date. This can stimulate new lines of thinking towards finding conserved biochemical principles that control mitochondrial repression in high glucose.

      “Moreover, they first showed that ∆ubp3 had a Mir1 increase in Fig3A, then showed no changes in FigS4G. It is very confusing”

      increase in Mir1 in ubp3Δ shown in figure 3A comes from the analysis of the proteomics dataset from a previous study (Isasa et al., 2015). Subsequently, we more systematically experimentally assessed Mir1 levels directly, and did not observe an increase in Mir1 (Figure 4figure supplement 1H in revised manuscript). It is entirely possible that in a large-scale study (as in Isasa 2015), some specific proteomic targets might not fully reproduce when tested very specifically (as is described in Handler et al., 2018 and Mehta et al., 2022). We do clearly indicate this in the text, but given the density of information in this study, it is understandable that this point was missed by the reviewer.

      (3) Given that there is no degradation difference for these glycolytic enzymes in ∆ubp3, and the authors found transcriptional level changes, suggests an alternative possibility where ∆ubp3 may signal through unknown mechanisms to parallelly regulate both mitochondrial biogenesis and glycolytic enzyme expression. The increase of trehalose synthesis usually happens in cells under proteostasis stress, so it is important to rule out whether ∆ubp3 signals these metabolic changes via proteostasis dysregulation. This echoes my first point that it is unknown whether wild-type cells use a similar mechanism as ∆ubp3 cells to regulate the glucose repression of mitochondria.

      We appreciate this point raised by the reviewer, but this again requires some clarification (as made earlier). The goal of this study was to identify systems-level principles that explain mitochondrial repression in high glucose. Although we started by performing a screen to identify proteostatic regulators of mitochondrial activity in high glucose, and identified Ubp3 as a mediator of mitochondrial activity, our approach was to use ubp3Δ cells as a model to understand the metabolic principles that regulate mitochondrial repression. This has been reiterated repeatedly in the manuscript – for example lines 123-124 “We therefore decided to use ubp3Δ cells to start delineating requirements for glucose-mediated mitochondrial repression.” and again in the discussion section – lines 442-460, where we discuss some unique advantages of using ubp3Δ cells to understand a general basis of mitochondrial regulation. To test this hypothesis, we also used orthogonal approaches, as well as other mutants and conditions with defective glycolysis, such as tdh2Δtdh3Δ cells and 2DG treatments. Only with these multiple converging evidences do we infer that there might be a role of the change in Pi balance (due to changes in glycolytic rate) in regulating mitochondrial activity.

      We certainly agree that there is great value in identifying the mechanistic details of how Ubp3 regulates mitochondria. But this requires very distinct approaches not pursued in this study. This is not the question that we are addressing in this story. Separately, identifying targets of DUBs is one of the exceptional challenges in biology, since there are currently no straightforward chemicalbiology approaches to do so for this class of proteins. Unlike kinase/phosphatase systems, or even ubiquitin ligases, substrate trapping mutants etc have proven to be abject failures in identifying direct targets of DUBs. A quantitative proteomics study might suggest some proteins/cellular processes regulated by Ubp3. This has been attempted for several DUBs, but rarely have any direct substrates of DUBs every been identified, in any system. A high quality quantitative, descriptive proteome dataset of ubp3Δ cells is already available from a previous study (Isasa et al., 2015), which we cite extensively in this manuscript, and indeed was invaluable for this study. We cannot improve the outstanding quality dataset already available. Interestingly, the findings of this study actually help substantiate our idea of an increased mitochondrial activity and change in Pi homeostasis in ubp3Δ cells. The Isasa et al dataset finds proteins involved in mitochondrial respiration that are high in ubp3Δ cells, and the glycolytic enzymes and PHO regulon proteins are reduced. In our study, using these data references, we were able to conceptually piece together how changes in glycolytic flux can alter Pi balance.

      Apart from identifying changes in protein levels, a separate challenge in making sense of this quantitative proteomics data is the difficulty in pinpointing any target of Ubp3 that specifically regulates these processes. A single DUB can have multiple substrates, and this could regulate the cellular metabolic state in a combinatorial manner. This is the essence of all signaling regulators in how they function, and it is therefore important to understand what their systems-level regulation of cell states are (separate from their specific individual substrates). Therefore, identifying the specific target of Ubp3 responsible for this metabolic rewiring can be very challenging. These experiments are well beyond the scope or interest of the current manuscript.

      If we had pursued that road in this study, we would not have made any general findings related to Pi balance, nor would this more general hypothesis have emerged.

      (4) Other major concerns:

      (a) The authors selectively showed a few proteins in their manuscript to support their conclusion. For example, only Cox2 and Tom70 were used to illustrate mitochondrial biogenesis difference in line 97. Later on, they re-analyzed the previous MS dataset from Isasa et al 2015 and showed a few proteins in Fig3A to support their conclusion that ∆ubp3 increases mitochondrial OXPHOS proteins. However, I checked that MS dataset myself and saw that many key OXPHOS proteins do not change, for example, both ATP1 and ATP2 do not change, which encode the alpha and beta subunits of F1 ATPase. They selectively reported the proteins' change in the direction along with their hypothesis.

      To clarify, we observe an increase in Cox2 protein levels but not in Tom70 levels which suggests that there is no increase in mitochondrial biogenesis. The increase is specific to some respiration related mitochondrial proteins such as Cox2 (Figure 1E, Figure 3A). We have clearly pointed out this in the manuscript. We used Cox2 protein levels as an additional readout for ETC activity, to validate our observations coming from the potentiometric mitotracker readouts, and basal oxygen consumption rate (OCR) measurements. This was for 3 reasons: Cox2 is a mitochondrial genome encoded subunit of the complex IV (cytochrome c oxidase) in the ETC, and has a redox centre critical for the cytochrome c oxidase activity. The biogenesis and assembly of complex IV subunits have been studied with respect to multiple conditions such as glucose availability and hypoxia and the expression and stability of the mitochondrial encoded complex IV subunits are exceptionally well correlated to changes in mitochondrial respiration (Fontanesi et al., 2006). Cox2 is very well characterised in S. cerevisiae, and the commercially available Cox2 antibodies are outstanding, which makes estimating Cox2 levels by western blotting unambiguous and reproducible.

      We re-analyzed the proteomic dataset from Isasa et al to find out additional information regarding the key nodes that are differentially regulated in ubp3Δ. We have not claimed at any point in the manuscript that all OXPHOS related proteins are upregulated in ubp3Δ, nor is there any need for that to be so. We identified Ubp3 from our screen, observed an increase in mitochondrial potential, basal OCR, and Cox2 levels. We later found out that the proteomic data set for ubp3Δ also supports our observations that mitochondrial respiration is upregulated in ubp3Δ. The reviewer points out that we “showed a few proteins in Fig3A to support their conclusion that ∆ubp3 increases mitochondrial OXPHOS proteins”. Our conclusion is that the deletion of Ubp3 increases mitochondrial respiration. The combined readouts which we used to reach this conclusion (OCR, mitochondrial potential, mitochondrial ATP production, Cox2 levels) are far more direct, comprehensive and conclusive than showing an increase in a few proteins related to OXPHOS, as also explained earlier toward a distinct reviewer query. Since different mitochondrial proteins are regulated by different mechanisms, we need not see an increase in all the OXPHOS proteins in a mutant like ubp3Δ where mitochondrial respiration is high. An increase in some key proteins would be sufficient to increase the respiration as seen in our case.

      To summarise, the proteomic dataset supports our observation, but our conclusions are not dependent on the increase in OXPHOS proteins observed in the dataset.

      (b) The authors said they deleted ETC component Cox2 in line 111. I checked their method and table S1, I cannot figure out how they selectively deleted COX2 from mtDNA. This must be a mistake.

      Yes, we understand that for mitochondrially encoded proteins, a simple knock-out strategy has limitations. However, we first tried to generate the Cox2 deletion mutant by a standard PCR mediated gene deletion strategy (Longtine 1998), with the optimistic assumption that even if all Cox2 is not lost, a substantial fraction of the Cox2 genes would be lost via recombination. We selected the transformants after strong antibiotic selection, and then we measured the Cox2 protein levels. Gratifyingly, we found that the mutant strain had substantially decreased Cox2 protein levels (but not a complete loss), and this was retained across generations. The data is shown below.

      Author response image 14.

      Cox2 levels in WT vs Cox2 mutants

      Since the mutants have decreased Cox2 levels, we went ahead and performed growth assays using this strain, in a WT or Ubp3 deletion background. Deletion of Ubp3 in the Cox2 mutant resulted in a more severe growth defect.

      However, we fully agree that this strain is not a complete Cox2 knockout, and it is possible that the decrease in Cox2 is due to modifications in some other unelated gene. In the text, we should also not have named this cox2Δ. Since we are not sure of the exact genetic modification in this mutant, we have removed this data from the revised manuscript.

      Instead, we have now repeated all experiments, utilizing a fully characterised Cox2 mutant -cox262, described in (5) which has defective respiration. In this revised version, we find that deletion of Ubp3 in this strain retains the originally observed severe growth defect in glucose. This is consistent with our conclusion that a functional mitochondria is required for proper growth in ubp3Δ mutant. To separately validate this conclusion, we also utilized a Rho0 strain which does not have mitochondrial DNA and thereby cannot perform mitochondrial respiration. We show that deletion of Ubp3 results in a more severe growth defect in a Rho0 strain. These results are included in the revised manuscript as figure 1-figure supplement 1 I.

      Author response image 15.

      Also, we further confirmed that the Rho0 strain and Rho0 ubp3 strain is incapable of respiration, using seahorse assay. This data is included in the revised manuscript as Figure 4-figure supplement 1G.

      Author response image 16.

      Basal OCR in Rho0 cells

      We hope that these new data address the reviewer’s concerns about the Cox2 mutant.

      (c) They used sodium azide in a lot of assays to inhibit complex IV. However, this chemical is nonspecific and broadly affects many ATPases as well. Not sure why they do not use more specific inhibitors that are commonly used to assay OCR in seahorse.

      We have now performed growth assays for WT and ubp3Δ cells in the presence of specific mitochondrial OXPHOS inhibitors - oligomycin and FCCP. We observe a more severe growth defect in ubp3Δ cells compared to WT cells in the presence of oligomycin and FCCP, similar to the results observed with sodium azide. All these data are now included in the revised manuscript as Figure 1I, Figure1-figure supplement 1H, along with associated text.

      Author response image 17.

      Growth rate in the presence of FCCP

      Author response image 18.

      Figure1-figure supplement 1H- Growth rate in the presence of oligomycin

      We hope that these new data addresses the reviewer’s concerns.

      (d) The authors measured cellular Pi level by grinding the entire cells to release Pi. However, this will lead to a mix of cytosolic and vacuolar Pi. Related to this caveat, the cytosol has ~50mM Pi, while only 1-2mM of these glycolysis metabolites, I am not sure why the reduction of several glycolysis enzymes will cause significant changes in cytosolic Pi levels and make Pi the limiting factor for mitochondrial respiration. One possibility is that the observed cytosolic Pi level changes were caused by the measurement issue mentioned above.

      The Pi estimation shown in figure 3 C, E, F and G is a measure of total Pi in the cells. The vacuole is a major storehouse of phosphate in cells. However, unlike plant cells where free phosphate is stored in vacuoles, yeast vacuoles store phosphate only in the form of polyphosphates (Yang et al., 2017, Hürlimann et al., 2007). The free Pi formed from the hydrolysis of polyphosphate is subsequently transported to cytosol via the exporter Pho91 (Hürlimann et al., 2007). This therefore makes cytosol and mitochondria the major storage of usable free Pi in yeast. Since the malachite green assay that we use for phosphate estimation is specific to free Pi, and not polyphosphate, the Pi estimates that we show in figure 3 come from a combination of cytosolic and mitochondrial Pi. As explained earlier, in order to specifically measure mitochondrial Pi, we have established methods to rapidly isolate mitochondria, and then followed this by estimating Pi in these isolated mitochondria (Figure 4B). Here we clearly see a large increase in mitochondrial Pi in the Ubp3 deletion cells. This allows us to estimate the changes in Pi levels that specific to mitochondria, without relying only on total Pi changes.

      “the cytosol has ~50mM Pi, while only 1-2mM of these glycolysis metabolites, I am not sure why the reduction of several glycolysis enzymes will cause significant changes in cytosolic Pi levels and make Pi the limiting factor for mitochondrial respiration”

      The reviewer has completely missed the fact that the glycolytic rate in yeast is the highest known for any cell. While the steady state levels of glycolytic metabolites might be ~2 mM, the process of glycolysis is not static but is rapid and continuous. Glucose is continuously broken down and converted to pyruvate, along with the consumption of Pi and generation of ATP. This is the reason for the rapid 13C label saturation (within seconds of 13C glucose addition) in yeast cells (Figure 2-figure supplement 1F). This instantaneous label saturation makes accurate flux measurements arduous because of which we had to optimize a method for measuring glycolytic flux in yeast cells (Figure 2-D, Figure 2-figure supplement 1F). Indeed, for that reason, our measurements of glycolytic flux in yeast are the first time this is being reported in the field. This in itself is an enormously challenging experiment, and establishes a new benchmark.

      In highly glycolytic cells, most of the ATP is synthesized via glycolysis and the rate of glycolysis and ATP synthesis is very high. In the reaction catalysed by GAPDH, Pi and ADP is converted to ATP. This ATP formed acts as a Pi donor to most of the Pi consuming reactions in the cells. Some of these processes such a protein translation utilizes ATP, but releases Pi and ADP and this Pi enters the cellular Pi pool. Several other reactions such as nucleotide biosynthesis, polyphosphate biosynthesis and protein phosphorylation use ATP as a Pi donor and the Pi is fixed in biomolecules. Increasing the rates of these ‘Pi sinks’ therefore can result in a decrease in Pi pools. This is a concept we have earlier tried to clarify more elaborately in (Gupta and Laxman, 2021). In fact, increasing nucleotide biosynthesis and polyphosphate synthesis has earlier been suggested to decrease available free Pi (Austin and Mayer 2020, Desfougères et al., 2016). When glycolytic flux is high, this is coupled/tuned to the consumption of Pi which will be correspondingly high due to increased ATP, nucleotide and polyphosphate synthesis. Pi levels rapidly decrease upon glucose addition, due to the continuous Pi consumption during glycolysis (Hohmann et al., 1996, Van Heerden et al., 2014 , Koobs et al., 1972). Therefore, changes in glycolytic rate due to change in glycolytic enzyme levels can result in significant changes in Pi levels due to changes in Pi consumption rate.

      Our results also show that the apart from Pi levels, the glycolytic state can regulate mitochondrial Pi transport as well. This is the reason for mitochondrial Pi levels and basal OCR not increasing merely by adding Pi to cells. We show that basal OCR can be increased by adding Pi in the presence of 2DG. This regulation of mitochondrial Pi transport is a major limiting factor for mitochondrial respiration and could be mediated partly by the regulating of Mir1 levels and also by the changes in the cytosolic pH which regulates the rate of mitochondrial Pi transport. We have discussed these points in the discussion section in our manuscript.

      We hope that this clarifies the reviewer’s concerns regarding how changes in glycolytic rate can regulate changes in cytosolic Pi levels.

      (e) The authors used ∆mir1 and MIR1 OE to show that Pi viability in the mitochondrial matrix is important for mitochondrial activity and biogenesis. This is not surprising as Pi is a key substrate required for OXPHOS activity. I doubt the approach of adding a control to determine whether Pi has a specific regulatory function, while other OXPHOS substrates, like ADP, O2 etc do not have the same effect.

      To clarify, we only used the mir1Δ cells to understand the requirement for Pi transport from cytosol to mitochondria in controlling respiration. The reviewer is correct in stating that deletion of Mir1 would reduce Pi import to mitochondria and thereby inhibit respiration. This is exactly the conclusion we suggest from this experiment as stated in the manuscript – “These data suggest that mitochondrial Pi transport (via Mir1) is critical for maintaining basal mitochondrial activity even in high glucose”. We have only used these experiments to support the idea that even though glycolysis and mitochondria are in different compartments, a change in Pi balance in one compartment (cytosol) can affect Pi levels in the other (mitochondria) since there is Pi transport between these two compartments. Since mitochondria has its own polyphosphate reserves, in the absence of these experiments with mir1Δ cells it can be imagined that mitochondria PolyP can be an additional source of Pi to support respiration, and therefore changes in cytosolic Pi may have only a minor effect on mitochondrial respiration. Our experiments with mir1Δ and Mir1-OEcells indubitably suggest that Pi transport to mitochondria from cytosol is important for respiration, and therefore changes in cytosolic Pi levels (or maintaining cytosolic Pi at a lower level due to the rate of glycolysis) will have rippling effects in mitochondrial Pi availability. Further, these data suggest that for example under glycolytic inhibition (low glucose, or 2DG), while all factors (signalling, substrate availability etc) favour respiration (and mitochondrial derepression), cells cannot unable to achieve this in the absence of ample Pi transport from cytosol. This therefore places Pi at the centre stage in controlling mitochondrial respiration.

      We conclude that Pi is a major, but not the only constraint for mitochondrial respiration. There certainly could be a role for ADP, oxygen availability etc in regulating respiration. However, these are beyond the scope of our study. We have discussed about the potential role of ADP in regulating mitochondrial repression in the discussion section. “An additional consideration is the possible contribution of changes in ADP in regulating mitochondrial activity, where the use of ADP in glycolysis might limit mitochondrial ADP. Therefore, when Pi changes as a consequence of glycolysis, it could be imagined that a change in ADP balance can coincidentally occur. However, prior studies show that even though cytosolic ADP decreases in the presence of glucose, this does not limit mitochondrial ADP uptake, or decrease respiration, due to the very high affinity of the mitochondrial ADP transporter.”

      We hope that this clarifies the reviewer’s concerns regarding the use of Mir1 OE and mir1Δ strains.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Some of the experiments should be repeated in other strain backgrounds for reproducibility and rigor.

      As discussed in the response to point number 1, we have now utilized multiple strains of S. cerevisiae to test our findings. We now find that our discoveries regarding the role of altered Pi budgeting as a constraint for mitochondrial respiration, and the role of Ubp3 as a regulator of mitochondrial repression are conserved across multiple genetic backgrounds of S. cerevisiae. These results are included in the revised manuscript as a new figure- Figure 6 and associated text. We used the W303, Σ1278 and BY4742 strains of S. cerevisiae to show that deletion of Ubp3 increases mitochondrial activity (as shown by increased mitochondrial membrane potential and increased Cox2 levels). Using the W303 strain we show that the deletion of Ubp3 increases Pi levels and that the increased Pi is necessary for increasing mitochondrial respiration (Figure 6C, D). These added experiments have substantially broadened the generality of our findings.

      The number of biological repeats needs to be increased in all experiments.

      We have increased the number of biological repeats in key experiments that shows that the increased Pi levels are necessary for the increased mitochondrial respiration in ubp3Δ and tdh2Δtdh3Δ cells (revised Figure 4F). Apart from a few basal OCR measurements and mitotracker data in supplementary figure, all our experiments are performed for 3 biological repeats. In case of basal OCR measurements, yeast cells have to be aliquoted to poly-L-lysine coated seahorse plates and centrifuged to ensure that the cells are properly settled. This is due to the non-adherent nature of yeast cells. During the centrifugation step, the wells in the two end rows cannot be utilized due to uneven settling of cells which affects the basal OCR readings in these wells. In case of several experiments that involve multiple samples, we were therefore limited to restrict the number of biological replicates to 2 (repeated independently), so that all samples could be accommodated in the plate.

      Full western blot images should be supplemented along with the other data.

      The complete western blot images are now included in the revised manuscript as supplementary file 2.

      TCA cycle flux should be analyzed and presented in the study to conclude some of the findings.

      As discussed in detail in the response to point number 2, we have performed steady state and flux measurements for TCA cycle intermediates. This data is now included as a new supplement figure- Figure 2-figure supplement 2.

      Reviewer #2 (Recommendations For The Authors):

      (1) In Fig. 2A, they should also include the gluconeogenesis enzymes (fructose 1,6 bisphosphatase, PEP carboxykinase, and pyruvate carboxylase) to exclude the possibility that glycolytic intermediates are not rerouted to gluconeogenesis.

      We measured the protein levels of Fbp1 (fructose 1,6 bisphosphatase) and Pck1 (PEP carboxykinase). We observed an increase in the protein levels in both enzymes in ubp3Δ. The data is shown below.

      Author response image 19.

      Fbp1 and Pck1 protein levels

      While we agree that this is an interesting observation which might help us in understanding the metabolic rewiring in ubp3Δ, we have not included this data in the current revised version of the manuscript due to two main reasons.

      (1) Since ubp3Δ cells have a defective glycolysis and therefore a defective glucose repression, the mRNA and protein levels of gluconeogenic enzymes which are usually glucose-repressed might increase. This might be a response at the level of transcription and translation of these enzymes and might or might not change the rate of gluconeogenesis in these cells. This is because of multiple other factors that regulate gluconeogenic flux such as allostery, mass action etc. Therefore, to avoid confounding our main points and since we cannot make a conclusive assumption on the gluconeogenic metabolism in these mutants, we don’t include this data. The primary focus of our story is the mitochondrial repression component. Understanding the feedback controls that alter gluconeogenesis in these mutants is beyond the scope of this study and could be addressed in a separate future study.

      (2) As we highlight extensively in the response letter and in the manuscript, our aim is not to understand the specific mechanistic role of Ubp3. In this manuscript, we identify the conserved constraints that control mitochondrial repression without focusing just on the role of Ubp3 in regulating this. Whether Ubp3 regulates gluconeogenesis is a question that could be addressed in a future study that focuses on identifying the altered signalling mechanisms in ubp3Δ and the targets of Ubp3.

      (2) In line 292, page 10, there is a typo "dermine".

      We apologize for this mistake. Corrected.

      (3) In Figure 5A, is there a reason why they chose 0.1% glucose condition as a low glucose condition? Also, is there a dose-dependent change in OCR or other mitochondrial functions according to the concentration of glucose?

      The glucose concentration of 0.1% was selected to decrease (but not completely remove) the available glucose. 0.1% glucose is considered as a standard low glucose condition in S. cerevisiae (Yin et al., 2003) and the effect of this glucose concentration on cellular processes has been extensively studied (Yin et al., 2003, Takeda et al., 2015 etc). <0.2% glucose is the critical threshold for activating respiratory metabolism (Takeda et al., 2015) and shifting cells to 0.1% glucose in our experiments will activate respiration, as we show in our data. However, this is very different from completely removing glucose or using an alternate carbon source such as ethanol, because this would result in full activation of gluconeogenesis. We further find that when cells are grown in ethanol, the gluconeogenic activation will also change the Pi homeostasis. This will in part be a result of the fully reversed direction of the GAPDH catalysed reaction (Figure 3G). If such a condition is used, it could lead to misinterpretations, and confound the conclusions that we make from these set of experiments where Pi homeostasis play a major role. In 0.1% glucose it has been shown that gluconeogenesis is still partly repressed (Yin et al., 2003). The pathways utilizing alternate carbon sources still remain repressed (even though to a lower extend compared to 2% glucose) in 0.1% glucose (Yin et al., 2003). We hope that this clarifies the concerns regarding the rationale behind using 0.1% glucose in our experiments.

      The extent of glucose repression is dependent on the concentration of glucose. Glucose concentration >1% has been shown to activate degradation of mRNAs involved in alternate carbon utilization. Different signaling pathways involved in growth under glucose and glucose repression is regulated by glucose concentration. This is discussed in detail in Yin et al., 2002. We (Figure 5figure supplement 1A) also observe a dose dependent increase in mitochondrial membrane potential in the presence of 2DG. This also suggests that the rate of glycolysis (which could be also mediated by changes in glucose concentration) can regulate the extent of mitochondrial derepression.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We would like to thank the reviewers for their constructive feedback and overall positive response to our manuscript. Reviewer #1 had no specific recommendations, so below we address Reviewer #2’s comments.

      Reviewer #2 (Recommendations For The Authors):

      Specific points

      (1) In Fig. 1H peptides selected are much more stable than the positive control KHN-FT, but they appear to be less stable than randomly selected 5 amino acid sequences. Are the differences

      between the randomly selected sequences and the selected sequences statistically significant.

      Thank you for the feedback. Yes, the differences are statistically significant by one-way ANOVA and the Tukey’s multiple comparisons tests, we’ve updated the figure legend to indicate this fact.

      (2) In Fig. 1I the FACS profile of 4x looks like that of KHN, but it is very difficult to see in the figure. Looking at the quantitation in Fig. 1J it is impossible to compare KHN with 4x as the KHN is on the baseline. Could this be improved by using a log scale to present the data.

      Thank you for pointing this out. We’ve improved the figure so the KHN is easier to see. In addition, we’ve attempted different way to display these results, but settled on scaling the data between 1 and 0 as our comparison points. We’ve updated the main figure to more clearly show this result so the KHN is easier to compare.

      (3) In Fig. 2G and Fig. 2F don't really match up. It looks from Fig. 2G like there is still some degradation in the hrd1 deletion strain, but this is not reflected in the quantitation (Fig. 2H).

      To our eyes, the degradation in a hrd1null appears to be quite small, which seems to be reflected in the quantification (~20% decrease over 90 minutes). We included the figure in Author response image 1 for quick comparison.

      Author response image 1.

      (4) Throughout the paper the authors claim that the proteins are degraded by a cytosolic proteasome. I agree that the proteins are degraded via the proteasome, but I don't see any evidence that it is cytosolic.

      Thank you for pointing this out. We’ve adjusted the text to reflect the fact that the proteasomal degradation is not necessarily in the cytosol.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Reviews):  

      First, the metabolic network in this study is incomplete. For example, amino acid synthesis and lipid synthesis are important for biomass and growth, but they4 are not included in the three models used in this study. NADH and NADPH are as important as ATP/ADP/AMP, but they are not included in the models. In the future, a more comprehensive metabolic and biosynthesis model is required.  

      Thank you for the critical comment on the weakness of the present study. We actually tried to study a larger model like Turnborg et al (2021), which is a model of JCVI-syn3A, but we give up to include it in our model list to study in depth. This is because we noticed that the concentration of ATP in the model can be negative (we confirmed this with one of the authors of the paper). Another "big" kinetic model of metabolism that we could list would be Khodayari et al (2017). However, we could not find the models to compare the dynamics of this big model with. Therefore, we decided to use the model only for the central carbon metabolism for now. We would like to leave a more extended study for the near future.  

      We would like to mention that NADH and NADPH are included in Khodayari model and Boecker model, while NADH and NADPH are ramped up to NADH in the latter model.  

      Second, this work does not provide a mathematical explanation of the perturbation response χ. Since the perturbation analysis is performed close to the steady state (or at least belongs to the attractor of single-steady-state), local linear analysis would provide useful information. By complementing with other analysis in dynamical systems (described below) we can gain more logical insights about perturbation response.  

      We tried a linear stability analysis. However, with the perturbation strength we used here, the linearization of the model is no longer valid, in the sense that the linearized model

      leads to negative concentrations of the metabolites (xst+Δx < 0 for some metabolites). We have added a scatter plot of the response coefficient of trajectories sharing the initial condition, while the dynamics are computed by the original model and the linearized model, respectively. (Fig. S1). 

      Since the response coefficient is based on the logarithm of the concentrations, as the metabolite concentrations approach zero, the response coefficient becomes larger. The high response coefficient in the Boecker and Chassagnole model would be explained by this artifact.  The linearized Khodayari model shows either χ~1 or χ = 0 (one or more metabolite concentrations become negative). This could be due to the number of variables in the model. For the response coefficient to have a larger value, the perturbation should be along the eigenvector that leads to oscillatory dynamics with long relaxation time (i.e., the corresponding eigenvalue has a small real part in terms of absolute value and a non-zero imaginary part). However, since the Khodayari model has about 800 variables, if perturbations are along such directions, there is a high probability that one or more metabolite concentrations will become negative.

      We fully agree that if the perturbations on the metabolite concentrations are in the linear regime, the response to the perturbations can be estimated by checking the eigenvalues and eigenvectors. However, we would say that the relationship between the linearized model (and thus the spectrum of eigenvalues) and the original model is unclear in this regime.  We remarked this in Lines 158160.

      Recommendations for the authors:

      My major suggestion is about understanding the key quantity in this study: the response coefficient χ. When the perturbed state is close to the fixed point, one could adopt local stability analysis and consider the linearized system. For a linear system with one stable fixed point P, we consider the Jacobian matrix M on P. If all eigenvalues of M are real and negative, the perturbed trajectory will return to P with each component monotonically varies. If some eigenvalues have negative real part and nonzero imaginary part, then the perturbed trajectory will spiral inward to the fixed point. Depending on the spiral trajectory and the initially perturbed state, some components would deviate furthermore (transiently) from the fixed point on the spiral trajectory. This explains why the response coefficient χ can be greater than 1. 

      Mathematically, a locally linearized system has similar behavior to the linear system, and the examples in this study can be analyzed in the similar way. Specifically, if a system has many complex eigenvalues, then the perturbed trajectory is more likely to have further deviation. The metabolic network models investigated in this work are not extremely large, and hence the author could analyze its spectrum of the Jacobian matrix at the steady state. Since the steady state is stable, I expect the spectrum located in the left half of the complex plane. If the spectrum spread out away from the real axis, we expect to see more spiral trajectories under perturbation. I think the spectrum analysis will provide a complementary view with respect to analysis on χ.  The authors' major findings, about the network sparsity and cofactors, can also be investigated under the framework of the spectrum analysis.  

      Of course, when the nonlinear system is perturbed far away from the fixed point, there are other geometrical properties of the vector field that can cause the response coefficient χ to be greater than 1. This could also be investigated in the future by testing the behavior of small and large perturbations and observing if the systems have signatures of nonlinearity.  

      Since all perturbed states return to the steady state, the eigenvalues of the Jacobi matrix accompanying the linearized system around the steady state are in the left half complex plane (negative real value). Also, some eigenvalues have non-zero imaginary parts.    

      The reason we emphasize the "nonlinear regime" is that the linearization is no longer valid in this regime, i.e. the metabolite concentrations can be negative when we calculate the linearized system. Certainly, there are complex eigenvalues in the Jacobi matrix of any model. However, we would say that there is no clear relationship between the eigenvalues and the response coefficient.      

      Minor suggestions:  

      Line 127: Regarding the source of perturbation, cell division also generates unequal concentration of proteins and metabolites for two daughter cells, and it is an interesting mechanism to create metabolic perturbation. 

      Thank you for the insightful suggestion. We mentioned the cell division as another source of perturbation (Lines 130-131).

      Line 175: I do not quite understand the statement "fixing each metabolite concentration...", since the metabolite concentration in the ODE simulation would change immediately after this fixing.  

      We meant in the sentence that we fixed the concentration of the selected metabolite as the steady state concentration and set the dx/dt of that metabolite to zero. We have rewritten the sentences to avoid confusion (Lines 180-181).

      Figure 2: There are a lot of inconsistencies between the three models. Could we learn which model is more reasonable, or the conclusion here is that the cellular response under perturbation is model-specific? The latter explanation may not be quite satisfactory since we expect the overall cellular property should not be sensitive to the model details. 

      Ideally, the overall cellular property should be insensitive to model details. However, the reality is that the behavior of the models (e.g., steady-state properties, relaxation dynamics, etc.) depends on the specific parameter choices, including what regulation is implemented. I think this situation is part of the motivation for the ensemble modeling (by J. Liao and colleague) that has been developed.  

      Detailed responsiveness would be model specific. For example, FBP has a fairly strong effect in the Boecker model, but less so in the Khodayari model, and the opposite effect in the Chassagnole model (Fig. 2). Our question was whether there are common tendencies among kinetic models that tend to show model-specific behavior.  

      Reviewer 2 (Public Review):

      (1) In the study on determining key metabolites affecting responses to perturbations (starting from line 171), the authors fix the values of individual concentrations to their steady-state values and observe the responses. Such a procedure adds artificial constraints to the network because, in the natural responses of cells (and models) to perturbations, it is highly unlikely that metabolites will not evolve in time. By fixing the values of specific metabolites, the authors prohibit the metabolic network from evolving in the most optimal way to compensate for the perturbation. Instead of this procedure, have the authors considered for this task applying techniques from variance-based sensitivity analysis (Sobol, global sensitivity analysis), where they can calculate the first-order sensitivity index and total effect index? Using this technique, the authors would be able to determine the key metabolites while allowing for metabolic responses to perturbations without unnatural constraints. 

      Thank you for the useful suggestion for studying the roles of each metabolite for responsiveness. We have computed the total sensitivity index (Homma and Salteli, 1996) for each metabolite of each model (Fig.S5). The total sensitivity indices of ATP are high-ranked in Khodayari- and Chassagnole model, while it is middle-ranked in the Boecker model. We believe that the importance of the adenyl cofactors is highlighted also in terms of the Sobol’ sensitivity analysis (the figure is referred in Lines 193-195). 

      We have encountered a minor difficulty for computing the sensitivity index. For the computation of the sensitivity index, we need to carry out the following Monte Carlo integral, 

      where the superscript (m) is the sample number index. The subscript i represents the ith element of the vector x, and ~i represents the vector x except for the ith element. The tilde stands for resampling.  

      There are several conserved quantities in each model. For independent resampling, we need to deal with the conserved quantities. For the Boecker and Chassagnole models, we picked a single metabolite from each conservation law and solved its concentration algebraically to make the metabolite concentration the dependent variable. Then, we can resample the metabolite concentration of one metabolite without changing the concentrations of other metabolites, which are independent variables.  

      However, in the Khodayari model, it was difficult to solve the dependent variables because the model has about 800 variables. Therefore, we gave up the computations of the sensitivity indices of the metabolites whose concentration is part of any conserved quantities, namely NAD, NADH, NADP, NADPH, Q8, and Q8H2.

      (2) To follow up on the previous remark, the authors state that the metabolites that augment the response coefficient when their concentration is fixed tend to be allosteric regulators. The authors should report which allosteric regulations are implemented in each of the models so that one can compare against Figure 2. Again, the effect of allosteric regulation by a specific metabolite that is quantified the way the authors did is biased by fixing the concentration value - it is true that negative feedback is broken when the metabolite concentration is fixed, however, in the rate law, there is still the fixed inhibition term with its value corresponding to the inhibition at the steady state. To see the effect of allosteric regulation by a metabolite, one can change the inhibition constants instead of constraining the responses with fixed concentrations.  

      We have listed the substrate-level regulations (Table S1-3). Also, we re-ran the simulation with reduced the effect of the substrate-level regulations for the reactions that are suspected to influence the change of the response coefficient. Instead of fixing the concentrations (Fig. S6). 

      The impact of substrate-level regulations is discussed in Lines 203-212.   

      We replaced "allosteric regulation" with "substrate-level regulation" because we noticed that some regulations are not necessarily allosteric.

      (3) Given the role of ATP in metabolic processes, the authors' finding of the sensitivity of the three networks' responses to perturbations in the AXP concentrations seems reasonable. However, drawing such firm conclusions from only three models, with each of them built around one steady state and having one kinetic parameter set despite that they were built for different physiologies, raises some questions. It is well-known in studies related to basins of attraction of the steady states that the nonlinear responses also depend on the actual steady states, the values of kinetic parameters, and implemented kinetic rate law, i.e., not only on the topology of the underlying systems. In the population of only three models, we cannot exclude the possibility of overlaps and strong similarities in the values of kinetic parameters, steady states, and enzyme saturations that all affect and might bias the observed responses. Ideally, to eliminate the possibility of such biases, one should simulate responses of a large population of models for multiple physiologies (and the corresponding steady states) and multiple parameter sets per physiology. This can be a difficult task, but having more kinetic models in this work would go a long way toward more convincing results. Recently, E. coli nonlinear kinetic models from several groups appeared that might help in this task, e.g., Haiman et al., PLoS Comput Biol, 17(1): e1008208, (2021), Choudhury et al., Nat Mach Intell, 4, 710-719, (2022); Hu et al., Metab Eng, 82, 123-133 (2024), Narayanan et al., Nat Commun, 15:723, (2024). 

      We have computed the responsiveness of 215 models generated by the MASSpy package (Haiman et al, 2021). Several model realizations showed a strong responsiveness, i.e. a broader distribution of the response coefficient (Fig.S8), and mentioned in Lines 339-341.

      We would like to mention that the three models studied in the present manuscript have limited overlap in terms of kinetic rate law and, accordingly, parameter values. In the Khodayari model, all reactions are bi-uni or uni-uni reactions implemented by mass-action kinetics, while the Boecker and Chassagnole models use the generalized Michaelis-Menten type rate laws. Also, the relationship between the response coefficients of the original model and the linearized model highlights the differences between the models (Fig. S1). If the models were somewhat effectively similar, the scatter plots of the response coefficient of the original- and linearized model should look similar among the three models. However, the three panels show completely different trends. Thus, the three models have less similarity even when they are linearized around the steady states. 

      (4) Can the authors share their insights on what could be the underlying reasons for the bimodal distribution in Figure 1E? Even after adding random reactions, the distribution still has two modes - why is that?  

      We have not yet resolved why only the Khodayari model shows the bimodal distribution of the response coefficient. However, by examining the time courses, the dynamics of the Khodayari model look like those of the excitable systems. This feature may contribute to the bimodal distribution of the response coefficient. In the future, we would like to show whether the system is indeed the excitable system and whcih reactions contribute to such dynamics.

      (5) Considering the effects of the sparsity of the networks on the perturbation responses (from line 223 onwards), when we compare the three analyzed models, it is clear that the Khodayari et al. model is a superset of the other two models. Therefore, this model can be considered as, e.g., Chassagnole model with Nadd reactions (though not randomly added). Based on Figures 1b and S2, one can observe that the responses of the Khodayari models have stronger responses, which is exactly opposite to the authors' conclusion that adding the reactions weakens the responses.

      The authors should comment on this.  

      The sparsity of the network is defined by the ratio of the number of metabolites to the number of reactions. Note that the Khodayari model is a superset of the Boecker and Chassagnole models in terms of the number of reactions, but also in terms of the number of metabolites (Boecker does not have the pentose phosphate pathway, Chassagnole does not have the TCA cycle, and neither has oxyative phosphorylation). Thus, even if we manually add reactions to the Boecker model, for example, we cannot obtain a network that is equivalent to the Khodayari model.  We added one sentence to clarify the point (Lines 254-255).

      Recommendations for the authors: 

      (1) Some typos: Line 57, remove ?; Line 134, correct "relaxation". 

      Thank you for pointing out. We fixed the typos.

      (2) Lines 510-515, please rewrite/clarify, it is confusing what are you doing. 

      We rewrote the sentences (Lines 529-532). We are sorry for the confusion.

      (3) Line 522, where are the expressions above Leq and K*? 

      Leq appears in the original paper of the Boecker model, but we decided not to use Leq. We apologize for not removing Leq from the present manuscript. The * in K* is the wildcard for representing the subscripts. We added a description for the role of “*”. 

      (4) Lines 525-530, based on the wording, it seems like you test first for 128 initial concentrations if the models converge back to the steady state and then you generate another set of 128 initial concentrations - is this what you are doing, or you simply use the 128 initial concentrations that have passed the test? 

      We apologize for the confusion. We did the first thing. We have rewritten the sentence to make it clearer. 

      (5) Figure 3, caption, by "broken line," did the authors mean "dashed line"? 

      We meant dashed line. We changed “broken line” to “dashed line”.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      I applaud the authors' for providing a thorough response to my comments from the first round of review. The authors' have addressed the points I raised on the interpretation of the behavioral results as well as the validation of the model (fit to the data) by conducting new analyses, acknowledging the limitations where required and providing important counterpoints. As a result of this process, the manuscript has considerably improved. I have no further comments and recommend this manuscript for publication.

      We are pleased that our revisions have addressed all the concerns raised by Reviewer #1.

      Reviewer #2 (Public review):

      Summary:

      This manuscript proposes that the use of a latent cause model for assessment of memory-based tasks may provide improved early detection in Alzheimer's Disease as well as more differentiated mapping of behavior to underlying causes. To test the validity of this model, the authors use a previously described knock-in mouse model of AD and subject the mice to several behaviors to determine whether the latent cause model may provide informative predictions regarding changes in the observed behaviors. They include a well-established fear learning paradigm in which distinct memories are believed to compete for control of behavior. More specifically, it's been observed that animals undergoing fear learning and subsequent fear extinction develop two separate memories for the acquisition phase and the extinction phase, such that the extinction does not simply 'erase' the previously acquired memory. Many models of learning require the addition of a separate context or state to be added during the extinction phase and are typically modeled by assuming the existence of a new state at the time of extinction. The Niv research group, Gershman et al. 2017, have shown that the use of a latent cause model applied to this behavior can elegantly predict the formation of latent states based on a Bayesian approach, and that these latent states can facilitate the persistence of the acquisition and extinction memory independently. The authors of this manuscript leverage this approach to test whether deficits in production of the internal states, or the inference and learning of those states, may be disrupted in knock-in mice that show both a build-up of amyloid-beta plaques and a deterioration in memory as the mice age.

      Strengths:

      I think the authors' proposal to leverage the latent cause model and test whether it can lead to improved assessments in an animal model of AD is a promising approach for bridging the gap between clinical and basic research. The authors use a promising mouse model and apply this to a paradigm in which the behavior and neurobiology are relatively well understood - an ideal situation for assessing how a disease state may impact both the neurobiology and behavior. The latent cause model has the potential to better connect observed behavior to underlying causes and may pave a road for improved mapping of changes in behavior to neurobiological mechanisms in diseases such as AD.

      The authors also compare the latent cause model to the Rescorla-Wagner model and a latent state model allowing for better assessment of the latent cause model as a strong model for assessing reinstatement.

      Weaknesses:

      I have several substantial concerns which I've detailed below. These include important details on how the behavior was analyzed, how the model was used to assess the behavior, and the interpretations that have been made based on the model.

      (1) There is substantial data to suggest that during fear learning in mice separate memories develop for the acquisition and extinction phases, with the acquisition memory becoming more strongly retrieved during spontaneous recovery and reinstatement. The Gershman paper, cited by the authors, shows how the latent causal model can predict this shift in latent causes by allowing for the priors to decay over time, thereby increasing the posterior of the acquisition memory at the time of spontaneous recovery. In this manuscript, the authors suggest a similar mechanism of action for reinstatement, yet the model does not appear to return to the acquisition memory after reinstatement, at least based on the simulation and examples shown in figures 1 and 3. More specifically, in figure 1, the authors indicate that the posterior probability of the latent cause,z<sub>A</sub> (the putative acquisition memory), increases, partially leading to reinstatement. This does not appear to be the case as test 3 (day 36) appears to have similar posterior probabilities for z<sub>A</sub> as well as similar weights for the CS as compared to the last days of extinction. Rather, the model appears to mainly modify the weights in the most recent latent cause, z<sub>B</sub> - the putative the 'extinction state', during reinstatement. The authors suggest that previous experimental data have indicated that spontaneous recovery or reinstatement effects are due to an interaction of the acquisition and extinction memory. These studies have shown that conditioned responding at a later time point after extinction is likely due to a balance between the acquisition memory and the extinction memory, and that this balance can shift towards the acquisition memory naturally during spontaneous recovery, or through artificial activation of the acquisition memory or inhibition of the extinction memory (see Lacagnina et al. for example). Here the authors show that the same latent cause learned during extinction, z<sub>B</sub>, appears to dominate during the learning phase of reinstatement, with rapid learning to the context - the weight for the context goes up substantially on day 35 - in z<sub>B</sub>. This latent cause, z<sub>B</sub>, dominates at the reinstatement test, and due to the increased associative strength between the context and shock, there is a strong CR. For the simulation shown in figure 1, it's not clear why a latent cause model is necessary for this behavior. This leads to the next point.

      We would like to first clarify that our behavioral paradigm did not last for 36 days, as noted by the reviewer. Our reinstatement paradigm contained 7 phases and 36 trials in total: acquisition (3 trials), test 1 (1 trial), extinction 1 (19 trials), extinction 2 (10 trials), test 2 (1 trial), unsignaled shock (1 trial), test 3 (1 trial). The day is labeled under each phase in Figure 2A. 

      We have provided explanations on how the reinstatement is explained by the latent cause model in the first round of the review. Briefly, both acquisition and extinction latent causes contribute to the reinstatement (test 3). The former retains the acquisition fear memory, and the latter has the updated w<sub>context</sub> from unsignaled shock. Although the reviewer is correct that the z<sub>B</sub> in Figure 1D makes a great contribution during the reinstatement, we would like to argue that the elevated CR from test 2 (trial 34) to test 3 (trial 36) is the result of the interaction between z<sub>A</sub> and z<sub>B</sub>.

      We provided Author response image 1 using the same data in Figure 1D and 1E to further clarify this point. The posterior probability of z<sub>A</sub> increased after an unsignaled shock (trial 35), which may be attributed to the return of acquisition fear memory. The posterior probability of z<sub>A</sub> then decreased again after test 3 (trial 36) because there was no shock in this trial. Along with the weight change, the expected shock change substantially in these three trials, resulting in reinstatement. Note that the mapping of expected shock to CR in the latent cause model is controlled by parameter θ and λ. Once the expected shock exceeds the threshold θ, the CR will increase rapidly if λ is smaller.

      Lastly, accepting the idea that separate memories are responsible for acquisition and extinction in the memory modification paradigm, the latent cause model (LCM) is a rational candidate modeling this idea. Please see the following reply on why a simple model like the Rescorla-Wagner (RW) model is not sufficient to fully explain the behaviors observed in this study.

      Author response image 1.

      The sum posterior probability (A), the sum of associative weight of CS (B), and the sum of associative weight of context (C) of acquisition and extinction latent causes in Figure 1D and 1E.

      (2) The authors compared the latent cause model to the Rescorla-Wagner model. This is very commendable, particularly since the latent cause model builds upon the RW model, so it can serve as an ideal test for whether a more simplified model can adequately predict the behavior. The authors show that the RW model cannot successfully predict the increased CR during reinstatement (Appendix figure 1). Yet there are some issues with the way the authors have implemented this comparison:

      (2A) The RW model is a simplified version of the latent cause model and so should be treated as a nested model when testing, or at a minimum, the number of parameters should be taken into account when comparing the models using a method such as the Bayesian Information Criterion, BIC.

      We acknowledge that the number of parameters was not taken into consideration when we compared the models. We thank the reviewer for the suggestion to use the Bayesian Information Criterion (BIC). However, we did not use BIC in this study for the following reasons. We wanted a model that can explain fear conditioning, extinction and reinstatement, so our first priority is to fit the test phases. Models that simulate CRs well in non-test phases can yield lower BIC values even if they fail to capture reinstatement. When we calculate the BIC by using the half normal distribution (μ = 0, σ \= 0.3) as the likelihood for prediction error in each trial, the BIC of the 12-month-old control is -37.21 for the RW model (Appendix 1–figure 1C) and -11.60 for the LCM (Figure 3C). Based on this result, the RW model would be preferred, yet the LCM was penalized by the number of parameters, even though it fit better in trial 36. Because we did not think this aligned with our purpose to model reinstatement, we chose to rely on the practical criteria to determine whether the estimated parameter set is accepted or not for our purpose (see Materials and Methods). The number of accepted samples can thus roughly be seen as the model's ability to explain the data in this study. These exclusion criteria then created imbalances in accepted samples across models (Appendix 1–figure 2). In the RW model, only one or two samples met the criteria, preventing meaningful statistical comparisons of BIC within each group. Overall, though we agreed that BIC is one of the reasonable metrics in model comparison, we did not think it aligns with our purpose in this study.

      (2B) The RW model provides the associative strength between stimuli and does not necessarily require a linear relationship between V and the CR. This is the case in the original RW model as well as in the LCM. To allow for better comparison between the models, the authors should be modeling the CR in the same manner (using the same probit function) in both models. In fact, there are many instances in which a sigmoid has been applied to RW associative strengths to predict CRs. I would recommend modeling CRs in the RW as if there is just one latent cause. Or perhaps run the analysis for the LCM with just one latent cause - this would effectively reduce the LCM to RW and keep any other assumptions identical across the models.

      Regarding the suggestion to run the analysis using the LCM with one latent cause, we agree that this method is almost identical to the RW model, which is also mentioned in the original paper (Gershman et al., 2017). Importantly, it would also eliminate the RW model’s advantage of assigning distinct learning rates to different stimuli, highlighted in the next comment (2C).

      We thank the reviewer for suggesting applying the transformation of associative strength (V) to CR as in the LCM. We examined this possibility by heuristically selecting parameter values to test how such a transformation would influence the RW model (Author response image 2A). Specifically, we set α<sub>CS</sub> = 0.5, α<sub>context</sub> \= 1, β = 1, and introduced the additional parameters θ and λ, as in the LCM. This parameter set is determined heuristically to address the reviewer’s concern about a higher learning rate of context. The dark blue line is the plain associative strength. The remaining lines are CR curves under different combinations of θ and λ.

      Consistent with the reviewer’s comment, under certain parameter settings (θ \= 0.01, λ = 0.01), the extended RW model can reproduce higher CRs at test 3, thereby approximating the discrimination index observed in the 12-month-old control group. However, this modification changes the characteristics of CRs in other phases from those in the plain RW model. In the acquisition phase, the CRs rise more sharply. In the extinction phase, the CRs remain high when θ is small. Though changing λ can modulate the steepness, the CR curve is flat on the second day of the extinction phase, which does not reproduce the pattern in observed data (Figure 2B). These trade-offs suggest that the RW model with the sigmoid transformation does not improve fit quality and, in fact, sacrifices features that were well captured by simpler RW simulations (Appendix 1–figure 1A to 1D). To further evaluate this extended RW model (RW*), we applied the same parameter estimation method used in the LCM for individual data (see Materials and Methods). For each animal, α<sub>CS</sub>, α<sub>context</sub>, β, θ, and λ were estimated with their lower and upper bounds set as previously described (see Appendix 1, Materials and Methods). The results showed that the number of accepted samples slightly increased compared to the RW model without sigmoidal transformation of CR (RW* vs. RW in Author response image 2B, 2C). However, this improvement did not surpass the LCM (RW* vs. LCM in Author response image 2B, Author response image 1C). Overall, these results suggest that while using the same method to map the expected shock to CR, the RW model does not outperform the LCM. Practically, further extension, such as adding novel terms, might improve the fitting level. We would like to note that such extensions should be carefully validated if they are reasonable and necessary for an internal model, which is beyond the scope of this study. We hope this addresses the reviewer's concerns about the implementation of the RW model. 

      Author response image 2.

      Simulation (A) and parameter estimation (B and C) in the extended Rescorla-Wagner model.

      (2C) In the paper, the model fits for the alphas in the RW model are the same across the groups. Were the alphas for the two models kept as free variables? This is an important question as it gets back to the first point raised. Because the modeling of the reinstatement behavior with the LCM appears to be mainly driven by latent cause z<sub>B</sub>, the extinction memory, it may be possible to replicate the pattern of results without requiring a latent cause model. For example, the 12-month-old App NL-G-F mice behavior may have a deficit in learning about the context. Within the RW model, if the alpha for context is set to zero for those mice, but kept higher for the other groups, say alpha_context = 0.8, the authors could potentially observe the same pattern of discrimination indices in figure 2G and 2H at test. Because the authors don't explicitly state which parameters might be driving the change in the DI, the authors should show in some way that their results cannot simply be due to poor contextual learning in the 12 month old App NL-G-F mice, as this can presumably be predicted by the RW model. The authors' model fits using RW don't show this, but this is because they don't consider this possibility that the alpha for context might be disrupted in the 12-month-old App NL-G-F mice. Of course, using the RW model with these alphas won't lead to as nice of fits of the behavior across acquisition, extinction, and reinstatement as the authors' LCM, the number of parameters are substantially reduced in the RW model. Yet the important pattern of the DI would be replicated with the RW model (if I'm not mistaken), which is the important test for assessment of reinstatement.

      We would like to clarify that we estimated three parameters in the RW model for individuals:  α<sub>CS</sub>,  α<sub>context</sub>, and β. Even if we did so, many samples did not satisfy our criteria (Appendix 1–figure 2). Please refer to the “Evaluation of model fit” in Appendix 1 and the legend of Appendix 1–figure 1A to 1D, where we have written the estimated parameter values.

      We did not agree that paralyzing the contextual learning by setting  α<sub>context</sub>  as 0 in the RW model can explain the CR curve of 12-month-old AD mice well. Specifically, the RW model cannot capture the between-day extinction dynamics (i.e., the increase in CR at the beginning of day 2 extinction)  and the higher CR at test 3 relative to test 2 (i.e., DI between test 3 and test 2 is greater than 0.5). In addition, because the context input (= 0.2) was relatively lower than the CS input (= 1), and there is only a single unsignaled shock trial, even setting  α<sub>context</sub> = 1 results in only a limited increase in CR (Appendix 1–figure 1A to 1D; see also Author response image 2 9). Thus, the RW model cannot replicate the reinstatement effect or the critical pattern of discrimination index, even under conditions of stronger contextual learning.  

      (3) As stated by the authors in the introduction, the advantage of the fear learning approach is that the memory is modified across the acquisition-extinction-reinstatement phases. Although perhaps not explicitly stated by the authors, the post-reinstatement test (test 3) is the crucial test for whether there is reactivation of a previously stored memory, with the general argument being that the reinvigorated response to the CS can't simply be explained by relearning the CS-US pairing, because re-exposure the US alone leads to increase response to the CS at test. Of course there are several explanations for why this may occur, particularly when also considering the context as a stimulus. This is what I understood to be the justification for the use of a model, such as the latent cause model, that may better capture and compare these possibilities within a single framework. As such, it is critical to look at the level of responding to both the context alone and to the CS. It appears that the authors only look at the percent freezing during the CS, and it is not clear whether this is due to the contextual-US learning during the US re-exposure or to increased responding to the CS - presumably caused by reactivation of the acquisition memory. The authors do perform a comparison between the preCS and CS period, but it is not clear whether this is taken into account in the LCM. For example, the instance of the model shown in figure 1 indicates that the 'extinction cause', or cause z6, develops a strong weight for the context during the reinstatement phase of presenting the shock alone. This state then leads to increased freezing during the final CS probe test as shown in the figure. If they haven't already, I think the authors must somehow incorporate these different phases (CS vs ITI) into their model, particularly since this type of memory retrieval that depends on assessing latent states is specifically why the authors justified using the latent causal model. In more precise terms, it's not clear whether the authors incorporate a preCS/ITI period each day the cue is presented as a vector of just the context in addition to the CS period in which the vector contains both the context and the CS. Based on the description, it seemed to me that they only model the CRs during the CS period on days when the CS is presented, and thereby the context is only ever modeled on its own (as just the context by itself in the vector) on extinction days when the CS is not presented. If they are modeling both timepoints each day that the CS I presented, then I would recommend explicitly stating this in the methods section.

      In this study, we did not model the preCS freezing rate, and we thank the reviewer for the suggestion to model preCS periods as separate context-only trials. In our view, however, this approach is not consistent with the assumptions of the LCM. Our rationale is that the available periods of context and the CS are different. We assume that observation of the context lasts from preCS to CS. If we simulate both preCS (context) and CS (context and tone), the weight of context would be updated twice. Instead, we follow the same method as described in the original code from Gershman et al. (2017) to consider the context effect. We agree that explicitly modeling preCS could provide additional insights, but we believe it would require modifying or extending the LCM. We consider this an important direction for future research, but it is outside the scope of this study.

      (4) The authors fit the model using all data points across acquisition and learning. As one of the other reviewers has highlighted, it appears that there is a high chance for overfitting the data with the LCM. Of course, this would result in much better fits than models with substantially fewer free parameters, such as the RW model. As mentioned above, the authors should use a method that takes into account the number of parameters, such as the BIC.

      Please refer to the reply to public review (2A) for the reason we did not take the suggestion to use BIC. In addition, we feel that we have adequately addressed the concern of overfitting in the first round of the review. 

      (5) The authors have stated that they do not think the Barnes maze task can be modeled with the LCM. Whether or not this is the case, if the authors do not model this data with the LCM, the Barnes maze data doesn't appear valuable to the main hypothesis. The authors suggest that more sophisticated models such as the LCM may be beneficial for early detection of diseases such as Alzheimer's, so the Barnes maze data is not valuable for providing evidence of this hypothesis. Rather, the authors make an argument that the memory deficits in the Barnes maze mimic the reinstatement effects providing support that memory is disrupted similarly in these mice. Although, the authors state that the deficits in memory retrieval are similar across the two tasks, the authors are not explicit as to the precise deficits in memory retrieval in the reinstatement task - it's a combination of overgeneralizing latent causes during acquisition, poor learning rate, over differentiation of the stimuli.

      We would like to clarify that we valued the latent cause model not solely because it is more sophisticated and fits more data points, but it is an internal model that implicates the cognitive process. Please also see the reply to the recommendations to authors (3) about the reason why we did not take the suggestion to remove this data.

      Reviewer #3 (Public review):

      Summary:

      This paper seeks to identify underlying mechanisms contributing to memory deficits observed in Alzheimer's disease (AD) mouse models. By understanding these mechanisms, they hope to uncover insights into subtle cognitive changes early in AD to inform interventions for early-stage decline.

      Strengths:

      The paper provides a comprehensive exploration of memory deficits in an AD mouse model, covering early and late stages of the disease. The experimental design was robust, confirming age-dependent increases in Aβ plaque accumulation in the AD model mice and using multiple behavior tasks that collectively highlighted difficulties in maintaining multiple competing memory cues, with deficits most pronounced in older mice.

      In the fear acquisition, extinction, and reinstatement task, AD model mice exhibited a significantly higher fear response after acquisition compared to controls, as well as a greater drop in fear response during reinstatement. These findings suggest that AD mice struggle to retain the fear memory associated with the conditioned stimulus, with the group differences being more pronounced in the older mice.

      In the reversal Barnes maze task, the AD model mice displayed a tendency to explore the maze perimeter rather than the two potential target holes, indicating a failure to integrate multiple memory cues into their strategy. This contrasted with the control mice, which used the more confirmatory strategy of focusing on the two target holes. Despite this, the AD mice were quicker to reach the target hole, suggesting that their impairments were specific to memory retrieval rather than basic task performance.

      The authors strengthened their findings by analyzing their data with a leading computational model, which describes how animals balance competing memories. They found that AD mice showed somewhat of a contradiction: a tendency to both treat trials as more alike than they are (lower α) and similar stimuli as more distinct than they are (lower σx) compared to controls.

      Weaknesses:

      While conceptually solid, the model struggles to fit the data and to support the key hypothesis about AD mice's inability to retain competing memories. These issues are evident in Figure 3:

      (1) The model misses trends in the data, including the gradual learning of fear in all groups during acquisition, the absence of a fear response at the start of the experiment, and the faster return of fear during reinstatement compared to the gradual learning of fear during acquisition. It also underestimates the increase in fear at the start of day 2 of extinction, particularly in controls.

      (2) The model explains the higher fear response in controls during reinstatement largely through a stronger association to the context formed during the unsignaled shock phase, rather than to any memory of the conditioned stimulus from acquisition (as seen in Figure 3C). In the experiment, however, this memory does seem to be important for explaining the higher fear response in controls during reinstatement (as seen in Author Response Figure 3). The model does show a necessary condition for memory retrieval, which is that controls rely more on the latent causes from acquisition. But this alone is not sufficient, since the associations within that cause may have been overwritten during extinction. The Rescorla-Wagner model illustrates this point: it too uses the latent cause from acquisition (as it only ever uses a single cause across phases) but does not retain the original stimulus-shock memory, updating and overwriting it continuously. Similarly, the latent cause model may reuse a cause from acquisition without preserving its original stimulus-shock association.

      These issues lead to potential overinterpretation of the model parameters. The differences in α and σx are being used to make claims about cognitive processes (e.g., overgeneralization vs. over differentiation), but the model itself does not appear to capture these processes accurately.

      The authors could benefit from a model that better matches the data and captures the retention and retrieval of fear memories across phases. While they explored alternatives, including the Rescorla-Wagner model and a latent state model, these showed no meaningful improvement in fit. This highlights a broader issue: these models are well-motivated but may not fully capture observed behavior.

      Conclusion:

      Overall, the data support the authors' hypothesis that AD model mice struggle to retain competing memories, with the effect becoming more pronounced with age. While I believe the right computational model could highlight these differences, the current models fall short in doing so.

      We thank the reviewer for the insightful comments. For the comments (1) and (2), please refer to our previous author response to comments #26 and #27. We recognize that the models tested in this study have limitations and, as noted, do not fully capture all aspects of the observed behavioral data. We see this as an important direction for future research and value the reviewer’s suggestions.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      I have maintained some of the main concerns included in the first round of reviews as I think they remain concerns with the new draft, even though the authors have included substantially more analysis of their data, which is appreciated. I particularly found the inclusion of the comparative modeling valuable, although I think the analysis comparing the models should be improved.

      (1) This relates to point 1 in the public assessment or #16 in the response to reviewers from the authors. The authors raise the point that even a low posterior can drive behavioral expression (lines 361-365 in the response to authors), and so the acquisition latent cause may partially drive reinstatement. Yet in the stimulation shown in figure 1D, this does not seem to be the case. As I mentioned in the public response, in figure 1, the posteriors for z<sub>A</sub> are similar on day 34 and day 36, yet only on day 36 is there a strong CR. At least in this example, it does not appear that z<sub>A</sub> contributes to the increased responding from day 34 (test 2) to day 36 (test 3). There may be a slight increase in z1 in figure 3C, but the dominant change from day 34 to day 36 appears to be the increase in the posterior of z3 and the substantial increase in w3. The authors then cite several papers which have shown the shift in balance between what it is the putative acquisition memory and extinction memory (i.e. Lacagnina et al.). Yet I do not see how this modeling fits with most of the previous findings. For example, in the Lacagnina et al. paper, activation of the acquisition ensemble or inhibition of the extinction ensemble drives freezing, whereas the opposite pattern reduces freezing. What appears to be the pattern in the modeling in this paper is primarily learning of context in the extinction latent cause to predict the shock. As I mention in point 2C of the public review, it's not clear why this pattern of results would require a latent cause model. Would a high alpha for context and not the CS not give a similar pattern of results in the RW model? At least for giving similar results of the DIs in figure 2?

      First, we would like to clarify that the x-axis in Figure 1D is labeled “Trial,” not “Day.” Please refer to the reply to public review (1), where we clarified the posterior probability of the latent cause from trials 34 to 36. Second, although we did not have direct neural circuit evidence in this study, we discussed the similarities between previous findings and the modeling in the first review. Briefly, our main point focuses on the interaction between acquisition and extinction memory. In other words, responses at different times arise from distinct internal states made up of competing memories. We assume that the reviewer expects a modeling result showing nearly full recovery of acquisition memory, which aligns with previous findings where optogenetic activation of the acquisition engram can partially mimic reinstatement (Zaki et al., 2022; see also the response to comment #12 in the first round of review). We acknowledge that such a modeling result cannot be achieved with the latent cause model and see it as a potential future direction for model improvement.

      Please also refer to the reply to public review (2) about how a high alpha for context in the RW model cannot explain the pattern we observed in the reinstatement paradigm.

      (2) This is related to point 3 in the public comments and #13 in the response to reviewers. I raised the question of comparing the preCS/ITI period with the CS period, but my main point was why not include these periods in the LCM itself as mentioned in more detail in point 3 in the current public review. The inclusion of the comparisons the authors performed helped, but my main point was that the authors could have a better measure of wcontext if they included the preCS period as a stimulus each day (when only the context is included in the stimulus). This would provide better estimates of wcontext. As stated in the public review, perhaps the authors did this, but my understanding of the methods this was not the case, rather, it seems the authors only included the CS period for CRs within the model (at least on days when the CS was present).

      Please refer to the reply to public review (3) about the reason why we did not model the preCS freezing rate.

      (3) This relates to point 4 in the public review and #15 and #24 in the response to authors. The authors have several points for why the two experiments are similar and how results may be extrapolated - lines 725-733. The first point is that associative learning is fundamental in spatial learning. I'm not sure that this broad connection between the two studies is particularly insightful for why one supports the other as associative learning is putatively involved in most behavioral tasks. In the second point about reversals, why not then use a reversal paradigm that would be easier to model with LCM? This data is certainly valuable and interesting, yet I don't think it's helpful for this paper to state qualitatively the similarities in the potential ways a latent cause framework might predict behavior on the Barnes maze. I would recommend that the authors either model the behavior with LCM, remove the experiment from the paper, or change the framing of the paper that LCM might be an ideal approach for early detection of dementia or Alzheimer's disease.

      We would like to clarify that our aim was not to present the LCM as an ideal tool for early detection of AD symptoms. Rather, our focus is on the broader idea of utilizing internal models and estimating individual internal states in early-stage AD. Regarding using a reversal paradigm that would be easier to model with LCM, the most straightforward approach is to use another type of paradigm for fear conditioning, then to examine the extent to which similar behavioral characteristics are observed between paradigms within subjects. However, re-exposing the same mice to such paradigms is constrained by strong carry-over effects, limiting the feasibility of this experiment. Other behavioral tasks relevant to AD that avoid shock generally involve action selection for subsequent observation (Webster et al., 2014), which falls outside the structure of LCM. Our rationale for including the Barnes maze task is that spatial memory deficit is implicated in the early stage of AD, making it relevant for translational research. While we acknowledge that exact modeling of Barnes maze behavior would require a more sophisticated model (as discussed in the first round of review), our intention to use the reversal Barnes maze paradigm is to suggest a presumable memory modification learning in a non-fear conditioning paradigm. We also discussed whether similar deficits in memory modification could be observed across two behavioral tasks.

      (4) Reviewer # mentioned that the change in pattern of behavior only shows up in the older mice questioning the clinical relevance of early detection. I do think this is a valid point and maybe should be addressed. There does seem to be a bit of a bump in the controls on day 23 that doesn't appear in the 6-month group. Perhaps this was initially a spontaneous recovery test indicated by the dotted vertical line? This vertical line does not appear to be defined in the figure 1 legend, nor in figures 2 and 3.

      We would like to emphasize that the App<sup>NL-G-F</sup> knock-in mouse is widely considered a model of early-stage AD, characterized by Aβ accumulation with little to no neurofibrillary tangle pathology or neuronal loss (see Introduction). By examining different ages, we can assess the contribution of both the amount and duration of Aβ accumulation as well as age-related factors. Modeling the deficit in the memory modification process in the older App<sup>NL-G-F</sup> knock-in mice, we suggested a diverged internal state in early-stage AD in older age, and this does not diminish the relevance of the model for studying early cognitive changes in AD.

      We would also like to clarify again that the x-axis in the figure is “Trial,” not “Day.” The vertical dashed lines in these figures indicate phase boundaries, and they were defined in the figure legend: in Figure 1C, “The vertical dashed lines separate the phases.”; in Figure 2B, “The dashed vertical line separates the extinction 1 and extinction 2 phases.”; in Figure 3, “The vertical dashed lines indicate the boundaries of phases.”

      (5) Are the examples in figure 3 good examples? The example for the 12-month-old control shows a substantial increase in weights for the context during test 3, but not for the CS. Yet in the bar plots in Figure 4 G and H, this pattern seems to be different. The weights for the context appear to substantially drop in the "after extinction" period as compared to the "extinction" period. It's hard to tell the change from "extinction" to "after extinction" for the CS weights (the authors change the y-axis for the CS weights but not for the context weights from panels G to H).

      We would like to clarify that in Figure 3C, the increase in weights for context is not presented during test 3 (trial 36), noted by the reviewer; rather, it is the unsignaled shock phase (trial 35).

      We assumed that the reviewer might misunderstand that the labels on the left in Figure 4, “Acquisition”, “Extinction”, and “After extinction”, indicate the time point. However, the data shown in Figure 4C to 4H are all from the same time point: test 3 (trial 36). The grouping reflects the classification of latent causes based on the trial in which they were inferred. In addition, for Figures 4G and 4H, the y‐axis limits were not set identically because the data range for “Sum of w<sub>CS</sub>” varied. This was done to ensure the visibility of all data points. In Figure 4, each dot represents one animal. Take Figure 3D as an example. The point in Figure 4G is the sum of w3 and w4 in trial 36, and the point in Figure 4H is w5 in trial 36, note that the subscript numerals indicate latent cause index. We hope this addresses the reviewer’s question about the difference between the two figures.


      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      The authors show certain memory deficits in a mouse knock-in model of Alzheimer's Disease (AD). They show that the observed memory deficits can be explained by a computational model, the latent cause model of associative memory. The memory tasks used include the fear memory task (CFC) and the 'reverse' Barnes maze. Research on AD is important given its known huge societal burden. Likewise, better characterization of the behavioral phenotypes of genetic mouse models of AD is also imperative to advance our understanding of the disease using these models. In this light, I applaud the authors' efforts.

      Strengths:

      (1) Combining computational modelling with animal behavior in genetic knock-in mouse lines is a promising approach, which will be beneficial to the field and potentially explain any discrepancies in results across studies as well as provide new predictions for future work.

      (2) The authors' usage of multiple tasks and multiple ages is also important to ensure generalization across memory tasks and 'modelling' of the progression of the disease.

      Weaknesses:

      [#1] (1) I have some concerns regarding the interpretation of the behavioral results. Since the computational model then rests on the authors' interpretation of the behavioral results, it, in turn, makes judging the model's explanatory power difficult as well. For the CFC data, why do knock-in mice have stronger memory in test 1 (Figure 2C)? Does this mean the knock-in mice have better memory at this time point? Is this explained by the latent cause model? Are there some compensatory changes in these mice leading to better memory? The authors use a discrimination index across tests to infer a deficit in re-instatement, but this indicates a relative deficit in re-instatement from memory strength in test 1. The interpretation of these differential DIs is not straightforward. This is evident when test 1 is compared with test 2, i.e., the time point after extinction, which also shows a significant difference across groups, Figure 2F, in the same direction as the re-instatement. A clarification of all these points will help strengthen the authors' case.

      We appreciate the reviewer for the critical comments. According to the latent cause framework, the strength of the memory is influenced by at least 2 parameters: associative weight between CS and US given a latent cause and posterior probability of the latent cause. The modeling results showed that a higher posterior probability of acquisition latent cause, but not higher associative weight, drove the higher test 1 CR in App<sup>NL-G-F</sup> mice (Results and Discussion; Figure 4 – figure supplement 3B, 3C). In terms of posterior, we agree that App<sup>NL-G-F</sup> mice have strong fear memory. On the other hand, this suggests that App<sup>NL-G-F</sup> mice exhibited a tendency toward overgeneralization, favoring modification of old memories, which adversely affected the ability to retain competing memories. The strong memory in test 1 would be a compensatory effect of overgeneralization.    

      To estimate the magnitude of reinstatement, at least, one would have to compare CRs between test 2 (extinction) and test 3 (reinstatement), as well as those between test 1 (acquisition) and test 3. These comparisons represent the extent to which the memory at the reinstatement is far from that in the extinction, and close to that in the acquisition. Since discrimination index (DI) has been widely used as a normalized measure to evaluate the extent to which the system can distinguish between two conditions, we applied DI consistently to behavioral and simulated data in the reinstatement experiment, and the behavioral data in the reversal Barnes maze experiment, allowing us to evaluate the discriminability of an agent in these experiments. In addition, we used DI to examine its correlation with estimated parameters, enabling us to explore how individual discriminability may relate to the internal state. We have already discussed the differences in DI between test 3 and test 1, as well as CR in test 1 between control and App<sup>NL-G-F</sup> in the manuscript and further elaborated on this point in Line 232, 745-748.   

      [#2] (2) I have some concerns regarding the interpretation of the Barnes maze data as well, where there already seems to be a deficit in the memory at probe test 1 (Figure 6C). Given that there is already a deficit in memory, would not a more parsimonious explanation of the data be that general memory function in this task is impacted in these mice, rather than the authors' preferred interpretation? How does this memory weakening fit with the CFC data showing stronger memories at test 1? While I applaud the authors for using multiple memory tasks, I am left wondering if the authors tried fitting the latent cause model to the Barnes maze data as well.

      While we agree that the deficits shown in probe test 1 may imply impaired memory function in App<sup>NL-G-F</sup> mice in this task, it would be difficult to explain this solely in terms of impairments in general memory function. The learning curve and the daily strategy changes suggested that App<sup>NL-G-F</sup> mice would have virtually intact learning ability in the initial training phase (Figure 6B, 6F, Figure 6 – figure supplement 1 and 3). For the correspondence relationship between the reinstatement and the reversal Barnes maze learning from the aspect of memory modification process, please also see our reply to comment #24. We have explained why we did not fit the latent cause model to the Barnes maze data in the provisional response.

      [#3] (3) Since the authors use the behavioral data for each animal to fit the model, it is important to validate that the fits for the control vs. experimental groups are similar to the model (i.e., no significant differences in residuals). If that is the case, one can compare the differences in model results across groups (Figures 4 and 5). Some further estimates of the performance of the model across groups would help.

      We have added the residual (i.e., observed CR minus simulated CR) in Figure 3 – figure supplement 1D and 1E. The fit was similar between control and App<sup>NL-G-F</sup> mice groups in the test trials, except test 3 in the 12-month-old group. The residual was significantly higher in the 12-month-old control mice than App<sup>NL-G-F</sup> mice, suggesting the model underestimated the reinstatement in the control, yet the DI calculated from the simulated CR replicates the behavioral data (Figure 3 – figure supplement 1A to 1C). These results suggest that the latent cause model fits our data with little systematic bias such as an overestimation of CR for the control group in the reinstatement, supporting the validity of the comparisons in estimated parameters between groups. These results and discussion have been added in the manuscript Line 269-276.

      One may notice that the latent cause model overestimated the CR in acquisition trials in all groups in Figure 3 – figure supplement 1D and 1E. We have discussed this point in the reply to comment #26, 34 questioned by reviewer 3.

      [#4] (4) Is there an alternative model the authors considered, which was outweighed in terms of prediction by this model? 

      Yes, we have further evaluated two alternative models: the Rescorla-Wagner (RW; Rescorla & Wagner, 1972) model and the latent state model (LSM; Cochran & Cisler, 2019). The RW model serves as a baseline, given its known limitations in explaining fear return after extinction. The LSM is another contemporary model that shares several concepts with the latent cause model (LCM) such as building upon the RW model, assuming a latent variable inferred by Bayes’ rule, and involving a ruminative update for memory modification. We evaluated the three models in terms of the prediction accuracy and reproducibility of key behavioral features. Please refer to the Appendix 1 for detailed methods and results for these two models.

      As expected, the RW model fit well to the data till the end of extinction but failed to reproduce reinstatement (Appendix 1 – figure 1A to 1D). Due to a large prediction error in test 3, few samples met the acceptance criteria we set (Appendix 1 – figure 2 and 3A). Conversely, the LSM reproduced reinstatement, as well as gradual learning in acquisition and extinction phases, particularly in the 12month-old control (Appendix 1 – figure 1G). The number of accepted samples in the LSM was higher than in the RW model but generally lower than in the LCM (Appendix 1 – figure 2). The sum of prediction errors over all trials in the LSM was comparable to that in the LCM in the 6-month-old group (Appendix 1 – figure 4A), it was significantly lower in the 12-month-old group (Appendix 1 – figure 4B). Especially the LSM generated smaller prediction errors during the acquisition trials than in the LCM, suggesting that the LSM might be better at explaining the behaviors of acquisition (Appendix 1 – figure 4A and 4B; but see the reply for comment #34). While the LSM generated smaller prediction errors than the LCM in test 2 of the control group, it failed to replicate the observed DIs, a critical behavioral phenotype difference between control and App<sup>NL-G-F</sup> mice (Appendix 1 – figure 6A to 6C; cf. Figure 2F to 2H, Figure 3 – figure supplement 1A to 1C).

      Thus, although each model could capture different aspects of reinstatement, standing on the LCM to explain the reinstatement better aligns with our purpose. It should also be noted that we did not explore all parameter spaces of the LSM, hence we cannot rule out the possibility that alternative parameter sets could provide a better fit and explain the memory modification process well. A more comprehensive parameter search in the LSM may be a valuable direction for future research. 

      [#5] One concern here is also parameter overfitting. Did the authors try leaving out some data (trials/mice) and predicting their responses based on the fit derived from the training data?

      Following the reviewer’s suggestion, we confirmed if overfitting occurred using all trials to estimate parameters. Estimating parameters while actually leaving out trials would disorder the time lapse across trials, and thereby the prior of latent causes in each trial. Instead, we removed the constraint of prediction error by setting the error threshold to 1 for certain trials to virtually leave these trials out. We treated these trials as a virtual “training” dataset, while the rest of the trials were a “test” dataset. For the median CR data of each group (Figure 3), we estimated parameters under 6 conditions with unique training and test trials, then evaluated the prediction error for the training and test trials. Note that training and test trials were arbitrarily decided. Also, the error threshold for the acquisition trial was set to 1 as described in Materials and Methods, which we have further discussed the reason in the reply to comment #34 and treated acquisition trials separately from the test trials. We expect that the contribution of the data from the acquisition and test trials for parameter estimation could be discounted compared to those from the training trials with the constraint, and if overfitting occurred, the prediction error in the test data would be worse than that in the training trials.

      Author response image 1A to 1F showed the simulated and observed CR under each condition, where acquisition trials were in light-shaded areas, test trials were in dark-shaded areas, and the rest of the trials were training trials. Author response image 1G showed mean squared prediction error across the acquisition, training and test trials under each condition. The dashed gray line showed the mean squared prediction error of training trials in Figure 3 as a baseline.

      In conditions i and ii, where two or four trials in the extinction were used for training (Author response image 1A and 1B), the prediction error was generally higher in test trials than in training trials. In conditions iii and iv where ten trials in the extinction were used for training (Author response image 1C and 1D), the difference in prediction error between testing and training trials became smaller. These results suggest that providing more extinction trial data would reduce overfitting. In condition v (Author response image 1E), the results showed that using trials until extinction can predict reinstatement in control mice but not App<sup>NL-G-F</sup> mice. Similarly, in condition vi (Author response image 1F), where test phase trials were left out, the prediction error differences were greater in App<sup>NL-G-F</sup> mice. These results suggest that the test trials should be used for the parameter estimation to minimize prediction error for all groups. Overall, this analysis suggests that using all trials would reduce prediction error with few overfitting. 

      Author response image 1.

      Leaving trials out in parameter estimation in the latent cause model. (A – F) The observed CR (colored line) is the median freezing rate during the CS presentation over the mice within each group, which is the same as that in Figure 3. The colors indicate different groups: orange represents 6-month-old control, light blue represents 6-month-old App<sup>NL-G-F</sup> mice, pink represents 12-month-old control, and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice. Under six different leave-out conditions (i – vi), parameters were estimated and used for generating simulated CR (gray line). In each condition, trials were categorized as acquisition (light-shaded area), training data (white area), and test data (dark-shaded area) based on the error threshold during parameter estimation. Only the error threshold of the test data trial was different from the original method (see Material and Method) and set to 1. In conditions i to vi, the number of test data trials is 27, 25, 19, and 19 in extinction phases. In condition v, the number of test data trials is 2 (trials 35 and 36). In condition vi, test data trials were the 3 test phases (trials 4, 34, and 36). (G) Each subplot shows the mean squared prediction error for the test data trial (gray circles), training data trial (white squares), and acquisition trial (gray triangles) in each group. The left y-axis corresponds to data from test and training trials, and the right y-axis corresponds to data from acquisition trials. The dashed line indicates the results calculated from Figure 3 as a baseline.  

      Reviewer #1 (Recommendations for the authors):

      Minor:

      [#6] (1) I would like the authors to further clarify why 'explaining' the reinstatement deficit in the AD mouse model is important in working towards the understanding of AD i.e., which aspect of AD this could explain etc.

      In this study, we utilized the reinstatement paradigm with the latent cause model as an internal model to illustrate how estimating internal states can improve understanding of cognitive alteration associated with extensive Aβ accumulation in the brain. Our findings suggest that misclassification in the memory modification process, manifesting as overgeneralization and overdifferentiation, underlies the memory deficit in the App<sup>NL-G-F</sup> knock-in model mice. 

      The parameters in the internal model associated with AD pathology (e.g., α and σ<sub>x</sub><sup>2</sup> in this study) can be viewed as computational phenotypes, filling the explanatory gap between neurobiological abnormalities and cognitive dysfunction in AD. This would advance the understanding of cognitive symptoms in the early stages of AD beyond conventional behavioral endpoints alone.

      We further propose that altered internal states in App<sup>NL-G-F</sup> knock-in mice may underlie a wide range of memory-related symptoms in AD as we observed that App<sup>NL-G-F</sup> knock-in mice failed to retain competing memories in the reversal Barnes maze task. We speculate on how overgeneralization and overdifferentiation may explain some AD symptoms in the manuscript:

      - Line 565-569: overgeneralization may explain deficits in discriminating highly similar visual stimuli reported in early-stage AD patients as they misclassify the lure as previously learned object

      - Line 576-579: overdifferentiation may explain impaired ability to transfer previously learned association rules in early-stage AD patients as they misclassify them as separated knowledge. 

      - Line 579-582: overdifferentiation may explain delusions in AD patients as an extended latent cause model could simulate the emergence of delusional thinking

      We provide one more example here that overgeneralization may explain that early-stage AD patients are more susceptible to proactive interference than cognitively normal elders in semantic memory tests (Curiel Cid et al., 2024; Loewenstein et al., 2015, 2016; Valles-Salgado et al., 2024), as they are more likely to infer previously learned material. Lastly, we expect that explaining memory-related symptoms within a unified framework may facilitate future hypothesis generation and contribute to the development of strategies for detecting the earliest cognitive alteration in AD.  

      [#7] (2) The authors state in the abstract/introduction that such computational modelling could be most beneficial for the early detection of memory disorders. The deficits observed here are pronounced in the older animals. It will help to further clarify if these older animals model the early stages of the disease. Do the authors expect severe deficits in this mouse model at even later time points?

      The early stage of the disease is marked by abnormal biomarkers associated with Aβ accumulation and neuroinflammation, while cognitive symptoms are mild or absent. This stage can persist for several years during which the level of Aβ may reach a plateau. As the disease progresses, tau pathology and neurodegeneration emerge and drive the transition into the late stage and the onset of dementia. The App<sup>NL-G-F</sup> knock-in mice recapitulate the features present in the early stage (Saito et al., 2014), where extensive Aꞵ accumulation and neuroinflammation worsen along with ages (Figure 2 – figure supplement 1). Since App<sup>NL-G-F</sup> knock-in mice are central to Aβ pathology without tauopathy and neurodegeneration, it should be noted that it does not represent the full spectrum of the disease even at advanced ages. Therefore, older animals still model the early stages of the diseases and are suitable to study the long-term effect of Aβ accumulation and neuroinflammation. 

      The age tested in previous reports using App<sup>NL-G-F</sup> mice spanned a wide range from 2 months old to 24 months old. Different behavioral tasks have varied sensitivity but overall suggest the dysfunction worsens with aging (Bellio et al., 2024; Mehla et al., 2019; Sakakibara et al., 2018). We have tested the reinstatement experiment with 17-month-old App<sup>NL-G-F</sup> mice before (Author response image 2). They showed more advanced deficits with the same trends observed in 12-month-old App<sup>NL-G-F</sup> mice, but their freezing rates were overall at a lower level. There is a concern that possible hearing loss may affect the results and interpretation, therefore we decided to focus on 12-month-old data.

      Author response image 2.

      Freezing rate across reinstatement paradigm in the 17-month-old App<sup>NL-G-F</sup> mice. Dashed and solid lines indicate the median freezing rate over 34 mice before (preCS) and during (CS) tone presentation, respectively. Red, blue, and yellow backgrounds represent acquisition, extinction, and unsignaled shock in Figure 2A. The dashed vertical line separates the extinction 1 and extinction 2 phases.

      [#8] (3) There are quite a few 'marginal' p-values in the paper at p>0.05 but near it. Should we accept them all as statistically significant? The authors need to clarify if all the experimental groups are sufficiently powered.

      For our study, we decided a priori that p < 0.05 would be considered statistically significant, as described in the Materials and Methods. Therefore, in our Results, we did not consider these marginal values as statistically significant but reported the trend, as they may indicate substantive significance.

      We described our power analysis method in the manuscript Line 897-898 and have provided the results in Tables S21 and S22.

      [#9] (4) The authors emphasize here that such computational modelling enables us to study the underlying 'reasoning' of the patient (in the abstract and introduction), I do not see how this is the case. The model states that there is a latent i.e. another underlying variable that was not previously considered.

      Our use of the term “reasoning” was to distinguish the internal model, which describes how an agent makes sense of the world, from other generative models implemented for biomarker and disease progression prediction. However, we agree that using “reasoning” may be misleading and imprecise, so to reduce ambiguity we have removed this word in our manuscript Line 27: Nonetheless, internal models of the patient remain underexplored in AD; Line 85: However, previous approaches did not suppose an internal model of the world to predict future from current observation given prior knowledge.   

      [#10] (5) The authors combine knock-in mice with controls to compute correlations of parameters of the model with behavior of animals (e.g. Figure 4B and Figure 5B). They run the risk of spurious correlations due to differences across groups, which they have indeed shown to exist (Figure 4A and 5A). It would help to show within-group correlations between DI and parameter fit, at least for the control group (which has a large spread of data).

      We agree that genotype (control, App<sup>NL-G-F</sup>) could be a confounder between the estimated parameters and DI, thereby generating spurious correlations. To address this concern, we have provided withingroup correlation in Figure 4 – figure supplement 2 for the 12-month-old group and Figure 5 – figure supplement 2 for the 6-month-old group.

      In the 12-month-old group, the significant positive correlation between σx2 and DI remained in both control and App<sup>NL-G-F</sup> mice even if we adjusted the genotype effect, suggesting that it is very unlikely that the correlations in Figure 4B are due to the genotype-related confounding. On the other hand, the positive correlation between α and DI was found to be significant in the control mice but not in the App<sup>NL-G-F</sup> mice. Most of α were distributed around the lower bound in App<sup>NL-G-F</sup> mice, which possibly reduced the variance and correlation coefficient. These results support our original conclusion that α and σ<sub>x</sub><sup>2</sup> are parameters associated with a lower magnitude of reinstatement in aged App<sup>NL-G-F</sup> mice.

      In the 6-month-old group, the correlations shown in Figure 5B were not preserved within subgroups, suggesting genotype would be a confounder for α, σ<sub>x</sub><sup>2</sup>, and DI. We recognized that significant correlations in Figure 5B may arise from group differences, increased sample size, or greater variance after combining control and App<sup>NL-G-F</sup> mice. 

      Therefore, we concluded that α and σ<sub>x</sub><sup>2</sup> are associated with the magnitude of reinstatement but modulated by the genotype effect depending on the age. 

      We have added interpretations of within-group correlation in the manuscript Line 307-308, 375-378.

      [#11] (6) It is unclear to me why overgeneralization of internal states will lead to the animals having trouble recalling a memory. Would this not lead to overgeneralization of memory recall instead?

      We assume that the reviewer is referring to “overgeneralization of internal states,” a case in which the animal’s internal state remained the same regardless of the observation, thereby leading to “overgeneralization of memory recall.” We agree that this could be one possible situation and appears less problematic than the case in which this memory is no longer retrievable. 

      However, in our manuscript, we did not deal with the case of “overgeneralization of internal states”. Rather, our findings illustrated how the memory modification process falls into overgeneralization or overdifferentiation and how it adversely affects the retention of competing memories, thereby causing App<sup>NL-G-F</sup> mice to have trouble recalling the same memory as the control mice. 

      According to the latent cause model, retrieval failure is explained by a mismatch of internal states, namely when an agent perceives that the current cue does not match a previously experienced one, the old latent cause is less likely to be inferred due to its low likelihood (Gershman et al., 2017). For example, if a mouse exhibited higher CR in test 2, it would be interpreted as a successful fear memory retrieval due to overgeneralization of the fear memory. However, it reflects a failure of extinction memory retrieval due to the mismatch between the internal states at extinction and test 2. This is an example that overgeneralization of memory induces the failure of memory retrieval. 

      On the other hand, App<sup>NL-G-F</sup> mice exhibited higher CR in test 1, which is conventionally interpreted as a successful fear memory retrieval. When estimating their internal states, they would infer that their observation in test 1 well matches those under the acquisition latent causes, that is the overgeneralization of fear memory as shown by a higher posterior probability in acquisition latent causes in test 1 (Figure 4 – figure supplement 3). This is an example that over-generalization of memory does not always induce retrieval failure as we explained in the reply to comment #1. 

      Reviewer #2 (Public review):

      Summary:

      This manuscript proposes that the use of a latent cause model for the assessment of memory-based tasks may provide improved early detection of Alzheimer's Disease as well as more differentiated mapping of behavior to underlying causes. To test the validity of this model, the authors use a previously described knock-in mouse model of AD and subject the mice to several behaviors to determine whether the latent cause model may provide informative predictions regarding changes in the observed behaviors. They include a well-established fear learning paradigm in which distinct memories are believed to compete for control of behavior. More specifically, it's been observed that animals undergoing fear learning and subsequent fear extinction develop two separate memories for the acquisition phase and the extinction phase, such that the extinction does not simply 'erase' the previously acquired memory. Many models of learning require the addition of a separate context or state to be added during the extinction phase and are typically modeled by assuming the existence of a new state at the time of extinction. The Niv research group, Gershman et al. 2017, have shown that the use of a latent cause model applied to this behavior can elegantly predict the formation of latent states based on a Bayesian approach, and that these latent states can facilitate the persistence of the acquisition and extinction memory independently. The authors of this manuscript leverage this approach to test whether deficits in the production of the internal states, or the inference and learning of those states, may be disrupted in knock-in mice that show both a build-up of amyloid-beta plaques and a deterioration in memory as the mice age.

      Strengths:

      I think the authors' proposal to leverage the latent cause model and test whether it can lead to improved assessments in an animal model of AD is a promising approach for bridging the gap between clinical and basic research. The authors use a promising mouse model and apply this to a paradigm in which the behavior and neurobiology are relatively well understood - an ideal situation for assessing how a disease state may impact both the neurobiology and behavior. The latent cause model has the potential to better connect observed behavior to underlying causes and may pave a road for improved mapping of changes in behavior to neurobiological mechanisms in diseases such as AD.

      Weaknesses:

      I have several substantial concerns which I've detailed below. These include important details on how the behavior was analyzed, how the model was used to assess the behavior, and the interpretations that have been made based on the model.

      [#12] (1) There is substantial data to suggest that during fear learning in mice separate memories develop for the acquisition and extinction phases, with the acquisition memory becoming more strongly retrieved during spontaneous recovery and reinstatement. The Gershman paper, cited by the authors, shows how the latent causal model can predict this shift in latent states by allowing for the priors to decay over time, thereby increasing the posterior of the acquisition memory at the time of spontaneous recovery. In this manuscript, the authors suggest a similar mechanism of action for reinstatement, yet the model does not appear to return to the acquisition memory state after reinstatement, at least based on the examples shown in Figures 1 and 3. Rather, the model appears to mainly modify the weights in the most recent state, putatively the 'extinction state', during reinstatement. Of course, the authors must rely on how the model fits the data, but this seems problematic based on prior research indicating that reinstatement is most likely due to the reactivation of the acquisition memory. This may call into question whether the model is successfully modeling the underlying processes or states that lead to behavior and whether this is a valid approach for AD.

      We thank the reviewer for insightful comments. 

      We agree that, as demonstrated in Gershman et al. (2017), the latent cause model accounts for spontaneous recovery via the inference of new latent causes during extinction and the temporal compression property provided by the prior. Moreover, it was also demonstrated that even a relatively low posterior can drive behavioral expression if the weight in the acquisition latent cause is preserved. For example, when the interval between retrieval and extinction was long enough that acquisition latent cause was not dominant during extinction, spontaneous recovery was observed despite the posterior probability of acquisition latent cause (C1) remaining below 0.1 in Figure 11D of Gershman et al. (2017). 

      In our study, a high response in test 3 (reinstatement) is explained by both acquisition and extinction latent cause. The former preserves the associative weight of the initial fear memory, while the latter has w<sub>context</sub> learned in the unsignaled shock phase. These positive w were weighted by their posterior probability and together contributed to increased expected shock in test 3. Though the posterior probability of acquisition latent cause was lower than extinction latent cause in test 3 due to time passage, this would be a parallel instance mentioned above. To clarify their contributions to reinstatement, we have conducted additional simulations and the discussion in reply to the reviewer’s next comment (see the reply to comment #13).

      We recognize that our results might appear to deviate from the notion that reinstatement results from the strong reactivation of acquisition memory, where one would expect a high posterior probability of the acquisition latent cause. However, we would like to emphasize that the return of fear emerges from the interplay of competing memories. Previous studies have shown that contextual or cued fear reinstatement involves a neural activity switch back to fear state in the medial prefrontal cortex (mPFC), including the prelimbic cortex and infralimbic cortex, and the amygdala, including ventral intercalated amygdala neurons (ITCv), medial subdivision of central nucleus of the amygdala (CeM), and the basolateral amygdala (BLA) (Giustino et al., 2019; Hitora-Imamura et al., 2015; Zaki et al., 2022). We speculate that such transition is parallel to the internal states change in the latent cause model in terms of posterior probability and associative weight change.

      Optogenetic manipulation experiments have further revealed how fear and extinction engrams contribute to extinction retrieval and reinstatement. For instance, Gu et al. (2022) used a cued fear conditioning paradigm and found that inhibition of extinction engrams in the BLA, ventral hippocampus (vHPC), and mPFC after extinction learning artificially increased freezing to the tone cue. Similar results were observed in contextual fear conditioning, where silencing extinction engrams in the hippocampus dentate gyrus (DG) impaired extinction retrieval (Lacagnina et al., 2019). These results suggest that the weakening extinction memory can induce a return of fear response even without a reminder shock. On the other hand, Zaki et al. (2022) showed that inhibition of fear engrams in the BLA, DG, or hippocampus CA1 attenuated contextual fear reinstatement. However, they also reported that stimulation of these fear engrams was not sufficient to induce reinstatement, suggesting these fear engram only partially account for reinstatement. 

      In summary, reinstatement likely results from bidirectional changes in the fear and extinction circuits, supporting our interpretation that both acquisition and extinction latent causes contribute to the reinstatement. Although it remains unclear whether these memory engrams represent latent causes, one possible interpretation is that w<sub>context</sub> update in extinction latent causes during unsignaled shock indicates weakening of the extinction memory, while preservation of w in acquisition latent causes and their posterior probability suggests reactivation of previous fear memory. 

      [#13] (2) As stated by the authors in the introduction, the advantage of the fear learning approach is that the memory is modified across the acquisition-extinction-reinstatement phases. Although perhaps not explicitly stated by the authors, the post-reinstatement test (test 3) is the crucial test for whether there is reactivation of a previously stored memory, with the general argument being that the reinvigorated response to the CS can't simply be explained by relearning the CS-US pairing, because re-exposure the US alone leads to increase response to the CS at test. Of course there are several explanations for why this may occur, particularly when also considering the context as a stimulus. This is what I understood to be the justification for the use of a model, such as the latent cause model, that may better capture and compare these possibilities within a single framework. As such, it is critical to look at the level of responding to both the context alone and to the CS. It appears that the authors only look at the percent freezing during the CS, and it is not clear whether this is due to the contextual US learning during the US re-exposure or to increased response to the CS - presumably caused by reactivation of the acquisition memory. For example, the instance of the model shown in Figure 1 indicates that the 'extinction state', or state z6, develops a strong weight for the context during the reinstatement phase of presenting the shock alone. This state then leads to increased freezing during the final CS probe test as shown in the figure. By not comparing the difference in the evoked freezing CR at the test (ITI vs CS period), the purpose of the reinstatement test is lost in the sense of whether a previous memory was reactivated - was the response to the CS restored above and beyond the freezing to the context? I think the authors must somehow incorporate these different phases (CS vs ITI) into their model, particularly since this type of memory retrieval that depends on assessing latent states is specifically why the authors justified using the latent causal model.

      To clarify the contribution of context, we have provided preCS freezing rate across trials in Figure 2 – figure supplement 2. As the reviewer pointed out, the preCS freezing rate did not remain at the same level across trials, especially within the 12-month-old control and App<sup>NL-G-F</sup> group (Figure 2 – figure supplement 2A and 2B), suggesting the effect context. A paired samples t-test comparing preCS freezing (Figure 2 – figure supplement 2E) and CS freezing (Figure 2E) in test 3 revealed significant differences in all groups: 6-month-old control, t(23) = -6.344, p < 0.001, d = -1.295; 6-month-old App<sup>NL-G-F</sup>, t(24) = -4.679, p < 0.001, d = -0.936; 12-month-old control, t(23) = -4.512, p < 0.001, d = 0.921; 12-month-old App<sup>NL-G-F</sup>, t(24) = -2.408, p = 0.024, d = -0.482. These results indicate that the response to CS was above and beyond the response to context only. We also compared the change in freezing rate (CS freezing rate minus preCS freezing rate) in test 2 and test 3 to examine the net response to the tone. The significant difference was found in the control group, but not in the App<sup>NL-GF</sup> group (Author response image 3). The increased net response to the tone in the control group suggested that the reinstatement was partially driven by reactivation of acquisition memory, not solely by the contextual US learning during the unsignaled shock phase. We have added these results and discussion in the manuscript Line 220-231.

      Author response image 3.

      Net freezing rate in test 2 and test 3. Net freezing rate is defined as the CS freezing rate (i.e., freezing rate during 1 min CS presentation) minus the preCS freezing rate (i.e., 1 min before CS presentation). The dashed horizontal line indicates no freezing rate change from the preCS period to the CS presentation. *p < 0.05 by paired-sample Student’s t-test, and the alternative hypothesis specifies that test 2 freezing rate change is less than test 3. Colors indicate different groups: orange represents 6-month-old control (n = 24), light blue represents 6-month-old App<sup>NL-G-F</sup> mice (n = 25), pink represents 12-month-old control (n = 24), and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice (n = 25). Each black dot represents one animal. Statistical results were as follows: t(23) = -1.927, p = 0.033, Cohen’s d = -0.393 in 6-month-old control; t(24) = -1.534, p = 0.069, Cohen’s d = -0.307 in 6-month-old App<sup>NL-G-F</sup>; t(23) = -1.775, p = 0.045, Cohen’s d = -0.362 in 12-month-old control; t(24) = 0.86, p = 0.801, Cohen’s d = 0.172 in 12-monthold App<sup>NL-G-F</sup>

      According to the latent cause model, if the reinstatement is merely induced by an association between the context and the US in the unsignaled shock phase, the CR given context only and that given context and CS in test 3 should be equal. However, the simulation conducted for each mouse using their estimated parameters confirmed that this was not the case in this study. The results showed that simulated CR was significantly higher in the context+CS condition than in the context only condition (Author response image 4). This trend is consistent with the behavioral results we mentioned above.

      Author response image 4.

      Simulation of context effect in test 3. Estimated parameter sets of each sample were used to run the simulation that only context or context with CS was present in test 3 (trial 36). The data are shown as median with interquartile range, where white bars with colored lines represent CR for context only and colored bars represent CR for context with CS. Colors indicate different groups: orange represents 6-month-old control (n = 15), light blue represents 6-month-old App<sup>NL-G-F</sup> mice (n = 12), pink represents 12-month-old control (n = 20), and dark blue represents 12-month-old App<sup>NL-G-F</sup> mice (n = 18). Each black dot represents one animal. **p < 0.01, and ***p < 0.001 by Wilcoxon signed-rank test comparing context only and context + CS in each group, and the alternative hypothesis specifies that CR in context is not equal to CR in context with CS. Statistical results were as follows: W = 15, p = 0.008, effect size r = -0.66 in 6-month-old control; W = 0, p < 0.001, effect size r = -0.88 in 6-month-old App<sup>NL-G-F</sup>; W = 25, p = 0.002, effect size r = -0.67 in 12-month-old control; W = 9, p = 0.002 , effect size r = -0.75 in 12-month-old App<sup>NL-G-F</sup>

      [#14] (3) This is related to the second point above. If the question is about the memory processes underlying memory retrieval at the test following reinstatement, then I would argue that the model parameters that are not involved in testing this hypothesis be fixed prior to the test. Unlike the Gershman paper that the authors cited, the authors fit all parameters for each animal. Perhaps the authors should fit certain parameters on the acquisition and extinction phase, and then leave those parameters fixed for the reinstatement phase. To give a more concrete example, if the hypothesis is that AD mice have deficits in differentiating or retrieving latent states during reinstatement which results in the low response to the CS following reinstatement, then perhaps parameters such as the learning rate should be fixed at this point. The authors state that the 12-month-old AD mice have substantially lower learning rate measures (almost a 20-fold reduction!), which can be clearly seen in the very low weights attributed to the AD mouse in Figure 3D. Based on the example in Figure 3D, it seems that the reduced learning rate in these mice is most likely caused by the failure to respond at test. This is based on comparing the behavior in Figures 3C to 3D. The acquisition and extinction curves appear extremely similar across the two groups. It seems that this lower learning rate may indirectly be causing most of the other effects that the authors highlight, such as the low σx, and the changes to the parameters for the CR. It may even explain the extremely high K. Because the weights are so low, this would presumably lead to extremely low likelihoods in the posterior estimation, which I guess would lead to more latent states being considered as the posterior would be more influenced by the prior.

      We thank the reviewer for the suggestion about fitting and fixing certain parameters in different phases.

      However, this strategy may not be optimal for our study for the following scientific reasons.

      Our primary purpose is to explore internal states in the memory modification process that are associated with the deficit found in App<sup>NL-G-F</sup> mice in the reinstatement paradigm. We did not restrict the question to memory retrieval, nor did we have a particular hypothesis such that only a few parameters of interest account for the impaired associative learning or structure learning in App<sup>NL-G-F</sup> mice while all other parameters are comparable between groups. We are concerned that restricting questions to memory retrieval at the test is too parsimonious and might lead to misinterpretation of the results. As we explain in reply to comment #5, removing trials in extinction during parameter estimation reduces the model fit performance and runs the risk of overfitting within the individual. Therefore, we estimated all parameters for each animal, with the assumption that the estimated parameter set represents individual internal state (i.e., learning and memory characteristics) and should be fixed within the animal across all trials.  

      Figure 3 is the parameter estimation and simulation results using the median data of each group as an individual. The estimated parameter value is one of the possible cases in that group to demonstrate how a typical learning curve fits the latent cause model. The reviewer mentioned “20-fold reduction in learning rate” is the comparison of two data points, not the actual comparison between groups. The comparison between control and App<sup>NL-G-F</sup> mice in the 12-month-old group for all parameters was provided in Table S7. The Mann-Whitney U test did not reveal a significant difference in learning rate (η): 12-month-old control (Mdn = 0.09, IQR=0.23) vs. 12-month-old App<sup>NL-G-F</sup> (Mdn = 0.12, IQR=0.23), U = 199, p = 0.587.  

      We agree that lower learning rate could bias the learning toward inferring a new latent cause. However, this tendency may depend on the value of other parameters and varied in different phases in the reinstatement paradigm. Here, we used ⍺ as an example and demonstrate their interaction in Appendix 2 – table 2 with relatively extreme values: ⍺ \= {1, 3} and η \= {0.01, 0.5} while the rest of the parameters fixed at the initial guess value. 

      When ⍺ = 1, the number of latent causes across phases (K<sub>acq</sub>, K<sub>ext</sub>, K<sub>rem</sub>) remain unchanged and their posterior probability in test 3 were comparable even if η increased from 0.01 to 0.5. This is an example that lower η does not lead to inferring new latent causes because of low ⍺. The effect of low learning rate manifests in test 3 CR due to low w<sub>context, acq</sub> and w<sub>context, ext</sub>

      When ⍺ = 3, the number of acquisition latent causes (K<sub>acq</sub>) was higher in the case of η = 0.01 than that of η = 0.5, showing the effect mentioned by the reviewer. However, test 1 CR is much lower when η = 0.01, indicating unsuccessful learning even after inferring a new latent cause. This is none of the cases observed in this study. During extinction phases, the effect of η is surpassed by the effect of high ⍺, where the number of extinction latent causes (K<sub>ext</sub>) is high and not affected by η. After the extinction phases, the effect of K kicks in as the total number of latent causes reaches its value (K = 33 in this example), especially in the case of η = 0.01. A new latent cause is inferred after extinction in the condition of η = 0.5, but the CR 3 is still high as the w<sub>context, acq</sub> and w<sub>context, ext</sub> are high. This is an example that a new latent cause is inferred in spite of higher η

      Overall, the learning rate would not have a prominent effect alone throughout the reinstatement paradigm, and it has a joint effect with other parameters. Note that the example here did not cover our estimated results, as the estimated learning rate was not significantly different between control and App<sup>NL-G-F</sup> mice (see above). Please refer to the reply to comment #31 for more discussion about the interaction among parameters when the learning rate is fixed. We hope this clarifies the reviewer’s concern.

      [#15] (4) Why didn't the authors use the latent causal model on the Barnes maze task? The authors mention in the discussion that different cognitive processes may be at play across the two tasks, yet reversal tasks have been suggested to be solved using latent states to be able to flip between the two different task states. In this way, it seems very fitting to use the latent cause model. Indeed, it may even be a better way to assess changes in σx as there are presumably 12 observable stimuli/locations.

      Please refer to our provisional response about the application of the latent cause model to the reversal Barnes maze task. Briefly, it would be difficult to directly apply the latent cause model to the Barnes maze data because this task involves operant learning, and thereby almost all conditions in the latent cause model are not satisfied. Please also see our reply to comment #24 for the discussion of the link between the latent cause model and Barnes maze task. 

      Reviewer #2 (Recommendations for the authors):

      [#16] (1) I had a bit of difficulty finding all the details of the model. First, I had to mainly rely on the Gershman 2017 paper to understand the model. Even then, there were certain aspects of the model that were not clear. For instance, it's not quite clear to me when the new internal states are created and how the maximum number of states is determined. After reading the authors' methods and the Gershman paper, it seems that a new internal state is generated at each time point, aka zt, and that the prior for that state decays onwards from alpha. Yet because most 'new' internal states don't ever take on much of a portion of the posterior, most of these states can be ignored. Is that a correct understanding? To state this another way, I interpret the equation on line 129 to indicate that the prior is determined by the power law for all existing internal states and that each new state starts with a value of alpha, yet I don't see the rule for creating a new state, or for iterating k other than that k iterates at each timestep. Yet this seems to not be consistent with the fact that the max number of states K is also a parameter fit. Please clarify this, or point me to where this is better defined.

      I find this to be an important question for the current paper as it is unclear to me when the states were created. Most notably, in Figure 3, it's important to understand why there's an increase in the posterior of z<sub>5</sub> in the AD 12-month mice at test. Is state z<sub>5</sub> generated at trial 5? If so, the prior would be extremely small by trial 36, making it even more perplexing why z<sub>5</sub> has such a high posterior. If its weights are similar to z<sub>3</sub> and z<sub>4</sub>, and they have been much more active recently, why would z<sub>5</sub> come into play?

      We assume that the “new internal state" the reviewer is referring to is the “new latent cause." We would like to clarify that “internal state" in our study refers to all the latent causes at a given time point and observation. As this manuscript is submitted as a Research Advance article in eLife, we did not rephrase all the model details. Here, we explain when a new latent cause is created (i.e., the prior probability of a new latent cause is greater than 0) with the example of the 12-month-old group (Figure 3C and 3D). 

      Suppose that before the start of each trial, an agent inferred the most likely latent cause with maximum posterior, and it inferred k latent causes so far. A new latent cause can be inferred at the computation of the prior of latent causes at the beginning of each trial.  

      In the latent cause model, it follows a distance-dependent Chinese Restaurant Process (CRP; Blei and Frazier, 2011). The prior of each old latent cause is its posterior probability, which is the final count of the EM update before the current. In addition, the prior of old latent causes is sensitive to the time passage so that it exponentially decreases as a forgetting function modulated by g (see Figure 2 in Gershman et al., 2017). Simultaneously, the prior of a new cause is assigned ⍺. The new latent cause is inferred at this moment. Hence, the prior of latent causes is jointly determined by ⍺, g and its posterior probability. The maximum number of latent causes K is set a priori and does not affect the prior while k < K (see also reply to comment #30 for the discussion of boundary set for K and comment #31 for the discussion of the interaction between ⍺ and K). Note that only one new latent cause can be inferred in each trial, and (k+1)<sup>th</sup> latent cause, which has never been inferred so far, is chosen as the new latent cause.

      In our manuscript, the subscript number in zₖ denotes the order in which they were inferred, not the trial number. In Figures 3C and 3D, z<sub>3</sub> and z<sub>4</sub> were inferred in trials 5 and 6 during extinction; z<sub>5</sub> is a new latent cause inferred in trial 36. Therefore, the prior of z<sub>5</sub> is not extremely small compared to z<sub>4</sub> and z<sub>3</sub>.

      In both control and App<sup>NL-G-F</sup> mice in the 12-month-old (Figures 3C and 3D), z<sub>3</sub> is dominant until trial 35. The unsignaled shock at trial 35 generates a large prediction error as only context is presented and followed by the US. This prediction error reduces posterior of z<sub>3</sub>, while increasing the posterior of z<sub>4</sub> and w<sub>context</sub> in z<sub>3</sub> and z<sub>4</sub>. This decrease of posterior of z<sub>3</sub> is more obvious in the App<sup>NL-G-F</sup> than in the control group, prompting them to infer a new latent cause z<sub>5</sub> (Figure 3C and 3D). Although Figure 3C and 3D are illustrative examples as we explained in the reply to comment #14, this interpretation would be plausible as the App<sup>NL-G-F</sup> group inferred a significantly larger number of latent causes after the extinction with slightly higher posteriors of them than those in the control group (Figure 4E).

      [#17] (2) Related to the above, Are the states z<sub>A</sub> and z<sub>B</sub> defined by the authors to help the reader group the states into acquisition and extinction states, or are they somehow grouped by the model? If the latter is true, I don't understand how this would occur based on the model. If the former, could the authors state that these states were grouped together by the author?

      We used z<sub>A</sub> and z<sub>B</sub> annotations to assist with the explanation, so this is not grouped by the model. We have stated this in the manuscript Line 181-182.

      [#18] (3) This expands on the third point above. In Figure 3D, internal states z<sub>3</sub>, z<sub>4</sub>, and z<sub>5</sub> appear to be pretty much identical in weights in the App group. It's not clear to me why then the posterior of z<sub>5</sub> would all of a sudden jump up. If I understand correctly, the posterior is the likelihood of the observations given the internal state (presumably this should be similar across z<sub>3</sub>, z<sub>4</sub>, and z<sub>5</sub>), multiplied by the prior of the state. Z3 and Z4 are the dominant inferred states up to state 36. Why would z<sub>5</sub> become more likely if there doesn't appear to be any error? I'm inferring no error because there are little or no changes in weights on trial 36, most prominently no changes inz<sub>3</sub> which is the dominant internal state in step 36. If there's little change in weights, or no errors, shouldn't the prior dominate the calculation of the posterior which would lead to z<sub>3</sub> and z<sub>4</sub> being most prominent at trial 36?

      We have explained how z<sub>5</sub> of the 12-month-old App<sup>NL-G-F</sup> was inferred in the reply to comment #16. Here, we explain the process underlying the rapid changes of the posterior of z<sub>3</sub>, z<sub>4</sub>, and z<sub>5</sub> from trial 35 to 36.

      During the extinction, the mice inferred z<sub>3</sub> given the CS and the context in the absence of US. In trial 35, they observed the context and the unsignaled shock in the absence of the CS. This reduced the likelihood for the CS under z<sub>3</sub> and thereby the posterior of z<sub>3</sub>, while relatively increasing the posterior of z<sub>4</sub>. The associative weight between the context and the US , w<sub>context</sub>, indeed increased in both z<sub>3</sub> and z<sub>4</sub>, but w<sub>context</sub> of z<sub>4</sub> was updated more than that of z<sub>3</sub> due to its higher posterior probability. At the beginning of trial 36, a new latent cause z<sub>5</sub> was inferred with a certain prior (see also the reply for comment #16), and w<sub>5</sub> = w<sub>0</sub>, where w<sub>0</sub> is the initial value of weight. After normalizing the prior over latent causes, the emergence of z<sub>5</sub> reduced the prior probability of other latent causes compared to the case where the prior of z<sub>5</sub> is 0. Since the CS was presented while the US was absent in trial 36, the likelihood of the CS and that of the US under z<sub>3</sub>, and especially z<sub>4</sub>, given the cues and w became lower than the case in which z<sub>5</sub> has not been inferred yet. Consequently, the posterior of z<sub>5</sub> became salient (Figure 3D).

      To maintain consistency across panels, we used a uniform y-axis range. However, we acknowledge that this may make it harder to notice the changes of associative weights in Figure 3D. We have provided the subpanel in Figure 3D with a smaller y-axis limit to reveal the weight changes at trial 35 in Author response image 5.

      Author response image 5.

      Magnified view of w<sub>context</sub> and wCS in the last 3 trials in Figure 3D. The graph format is the same as in Figure 3D. The weight for CS (w<sub>CS</sub>) and that for context (w<sub>context</sub>) in each latent cause across trial 34 (test 2), 35 (unsignaled shock), and 36 (test 3) in 12-month-old App<sup>NL-G-F</sup> in Figure 3D was magnified in the upper and lower magenta box, respectively.

      [#19] (8) In Figure 4B - The figure legend didn't appear to indicate at which time points the DIs are plotted.

      We have amended the figure legend to indicate that DI between test 3 and test 1 is plotted.

      [#20] (9) Lines 301-303 state that the posterior probabilities of the acquisition internal states in the 12month AD mice were much higher at test 1 and that this resulted in different levels of CR across the control and 12-month App group. This is shown in the Figure 4A supplement, but this is not apparent in Figure 3 panels C and D. Is the example shown in panel D not representative of the group? The CRs across the two examples in Figure 3 C and D look extremely similar at test 1. Furthermore, the posteriors of the internal states look pretty similar across the two groups for the first 4 trials. Both the App and control have substantial posterior probabilities for the acquisition period, I don't see any additional states at test 1. The pattern of states during acquisition looks strikingly similar across the two groups, whereas the weights of the stimuli are considerably different. I think it would help the authors to use an example that better represents what the authors are referring to, or provide data to illustrate the difference. Figure 4C partly shows this, but it's not very clear how strong the posteriors are for the 3rd state in the controls.

      Figure 3 serves as an example to explain the internal states in each group (see also the third paragraph in the reply to comment #14). Figure 4C to H showed the results from each sample for between-group comparison in selected features. Therefore, the results of direct comparisons of the parameter values and internal states between genotypes in Figure 3 are not necessarily the same as those in Figure 4. Both examples in Figure 3C and 3D inferred 2 latent causes during the acquisition. In terms of posterior till test 1 (trial 4), the two could be the same. However, such examples were not rare, as the proportion of the mice that inferred 2 latent causes during the acquisition was slightly lower than 50% in the control, and around 90% in the App<sup>NL-G-F</sup> mice (Figure 4C). The posterior probability of acquisition latent cause in test 1 showed a similar pattern (Figure 4 – figure supplement 3), with values near 1 in around 50% of the control mice and around 90% of the App<sup>NL-G-F</sup> mice.  

      [#21] (10) Line 320: This is a confusing sentence. I think the authors are saying that because the App group inferred a new state during test 3, this would protect the weights of the 'extinction' state as compared to the controls since the strength of the weight updates depends on the probability of the posterior.

      In order to address this, we have revised this sentence to “Such internal states in App<sup>NL-G-F</sup> mice would diverge the associative weight update from those in the control mice after extinction.” in the manuscript Line 349-351.

      [#22] (11) In lines 517-519 the authors address the difference in generalizing the occurrence of stimuli across the App and control groups. It states that App mice with lower alpha generalized observations to an old cause rather than attributing it as a new state. Going back to statement 3 above, I think it's important to show that the model fit of a reduction in alpha does not go hand-in-hand with a reduction in the learning rates and hence the weights. Again, if the likelihoods are diminished due to the low weights, then the fit of alpha might be reduced as well. To reiterate my point above, if the observations in changes in generalization and differentiation occur because of a reduction in the learning rate, the modeling may not be providing a particularly insightful understanding of AD, other than that poor learning leads to ineffectual generalization and differentiation. Do these findings hold up if the learning rates are more comparable across the control and App group?

      These findings were explained on the basis of comparable learning rates between control and App<sup>NL-GF</sup> mice in the 12-month-old group (see the reply to comment #14). In addition, we have conducted simulation for different ⍺ and σ<sub>x</sub><sup>2</sup> values under the condition of the fixed learning rate, where overgeneralization and overdifferentaiton still occurred (see the reply to comment #26).  

      [#23] (12) Lines 391 - 393. This is a confusing sentence. "These results suggest that App NL-G-F mice could successfully form a spatial memory of the target hole, while the memory was less likely to be retrieved by a novel observation such as the absence of the escape box under the target hole at the probe test 1." The App mice show improved behavior across days of approaching the correct hole. Is this statement suggesting that once they've approached the target hole, the lack of the escape box leads to a reduction in the retention of that memory?

      We speculated that when the mice observed the absence of the escape box, a certain prediction error would be generated, which may have driven the memory modification. In App<sup>NL-G-F</sup> mice, such modification, either overgeneralization or overdifferentiation, could render the memory of the target hole vulnerable; if overgeneralization occurred, the memory would be quickly overwritten as the goal no longer exists in this position in this maze, while if overdifferentiation occurred, a novel memory such that the goal does not exist in the maze different from previous one would be formed. In either case of misclassification, the probability of retrieving the goal position would be reduced. To reduce ambiguity in this sentence, we have revised the description in the manuscript Line 432-434 as follows: “These results suggest that App<sup>NL-G-F</sup> mice could successfully form a spatial memory of the target hole, while they did not retrieve the spatial memory of the target hole as strongly as control mice when they observed the absence of the escape box during the probe test.”

      [#24] (13) The connection between the results of Barnes maze and the fear learning paradigm is weak. How can changes in overgeneralization due to a reduction in the creation of inferred states and differentiation due to a reduced σx lead to the observations in the Barnes maze experiment?

      We extrapolated our interpretation in the reinstatement modeling to behaviors in a different behavioral task, to explore the explanatory power of the latent cause framework formalizing mechanisms of associative learning and memory modification. Here, we explain the results of the reversal Barnes maze paradigm in terms of the latent cause model, while conferring the reinstatement paradigm.

      Whilst we acknowledge that fear conditioning and spatial learning are not fully comparable, the reversal Barnes maze paradigm used in our study shares several key learning components with the reinstatement paradigm. 

      First, associative learning is fundamental in spatial learning (Leising & Blaisdell, 2009; Pearce, 2009). Although we did not make any specific assumptions of what kind of associations were learned in the Barnes maze, performance improvements in learning phases likely reflect trial-and-error updates of these associations involving sensory preconditioning or secondary conditioning. Second, the reversal training phases could resemble the extinction phase in the reinstatement paradigm, challenge previously established memory. In terms of the latent cause model, both the reversal learning phase in the reversal Barnes maze paradigm and the extinction phase in the reinstatement paradigm induce a mismatch of the internal state. This process likely introduces large prediction errors, triggering memory modification to reconcile competing memories.  

      Under the latent cause framework, we posit that the mice would either infer new memories or modify existing memories for the unexpected observations in the Barnes maze (e.g., changed location or absence of escape box) as in the reinstatement paradigm, but learn a larger number of association rules between stimuli in the maze compared to those in the reinstatement. In the reversal Barnes maze paradigm, the animals would infer that a latent cause generates the stimuli in the maze at certain associative weights in each trial, and would adjust behavior by retaining competing memories.

      Both overgeneralization and overdifferentiation could explain the lower exploration time of the target hole in the App<sup>NL-G-F</sup> mice in probe test 1. In the case of overgeneralization, the mice would overwrite the existing spatial memory of the target hole with a memory that the escape box is absent. In the case of overdifferentiation, the mice would infer a new memory such that the goal does not exist in the novel field, in addition to the old memory where the goal exists in the previous field. In both cases, the App<sup>NL-G-F</sup> mice would not infer that the location of the goal is fixed at a particular point and failed to retain competing spatial memories of the goal, leading to relying on a less precise, non-spatial strategy to solve the task.  

      Since there is no established way to formalize the Barnes maze learning in the latent cause model, we did not directly apply the latent cause model to the Barnes maze data. Instead, we used the view above to explore common processes in memory modification between the reinstatement and the Barnes maze paradigm. 

      The above description was added to the manuscript on page 13 (Line 410-414) and page 19-20 (Line 600-602, 626-639).

      [#25] (14) In the fear conditioning task, it may be valuable to separate responding to the context and the cue at the time of the final test. The mice can learn about the context during the reinstatement, but there must be an inference to the cue as it's not present during the reinstatement phase. This would provide an opportunity for the model to perhaps access a prior state that was formed during acquisition. This would be more in line with the original proposal by Gershman et al. 2017 with spontaneous recovery.

      Please refer to the reply to comment #13 regarding separating the response to context in test 3.  

      Reviewer #3 (Public review):

      Summary:

      This paper seeks to identify underlying mechanisms contributing to memory deficits observed in Alzheimer's disease (AD) mouse models. By understanding these mechanisms, they hope to uncover insights into subtle cognitive changes early in AD to inform interventions for early-stage decline.

      Strengths:

      The paper provides a comprehensive exploration of memory deficits in an AD mouse model, covering the early and late stages of the disease. The experimental design was robust, confirming age-dependent increases in Aβ plaque accumulation in the AD model mice and using multiple behavior tasks that collectively highlighted difficulties in maintaining multiple competing memory cues, with deficits most pronounced in older mice.

      In the fear acquisition, extinction, and reinstatement task, AD model mice exhibited a significantly higher fear response after acquisition compared to controls, as well as a greater drop in fear response during reinstatement. These findings suggest that AD mice struggle to retain the fear memory associated with the conditioned stimulus, with the group differences being more pronounced in the older mice.

      In the reversal Barnes maze task, the AD model mice displayed a tendency to explore the maze perimeter rather than the two potential target holes, indicating a failure to integrate multiple memory cues into their strategy. This contrasted with the control mice, which used the more confirmatory strategy of focusing on the two target holes. Despite this, the AD mice were quicker to reach the target hole, suggesting that their impairments were specific to memory retrieval rather than basic task performance.

      The authors strengthened their findings by analyzing their data with a leading computational model, which describes how animals balance competing memories. They found that AD mice showed somewhat of a contradiction: a tendency to both treat trials as more alike than they are (lower α) and similar stimuli as more distinct than they are (lower σx) compared to controls.

      Weaknesses:

      While conceptually solid, the model struggles to fit the data and to support the key hypothesis about AD mice's ability to retain competing memories. These issues are evident in Figure 3:

      [#26] (1) The model misses key trends in the data, including the gradual learning of fear in all groups during acquisition, the absence of a fear response at the start of the experiment, the increase in fear at the start of day 2 of extinction (especially in controls), and the more rapid reinstatement of fear observed in older controls compared to acquisition.

      We acknowledge these limitations and explained why they arise in the latent cause model as follows.

      a. Absence of a fear response at the start of the experiment and the gradual learning of fear during acquisition 

      In the latent cause model, the CR is derived from a sigmoidal transformation from the predicted outcome with the assumption that its mapping to behavioral response may be nonlinear (see Equation 10 and section “Conditioned responding” in Gershman et al., 2017). 

      The magnitude of the unconditioned response (trial 1) is determined by w<sub>0</sub>, θ, and λ. An example was given in Appendix 2 – table 3. In general, a higher w<sub>0</sub> and a lower θ produce a higher trial 1 CR when other parameters are fixed. During the acquisition phase, once the expected shock exceeds θ, CR rapidly approaches 1, and further increases in expected shock produce few changes in CR. This rapid increase was also evident in the spontaneous recovery simulation (Figure 11) in Gershman et al. (2017). The steepness of this rapid increase is modulated by λ such that a higher value produces a shallower slope. This is a characteristic of the latent cause model, assuming CR follows a sigmoid function of expected shock, while the ordinal relationship over CRs is maintained with or without the sigmoid function, as Gershman et al. (2017) mentioned. If one assumes that the CR should be proportional to the expected shock, the model can reproduce the gradual response as a linear combination of w and posteriors of latent causes while omitting the sigmoid transformation (Figure 3). 

      b. Increase in fear at the start of day 2 extinction

      This point is partially reproduced by the latent cause model. As shown in Figure 3, trial 24 (the first trial of day 2 extinction) showed an increase in both posterior probability of latent cause retaining fear memory and the simulated CRs in all groups except the 6-month-old control group, though the increase in CR was small due to the sigmoid transformation (see above). This can be explained by the latent cause model as 24 h time lapse between extinction 1 and 2 decreases the prior of the previously inferred latent cause, leading to an increase of those of other latent causes. 

      Unlike other groups, the 6-month-old control did not exhibit increased observed CR at trial 24

      but at trial 25 (Figure 3A). The latent cause model failed to reproduce it, as there was no increase in posterior probability in trial 24 (Figure 3A). This could be partially explained by the low value of g, which counteracts the effect of the time interval between days: lower g keeps prior of the latent causes at the same level as those in the previous trial. Despite some failures in capturing this effect, our fitting policy was set to optimize prediction among the test trials given our primary purpose of explaining reinstatement.

      c. more rapid reinstatement of fear observed in older controls compared to acquisition

      We would like to point out that this was replicated by the latent cause model as shown in Figure 3 – figure supplement 1C. The DI between test 3 and test 1 calculated from the simulated CR was significantly higher in 12-month-old control than in App<sup>NL-G-F</sup> mice (cf. Figure 2C to E).  

      [#27] (2) The model attributes the higher fear response in controls during reinstatement to a stronger association with the context from the unsignaled shock phase, rather than to any memory of the conditioned stimulus from acquisition. These issues lead to potential overinterpretation of the model parameters. The differences in α and σx are being used to make claims about cognitive processes (e.g., overgeneralization vs. overdifferentiation), but the model itself does not appear to capture these processes accurately. The authors could benefit from a model that better matches the data and that can capture the retention and recollection of a fear memory across phases.

      First, we would like to clarify that the latent cause model explains the reinstatement not only by the extinction latent cause with increased w<sub>context</sub> but also the acquisition latent cause with preserved wCS and w<sub>context</sub> (see also reply to comment #13). Second, the latent cause model primarily attributes the higher fear reinstatement in control to a lower number of latent causes inferred after extinction (Figure 4E) and higher w<sub>context</sub> in extinction latent cause (Figure 4G). We noted that there was a trend toward significance in the posterior probability of latent causes inferred after extinction (Figure 4E), which in turn influences those of acquisition latent causes. Although the posterior probability of acquisition latent cause appeared trivial and no significance was detected between control and App<sup>NL-G-F</sup> mice (Figure 4C), it was suppressed by new latent causes in App<sup>NL-G-F</sup> mice (Author response image 6).

      This indicates that App<sup>NL-G-F</sup> mice retrieved acquisition memory less strongly than control mice. Therefore, we argue that the latent cause model attributed a higher fear response in control during reinstatement not solely to the stronger association with the context but also to CS fear memory from acquisition. Although we tested whether additional models fit the reinstatement data in individual mice, these models did not satisfy our fitting criteria for many mice compared to the latent cause model (see also reply to comment #4 and #28).

      Author response image 6.

      Posterior probability of acquisition, extinction, and after extinction latent causes in test 3. The values within each bar indicate the mean posterior probability of acquisition latent cause (darkest shade), extinction latent cause (medium shade), and latent causes inferred after extinction (lightest shade) in test 3 over mice within genotype. Source data are the same as those used in Figure 4C–E (posterior of z).

      Conclusion:

      Overall, the data support the authors' hypothesis that AD model mice struggle to retain competing memories, with the effect becoming more pronounced with age. While I believe the right computational model could highlight these differences, the current model falls short in doing so.

      Reviewer #3 (Recommendations for the authors):

      [#28] Other computational models may better capture the data. Ideally, I'd look for a model that can capture the gradual learning during acquisition, and, in some mice, the inferring of a new latent cause during extinction, allowing the fear memory to be retained and referenced at the start of day 2 extinction and during later tests.

      We have further evaluated another computational model, the latent state model, and compared it with the latent cause model. The simulation of reinstatement and parameter estimation method of the latent state model were described in the Appendix.

      The latent state model proposed by Cochran and Cisler (2019) shares several concepts with the latent cause model, and well replicates empirical data under certain conditions. We expect that it can also explain the reinstatement. 

      Following the same analysis flow for the latent cause model, we estimated the parameters and simulated reinstatement in the latent state model from individual CRs and median of them. In the median freezing rate data of the 12-month-old control mice, the simulated CR replicated the observed CR well and exhibited the ideal features that the reviewer looked for: gradual learning during acquisition and an increased fear at the start of the second-day extinction (Appendix 1 – figure 1G). However, a lot of samples did not fit well to the latent state model. The number of anomalies was generally higher than that in the latent cause model (Appendix 1 – figure 2). Within the accepted samples, the sum of squared prediction error in all trials was significantly lower in the latent state model, which resulted from lower prediction error in the acquisition trials (Appendix 1 – figure 4A and 4B). In the three test trials, the squared prediction error was comparable between the latent state model and the latent cause model except for the test 2 trials in the control group (Appendix 1 – figure 4A and 4B, rightmost panel). On the other hand, almost all accepted samples continued to infer the acquisition latent states during extinction without inferring new states (Appendix 1 – figure 5B and 5E, left panel), which differed from the ideal internal states the reviewer expected. While the latent state model fit performance seems to be better than the latent cause model, the accepted samples cannot reproduce the lower DI between test 3 and test 1 in aged App<sup>NL-G-F</sup> mice (Appendix 1 – figure 6C). These results make the latent state model less suitable for our purpose and therefore we decided to stay with the latent cause model. It should also be noted that we did not explore all parameter spaces of the latent state model hence we cannot rule out the possibility that alternative parameter sets could provide a better fit and explain the memory modification process well. A more comprehensive parameter search in the LSM may be a valuable direction for future research.

      If you decide not to go with a new model, my preference would be to drop the current modeling. However, if you wish to stay with the current model, I'd like to see justification or acknowledgment of the following:

      [#29] (1) Lower bound on alpha of 1: This forces the model to infer new latent causes, but it seems that some mice, especially younger AD mice, might rely more on classical associative learning (e.g., Rescorla-Wagner) rather than inferring new causes.

      We acknowledge that the default value set in Gershman et al. (2017) is 0.1, and the constraint we set is a much higher value. However, ⍺ = 1 does not always force the model to infer new latent causes.

      In the standard form Chinese restaurant process (CRP), the prior that n<sup>th</sup> observation is assigned to a new cluster is given by ⍺ / (n - 1 + ⍺) (Blei & Gershman, 2012). When ⍺ = 1, the prior of the new cluster for the 2nd observation will be 0.5; when ⍺ = 3, this prior increases to 0.75. Thus, when ⍺ > 1, the prior of the new cluster is above chance early in the sequence, which may relate to the reviewer’s concern. However, this effect diminishes as the number of observations increases. For instance, the prior of the new cluster drops to 0.1 and 0.25 for the 10th observation when ⍺ = 1 and 3, respectively. Furthermore, the prior in the latent cause model is governed by not only α but also g, a scaling parameter for the temporal difference between successive observations (see Results in the manuscript) following “distance-dependent” CRP, then normalized over all latent causes including a new latent cause. Thus, it does not necessarily imply that ⍺ greater than 1 forces agents to infer a new latent cause_. As shown in Appendix 2 – table 4, the number of latent causes does not inflate in each trial when _α = 1. On the other hand, the high number of latent causes due to α = 2 can be suppressed when g = 0.01. More importantly, the driving force is the prediction error generated in each trial (see also comment #31 about the interaction between ⍺ and σ<sub>x</sub><sup>2</sup>). Raising the value of ⍺ per se can be viewed as increasing the probability to infer a new latent cause, not forcing the model to do so by higher α alone. 

      During parameter exploration using the median behavioral data under a wider range of ⍺ with a lower boundary at 0.1, the estimated value eventually exceeded 1. Therefore, we set the lower bound of ⍺ to be 1 is to reduce inefficient sampling. 

      [#30] (2) Number of latent causes: Some mice infer nearly as many latent causes as trials, which seems unrealistic.

      We set the upper boundary for the maximum number of latent causes (K) to be 36 to align with the infinite features of CRP. This allowed some mice to infer more than 20 latent causes in total. When we checked the learning curves in these mice, we found that they largely fluctuated or did not show clear decreases during the extinction (Author response image 7, colored lines). The simulated learning curves were almost flat in these trials (Author response image 7, gray lines). It might be difficult to estimate the internal states of such atypical mice if the sampling process tried to fit them by increasing the number of latent causes. Nevertheless, most of the samples have a reasonable total number of latent causes: 12-month-old control mice, Mdn = 5, IQR = 4; 12-month-old App<sup>NL-G-F</sup> mice, Mdn = 5, IQR = 1.75; 6-month-old control mice, Mdn = 7, IQR = 12.5; 6-month-old App<sup>NL-G-F</sup> mice, Mdn = 5, IQR = 5.25. These data were provided in Tables S9 and S12.  

      Author response image 7.

      Samples with a high number of latent causes. Observed CR (colored line) and simulated CR (gray line) for individual samples with a total number of inferred latent causes exceeding 20. 

      [#31] (3) Parameter estimation: With 10 parameters fitting one-dimensional curves, many parameters (e.g., α and σx) are likely highly correlated and poorly identified. Consider presenting scatter plots of the parameters (e.g., α vs σx) in the Supplement.

      We have provided the scatter plots with a correlation matrix in Figure 4 – figure supplement 1 for the 12-month-old group and Figure 5 – figure supplement 1 for the 6-month-old group. As pointed out by the reviewer, there are significant rank correlations between parameters including ⍺ and σ<sub>x</sub><sup>2</sup> in both the 6 and 12-month-old groups. However, we also noted that there are no obvious linear relationships between the parameters.

      The correlation above raises a potential problem of non-identifiability among parameters. First, we computed the variance inflation index (VIF) for all parameters to examine the risk of multicollinearity, though we did not consider a linear regression between parameters and DI in this study. All VIF values were below the conventional threshold 10 (Appendix 2 – table 5), suggesting that severe multicollinearity is unlikely to bias our conclusions. Second, we have conducted the simulation with different combinations of ⍺, σ<sub>x</sub><sup>2</sup>, and K to clarify their contribution to overgeneralization and overdifferentiation observed in the 12-month-old group. 

      In Appendix 2 – table 6, the values of ⍺ and σ<sub>x</sub><sup>2</sup> were either their upper or lower boundary set in parameter estimation, while the value K was selected heuristically to demonstrate its effect. Given the observed positive correlation between alpha and σ<sub>x</sub><sup>2</sup>, and their negative correlation with K (Figure 4 - figure supplement 1), we consider the product of K \= {4, 35}, ⍺ \= {1, 3} and σ<sub>x</sub><sup>2</sup> \= {0.01, 3}. Among these combinations, the representative condition for the control group is α = 3, σ<sub>x</sub><sup>2</sup> = 3, and that for the App<sup>NL-G-F</sup> group is α = 1, σ<sub>x</sub><sup>2</sup> = 0.01. In the latter condition, overgeneralization and overdifferentiation, which showed higher test 1 CR, lower number of acquisition latent causes (K<sub>acq</sub>), lower test 3 CR, lower DI between test 3 and test 1, and higher number of latent causes after extinction (K<sub>rem</sub>), was extremely induced. 

      We found conditions that fall outside of empirical correlation, such as ⍺ = 3, σ<sub>x</sub><sup>2</sup> = 0.01, also reproduced overgeneralization and overdifferentiation. Similarly, the combination, ⍺ = 1, σ<sub>x</sub><sup>2</sup> = 3, exhibited control-like behavior when K = 4 but shifted toward App<sup>NL-G-F</sup>-like behavior when K = 36. The effect of K was also evident when ⍺ = 3 and σ<sub>x</sub><sup>2</sup> = 3, where K = 36 led to over-differentiation. We note that these conditions were artificially set and likely not representative of biologically plausible. These results underscore the non-identifiability concern raised by the reviewer. Therefore, we acknowledge that merely attributing overgeneralization to lower ⍺ or overdifferentiation to lower σ<sub>x</sub><sup>2</sup> may be overly reductive. Instead, these patterns likely arise from the joint effect of ⍺, σ<sub>x</sub><sup>2</sup>, and K. We have revised the manuscript accordingly in Results and Discussion (page 11-13, 18-19).

      [#32] (4) Data normalization: Normalizing the data between 0 and 1 removes the interpretability of % freezing, making mice with large changes in freezing indistinguishable seem similar to mice with small changes.

      As we describe in our reply to comment #26, the conditioned response in the latent cause model was scaled between 0 and 1, and we assume 0 and 1 mean the minimal and maximal CR within each mouse, respectively. Furthermore, although we initially tried to fit simulated CRs to raw CRs, we found that the fitting level was low due to the individual difference in the degree of behavioral expression: some mice exhibited a larger range of CR, while others showed a narrower one. Thus, we decided to normalize the data. We agree that this processing will make the mice with high changes in freezing% indistinguishable from those with low changes. However, the freezing% changes within the mouse were preserved and did not affect the discrimination index.

      [#33] (5) Overlooking parameter differences: Differences in parameters, like w<sub>0</sub>, that didn't fit the hypothesis may have been ignored.

      Our initial hypothesis is that internal states were altered in App<sup>NL-G-F</sup> mice, and we did not have a specific hypothesis on which parameter would contribute to such a state. We mainly focus on the parameters (1) that are significantly different between control and App</sup>NL-G</sup>- mice and (2) that are significantly correlated to the empirical behavioral data, DI between test 3 and test 1. 

      In the 12-month-old group, besides ⍺ and σ<sub>x</sub><sup>2</sup>, w<sub>0</sub> and K showed marginal p-value in Mann-Whitney U test (Table S7) and moderate correlation with the DI (Table S8). While differences in K were already discussed in the manuscript, we did miss the point that w<sub>0</sub> could contribute to the differences in w between control and App<sup>NL-G-F</sup> (Figure 4G) in the previous manuscript. We explain the contribution of w<sub>0</sub> on the reinstatement results here. When other parameters are fixed, higher w<sub>0</sub> would lead to higher CR in test 3, because higher w<sub>0</sub> would allow increasing w<sub>context</sub> by the unsignaled shock, leading to reinstatement (Appendix 2 – table 7). It is likely that higher w<sub>0</sub> would be sampled through the parameter estimation in the 12-month-old control but not App<sup>NL-G-F</sup>. On the other hand, the number of latent causes is not sensitive to w<sub>0</sub> when other parameters were fixed at the initial guess value (Appendix 2 – table 1), suggesting w<sub>0</sub> has a small contribution to memory modification process. 

      Thus, we speculate that although the difference in w<sub>0</sub> between control and App<sup>NL-G-F</sup> mice may arise from the sampling process, resulting in a positive correlation with DI between test 3 and test 1, its contribution to diverged internal states would be smaller relative to α or σ<sub>x</sub><sup>2</sup> as a wide range of w<sub>0</sub> has no effect on the number of latent causes (Appendix 2 – table 7). We have added the discussion of differences in w<sub>0</sub> in the 12-month-old group in manuscript Line 357-359.

      In the 6-month-old group, besides ⍺ and σ<sub>x</sub><sup>2</sup>, 𝜃 is significantly higher in the AD mice group (Table S10) but not correlated with the DI (Table S11). We have already discussed this point in the manuscript.  

      [#34] (6) Initial response: Higher initial responses in the model at the start of the experiment may reflect poor model fit.

      Please refer to our reply to comment #26 for our explanation of what contributes to high initial responses in the latent cause model.

      In addition, achieving a good fit for the acquisition CRs was not our primary purpose, as the response measured in the acquisition phase includes not only a conditioned response to the CS and context but also an unconditioned response to the novel stimuli (CS and US). This mixed response presumably increased the variance of the measured freezing rate over individuals, therefore we did not cover the results in the discussion.

      Rather, we favor models at least replicating the establishment of conditioning, extinction and reinstatement of fear memory in order to explain the memory modification process. As we mentioned in the reply for comment #4, alternative models, the latent state model and the Rescorla-Wagner model, failed to replicate the observation (cf. Figure 3 – figure supplement 1A-1C). Thus, we chose to stand on the latent cause model as it aligns better with the purpose of this study. 

      [#35] In addition, please be transparent if data is excluded, either during the fitting procedure or when performing one-way ANCOVA. Avoid discarding data when possible, but if necessary, provide clarity on the nature of excluded data (e.g., how many, why were they excluded, which group, etc?).

      We clarify the information of excluded data as follows. We had 25 mice for the 6-month-old control group, 26 mice for the 6-month-old App<sup>NL-G-F</sup> group, 29 mice for the 12-month-old control group, and 26 mice for the 12-month-old App<sup>NL-G-F</sup> group (Table S1). 

      Our first exclusion procedure was applied to the freezing rate data in the test phase. If the mouse had a freezing rate outside of the 1.5 IQR in any of the test phases, it is regarded as an outlier and removed from the analysis (see Statistical analysis in Materials and Methods). One mouse in the 6-month-old control group, one mouse in the 6-month-old App<sup>NL-G-F</sup> group, five mice in the 12-month-old control group, and two mice in the 12-month-old App<sup>NL-G-F</sup> group were excluded.

      Our second exclusion procedure was applied during the fitting and parameter estimation (see parameter estimation in Materials and Methods). We have provided the number of anomaly samples during parameter estimation in Appendix 1 – figure 2.   

      Lastly, we would like to state that all the sample sizes written in the figure legends do not include outliers detected through the exclusion procedure mentioned above.

      [#36] Finally, since several statistical tests were used and the differences are small, I suggest noting that multiple comparisons were not controlled for, so p-values should be interpreted cautiously.

      We have provided power analyses in Tables S21 and S22 with methods described in the manuscript (Line 897-898) and added a note that not all of the multiple comparisons were corrected for in the manuscript (Line 898-899).

      References cited in the response letter only 

      Bellio, T. A., Laguna-Torres, J. Y., Campion, M. S., Chou, J., Yee, S., Blusztajn, J. K., & Mellott, T. J. (2024). Perinatal choline supplementation prevents learning and memory deficits and reduces brain amyloid Aβ42 deposition in App<sup>NL-G-F</sup> Alzheimer’s disease model mice. PLOS ONE, 19(2), e0297289. https://doi.org/10.1371/journal.pone.0297289

      Blei, D. M., & Frazier, P. I. (2011). Distance Dependent Chinese Restaurant Processes. Journal of Machine Learning Research, 12(74), 2461–2488.

      Cochran, A. L., & Cisler, J. M. (2019). A flexible and generalizable model of online latent-state learning. PLOS Computational Biology, 15(9), e1007331. https://doi.org/10.1371/journal.pcbi.1007331

      Curiel Cid, R. E., Crocco, E. A., Duara, R., Vaillancourt, D., Asken, B., Armstrong, M. J., Adjouadi, M., Georgiou, M., Marsiske, M., Wang, W., Rosselli, M., Barker, W. W., Ortega, A., Hincapie, D., Gallardo, L., Alkharboush, F., DeKosky, S., Smith, G., & Loewenstein, D. A. (2024). Different aspects of failing to recover from proactive semantic interference predicts rate of progression from amnestic mild cognitive impairment to dementia. Frontiers in Aging Neuroscience, 16. https://doi.org/10.3389/fnagi.2024.1336008

      Giustino, T. F., Fitzgerald, P. J., Ressler, R. L., & Maren, S. (2019). Locus coeruleus toggles reciprocal prefrontal firing to reinstate fear. Proceedings of the National Academy of Sciences, 116(17), 8570–8575. https://doi.org/10.1073/pnas.1814278116

      Gu, X., Wu, Y.-J., Zhang, Z., Zhu, J.-J., Wu, X.-R., Wang, Q., Yi, X., Lin, Z.-J., Jiao, Z.-H., Xu, M., Jiang, Q., Li, Y., Xu, N.-J., Zhu, M. X., Wang, L.-Y., Jiang, F., Xu, T.-L., & Li, W.-G. (2022). Dynamic tripartite construct of interregional engram circuits underlies forgetting of extinction memory. Molecular Psychiatry, 27(10), 4077–4091. https://doi.org/10.1038/s41380-022-01684-7

      Lacagnina, A. F., Brockway, E. T., Crovetti, C. R., Shue, F., McCarty, M. J., Sattler, K. P., Lim, S. C., Santos, S. L., Denny, C. A., & Drew, M. R. (2019). Distinct hippocampal engrams control extinction and relapse of fear memory. Nature Neuroscience, 22(5), 753–761. https://doi.org/10.1038/s41593-019-0361-z

      Loewenstein, D. A., Curiel, R. E., Greig, M. T., Bauer, R. M., Rosado, M., Bowers, D., Wicklund, M., Crocco, E., Pontecorvo, M., Joshi, A. D., Rodriguez, R., Barker, W. W., Hidalgo, J., & Duara, R. (2016). A Novel Cognitive Stress Test for the Detection of Preclinical Alzheimer’s Disease: Discriminative Properties and Relation to Amyloid Load. The American Journal of Geriatric Psychiatry : Official Journal of the American Association for Geriatric Psychiatry, 24(10), 804–813. https://doi.org/10.1016/j.jagp.2016.02.056

      Loewenstein, D. A., Greig, M. T., Curiel, R., Rodriguez, R., Wicklund, M., Barker, W. W., Hidalgo, J., Rosado, M., & Duara, R. (2015). Proactive Semantic Interference Is Associated With Total and Regional Abnormal Amyloid Load in Non-Demented Community-Dwelling Elders: A Preliminary Study. The American Journal of Geriatric Psychiatry : Official Journal of the American Association for Geriatric Psychiatry, 23(12), 1276–1279. https://doi.org/10.1016/j.jagp.2015.07.009

      Valles-Salgado, M., Gil-Moreno, M. J., Curiel Cid, R. E., Delgado-Á lvarez, A., Ortega-Madueño, I., Delgado-Alonso, C., Palacios-Sarmiento, M., López-Carbonero, J. I., Cárdenas, M. C., MatíasGuiu, J., Díez-Cirarda, M., Loewenstein, D. A., & Matias-Guiu, J. A. (2024). Detection of cerebrospinal fluid biomarkers changes of Alzheimer’s disease using a cognitive stress test in persons with subjective cognitive decline and mild cognitive impairment. Frontiers in Psychology, 15. https://doi.org/10.3389/fpsyg.2024.1373541

      Zaki, Y., Mau, W., Cincotta, C., Monasterio, A., Odom, E., Doucette, E., Grella, S. L., Merfeld, E., Shpokayte, M., & Ramirez, S. (2022). Hippocampus and amygdala fear memory engrams reemerge after contextual fear relapse. Neuropsychopharmacology, 47(11), 1992–2001. https://doi.org/10.1038/s41386-022-01407-0

    1. Author Response

      On behalf of my co-authors, I thank you very much for sending our manuscript (# eLifeRP-RA-2023-91223) entitled “Elimination of subtelomeric repeat sequences exerts little effect on telomere functions in Saccharomyces cerevisiae” for review and providing us an opportunity for revision. We also thank the reviewers for their critical and constructive comments and suggestions which have helped us to strengthen our study. We have performed more experiments to address the concerns the reviewers raised, and we have also revised or corrected some of our statements as the reviewers suggested.

      Reviewer #1

      1) The author’s data indicate that cells with many chromosomes are more dependent on possibly homologous recombination than SY12 cells with three chromosomes. Telomerase-deficient cells exhibit the type I and type II telomere structures, whereas telomerase-deficient SY12 cells often generate different telomere structures (named Type X survivors or atypical survivors). Type I survivor depends on Rad51 possessing tandem Y' elements whereas Type II survivor depends on Rad59 carrying long TG sequences (line 60-70). Both types require Rad52 (line 66-70). At the moment, it is not determined how Type X or atypical survivors are generated in telomerase-deficient SY12 cells.

      The authors need to determine whether Type X or atypical survivors depend on other repair pathways from Type I and Type II, and what DNA sequences are retained adjacent to telomeres in Type X or atypical survivors by sequencing analysis (Fig. 2).

      We thank the reviewer’s valuable comments and suggestions. Atypical survivor is a subtype of survivor that exhibits non-uniform telomere patterns, distinct from those observed in Type I, Type II, Type X, or circular survivors. To further determine its genetic requirements, we deleted RAD52 in SY12 tlc1Δ, SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ, and SY12XYΔ+Y tlc1Δ strains. Southern blotting results showed that neither Type I nor Type II survivors were found in the series of strains; circular survivor was in the predomination; beside circular survivor, some survivors exhibiting non-uniform telomere patterns suggested they were atypical survivor. These results have been presented as Figure 2—figure supplement 6B, Figure 5—figure supplement 2B and Figure 6—figure supplement 4B in the revised version. The results showed that atypical survivors still emerged when Rad52 pathway was repressed, indicating that the formation of atypical survivors does not strictly rely on the homologous recombination.

      Given that "atypical" clones exhibit non-uniform telomere patterns, it’s not surprising that their chromosome structures are variable and tanglesome. Consequently, it is hard for us to amplify and sequence the DNA sequences retained adjacent to telomeres.

      Since no Type X survivor was detected in SY12 tlc1Δ rad52Δ strain (Author response image 1A), we deleted RAD50 or RAD51 in SY12 tlc1Δ strain to investigate on which pathway the formation of the Type X survivor relied. Results showed that Type X survivor emerged in the absence of Rad51 but not Rad50, suggesting that the formation of Type X survivor depended on Rad50 pathway. These results have been presented as Figure 2—figure supplement 7.

      To determine the chromosomal end structure of the Type X survivor, we randomly selected a typical Type X survivor, and performed PCR-sequencing analysis. The results revealed the intact chromosome ends for I-L, X-R, XIII-L, XI-R, and XIV-R, albeit with some mismatches compared with the S. cerevisiae S288C genome, which possibly arising from recombination events that occurred during survivor formation. Notably, the sequence of the Y’-element in XVI-L could not be detected, while the X-element remained intact. Figure 2—figure supplement 5 in the revised manuscript.

      2) Survivor generation of each type (Type I, Type II, Type X or atypical and circularization) needs to be accurately quantitated. The authors concluded that X or Y' elements are not strictly necessary for survivor formation (Fig. 5 and Fig. 6). However, their removal appears to increase atypical survivor and chromosome circularization (Fig. 2 vs Fig. 5 and 6).

      We are grateful for the reviewer’s critical and constructive suggestions. According to the reviewer’s requirement, we quantified each type of survivors in SY12 tlc1Δ, SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ and SY12XYΔ+Y tlc1Δ strains (Figure 2D, 5C, 6A and 6B). In SY12 tlc1Δ strain, Type I survivors accounted for 16%, Type II survivors for 2%, Type X survivors for 24%, circular survivors for 20% and atypical survivors for 38%. In SY12YΔ tlc1Δ strain, 4% were Type II survivors, 52% were circular survivors and 44% were atypical survivors.

      For the SY12XYΔ tlc1Δ strain, 8% were Type II survivors, 48% were circular survivors and 44% were atypical survivors. In SY12XYΔ+Y tlc1Δ strain, the proportions of Type II, circular and atypical survivors were 14%, 44%, and 42%, respectively (Author response image 1).

      In comparing SY12YΔ with SY12XYΔ, we observed a similar ratio of circular and atypical survivors. This result indicates that the remove of X-elements exert little effect on the formation of circular and atypical survivors. Similarly, in SY12XYΔ+Y strain, the proportions of circular and atypical survivors were comparable to those in SY12XYΔ strain, indicating that Y’-elements also have little effect on the formation of circular and atypical survivors. However, due to the unknown frequency of survivor formation, alternative explanations of these data are possible. For example, subtelomeric elements previously suggested to have no impact on the formation of any survivor types might influence every type to similar extents, leading to similar ratios across all survivor types. With our present data, it is still unclear whether the absence of X and Y'-elements enhances the formation of circular and atypical survivors. Therefore, we did not present these results in the revised manuscript.

      Author response image 1.

      Quantitation of each survivor type in SY12 subtelomerice engineered strains. The ratio of survivor types in SY12 tlc1Δ, SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ and SY12XYΔ+Y tlc1Δ strains. Type I, pulper; Type II, green; Type X, gray; atypical survivor, orange; circular survivor, blue.

      3)The authors asked whether X and Y' elements are required for cell proliferation, stress response, telomere length control and telomere silencing (Fig. 4). Similar studies have been previously carried out by using synthetic chromosomes (see PMID: 28300123). The authors need to discuss this point.

      Thanks for your suggestion, we have added the information in the revised version. (p.24 line 449-453)

      4) The Fig. 7 data support that circular chromosomes do not require Ku-dependent DNA end protection. This is consistent with the current view that Ku binds and protects DNA ends. This finding by itself does not contribute significantly to our understanding of telomere maintenance. The authors need to more extensively discuss the significance of their findings in SY12 cells compared to wild-type cells with 16 chromosomes.

      We agree with the logic that this reviewer has pointed out. Our results demonstrate that combinatorial deletion of YKU70 and TLC1 caused synthetic lethality in SY12 cells, which possess three linear chromosomes, However, it did not affect the viability of "circular survivors", supporting the notion that telomere deprotection leads to the synthetic lethality in yku70Δ tlc1Δ double mutants. Nevertheless, this conclusion merely confirms the current view observed in wild-type cells that Ku binds and protects DNA ends.

      To avoid confusing readers and maintain the logical flow of the manuscript, we have deleted this section in the revised version.

      Minor issues:

      1) Line 112-113: " for SY13, which contains two chromosomes, could also have a high probability of circularizing all chromosomes for survival": The reference or the supplemental data are required.

      Thank this reviewer for the suggestion. According to the reviewer’s comments, we performed a Southern blotting assay to examine the types of survivors in SY13 tlc1Δ strain. We found that the majority of SY13 tlc1Δ clones exhibited hybridization signal similar to SY14 tlc1Δ circular survivors, pointing to the possibility that two chromosomes in these survivors may undergo intra-chromosomal fusions. This result has been added to figure 1D in the revised version.

      2) Line 349-350: The BY4742 mre11Δ haploid strain serves as a negative control. The authors need to explain why mre11 cells serve as a negative control.

      Thank this reviewer for the comment. We employed mre11Δ as negative control because Mre11 is a member of the RAD52 epistasis group, which is involved in the repair of double-stranded breaks in DNA, and mutants in MRE11 exhibit defects in the repair of DNA damages caused by DNA damage drugs (Krogh and Symington, 2004; Lewis et al., 2004; Symington, 2002). (p.23 line 420-422)

      Reviewer #2

      1) The qualification of survivor types mostly relies on molecular patterns in Southern blots. While this is a valid method for a standard strain, it might be more difficult to apply to the strains used in this study. For example, in SY8, SY11 and SY12, the telomere signal at 1-1.2 kb can be very faint due to the small number of terminal Y' elements left. As another example, for the Y'-less strain, it might seem obvious that no Type I survivor can emerge given that Y' amplification is a signature of Type I, but maybe Type-I-specific molecular mechanisms might still be used. To reinforce the characterization of survivor types, an analysis of the genetic requirements for Type I and Type II survivors (e.g. RAD51, RAD54, RAD59, RAD50) could complement the molecular characterization in specific result sections.

      We thank this reviewer for his/her constructive comments and suggestions. To investigate whether Type-I-specific molecular mechanisms are still utilized in the survivor formation in Y'-less strain, we deleted RAD51 in SY12XYΔ tlc1Δ. SY12XYΔ tlc1Δ rad51Δ strain was able to generate three types of survivors, including Type II survivor, circular survivor and atypical survivor, similar to the observations in SY12XYΔ tlc1Δ strain. However, the ratios of circular and atypical survivors were 36% and 32%, respectively, lower than the 48% and 44% observed in SY12XYΔ tlc1Δ strain (supplementary file 5). This result indicates that Type-I-specific molecular mechanisms contribute to the survivor formation. Given that our work primarily focuses on the function of subtelomeric elements, we chose not to include this result in our revised manuscript to maintain a coherent logical flow.

      To reinforce the characterization of survivor types, we deleted RAD50, RAD51 and RAD52 in SY12 tlc1Δ strain, respectively. Southern blotting assay revealed that in the absence of Rad51, no Type I survivor was detected; in the absence of Rad50, neither Type I nor Type X survivor was detected. However, circular and atypical survivors still emerged in the absence of Rad52, suggesting that the RAD52-mediated homologous recombination is not strictly necessary for the formation of circular and atypical survivors. These results have been presented as Figure 2—figure supplement 6 and Figure 2— figure supplement 7.

      2) In the title, the abstract and throughout the discussion, the authors chose to focus on the effect of X- and Y'-element deletion on different phenotypes and on survivor formation, as the main message to convey. While it is a legitimate and interesting message, other important results of this work might benefit from more spotlight. Namely, the observation that strains with different chromosome numbers show different survivor patterns and that several survival strategies beyond Type I and II exist and can reach substantial frequencies depending on the chromosomal context.

      Thanks for your valuable suggestion. While we value your suggestion to highlight additional aspects of our work, we would like to express our perspective on the current emphasis on the effect of X- and Y'-element deletion. We believe that by maintaining this focus, we can present a more coherent and impactful narrative for our readers. Additionally, we recognize that the relationship between chromosome numbers and survivor type frequencies is complex and warrants further experimental validation. We are considering exploring this aspect in more detail in our future projects. However, we fully acknowledge the importance of the observations you raised concerning strains with different chromosome numbers and the diversity of survival strategies.

      3) In SY12 strain, while X- and Y'-elements are not essential for survivor emergence, they do modulate the frequency of each type of survivors, with more chromosome circularization events observed for SY12YΔ, SY12XYΔ and SY12XYΔ+Y strains. This result should be stated and discussed, maybe alongside the change in survivor patterns in the other SY strains, to more accurately assess the roles of these subtelomeric elements.

      Following the reviewer’s suggestion, we compared the circular survivor ratios in SY12 tlc1Δ, SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ and SY12XYΔ+Y tlc1Δ strains (supplementary file 5). It appears that the formation of circular survivors is less efficient in the SY12 tlc1Δ, with a ratio of 20%, much lower than that in SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ or SY12XYΔ+Y tlc1Δ strains. However, it should be noted that SY12 tlc1Δ can generate Type I and Type X survivors, potentially decreasing the ratio of circular survivors.

      Therefore, we further compared the circular survivor ratios in SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ and SY12XYΔ+Y tlc1Δ strains. In the SY12YΔ tlc1Δ strain, circular survivors accounted for 52% (26/50), comparable to 48% (24/50) in the SY12XYΔ tlc1Δ strain, indicating that X- elements exert little effect on the formation of circular survivor. Additionally, the ratio of circular survivors was 44% (22/50) in SY12XYΔ+Y tlc1Δ strain, also comparable to 48% (24/50) in the SY12XYΔ tlc1Δ strain, suggesting that Y’-element also has little effect on chromosome circularization. However, due to the unknown frequency of survivor formation, alternative explanations of these data are possible. For example, subtelomeric elements previously suggested to have no impact on the formation of any survivor types might influence every type to similar extents, resulting in similar ratios across all survivor types. With our current data, it is still uncertain whether X and Y'-elements modulate the frequency of each type of survivors. Therefore, we did not include these results in the revised manuscript.

      4) The authors might want to update some general information about subtelomere structure and their diversity across yeast strain with the recent paper by O'Donnell et al. 2023 Nature Genetics, "Telomere-to-telomere assemblies of 142 strains characterize the genome structural landscape in Saccharomyces cerevisiae".

      Thanks for your advice. We have added this information in the revised manuscript. (p.3 line 51-54)

      5) Although it is cited in the discussion, the recent work by the Malkova lab (Kockler et al. 2021 Mol Cell) could be mentioned in the introduction as it conceptually changes our views on survivor formation, its dynamics and the categorization into Type I and Type II.

      Thanks for your advice. We have added this information in the revised manuscript. (p.5 line 75-78)

      6) p.7 line 128-130: rather than chromosome number, the ratio of survivor types might be controlled by the fraction of subtelomeres with Y'-elements and their relative configuration across chromosomes. A map of the structure of remaining subtelomeres in the SYn strains might be good to have.

      We have added this information in supplementary file 2 in the revised manuscript.

      7) Fig. 1C: in SY9 tlc1Δ, the lane with triangle mark looks like a type II.

      The hybridization pattern of SY9 tlc1Δ clone 2 has both amplified Y’L-element and long heterogeneous TG1-3 repeats, it might be the “hybrid” survivor mentioned by Kockler’s work (Kockler et al., 2021). Therefore, we classify it as a no-classical survivor.

      8) p.9 line 149: the title of this result section "Y'-element is not essential for the viability of cells carrying linear chromosomes" doesn't reflect well the content of the section, which is more about characterizing the survivor pattern in SY12.

      Thanks for your advice. We have changed the title of this section into “Characterizing the survivor pattern in SY12” in the revised manuscript. (p.9 line 155)

      9) p.10 line 167: that type I can emerge in SY12 indicates that multiple Y'-elements in tandem are not required for type I recombination. I am not sure if this was already known, but it could be noted.

      We appreciate the reviewer’s comment. We have added this information in the revised manuscript. (p.10 175-177)

      10) p.18 line 318-320: the deletion of the Y' element also seems to remove the centromere-proximal telomere sequence adjacent to it. Maybe it should be stated as well. Even more importantly, in lines 327-329, the Y'-element that is reintroduced in the strain does not include the centromere-proximal short telomere sequence. This is important to interpret the Southern blots.

      We thank the reviewer for this critical suggestion. The deletion of Y'-element including both Y’- and X- element sequence in XVI-L (supplementary file 4), and the Y’element in the XVI-L does not contain the centromere-proximal telomere sequence. The Y'-element reintroduced into the left arm of Chr 3 in SY12XYΔ strain was cloned from native left arm of XVI in SY12 strain which does not contain the centromere-proximal short telomere sequence. Besides listing these details in supplementary file 4, we also emphasize it in the revised manuscript (p.21 line 397-398).

      11) p.29 lines 496-497: it seems that X and Y'-elements tend to inhibit formation of circular survivors either directly (by participating in end protection), or by promoting type I and type II, thus reducing the fraction of circular survivors. Maybe this could be added to the conclusion of this section.

      We thank the reviewer for his/her comments and have analyzed survivor types in SY12 tlc1Δ, SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ and SY12XYΔ+Y tlc1Δ strains (supplementary file 5). Circular survivor formation appears less efficient in the SY12 tlc1Δ, with a ratio of 20%, significantly lower than SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ or SY12XYΔ+Y tlc1Δ strains. However, it is noteworthy that SY12 tlc1Δ can generate Type I and Type X survivors, potentially impacting the circular survivor ratio.

      We further compared circular survivor ratios in SY12YΔ tlc1Δ, SY12XYΔ tlc1Δ and SY12XYΔ+Y tlc1Δ strains. SY12YΔ tlc1Δ had 52% circular survivors, similar to SY12XYΔ tlc1Δ with 48%, indicating minimal impact of X- elements. Additionally, SY12XYΔ+Y tlc1Δ had 44% circular survivors, also similar to SY12XYΔ tlc1Δ, suggesting that Y’-element has little effect on chromosome circularization. However, due to unknown frequency of survivor formation, alternative explanations, like subtelomeric elements affecting all the type of survivor similarly, are possible. With our current data, it remains unclear whether X and Y'-elements are involved in end protection and consequently inhibit the formation of circular survivors.

      Therefore, these results were not included in the revised manuscript.

      12) p.32 line 533: this result section doesn't really fit with the rest of the paper, does it?

      Thanks for your valuable advice. To avoid confusing readers and to keep the fluency of logic flow of the manuscript we have deleted this section in the revised version.

      13) The methods section does not describe the experiments sufficiently and it often lacks specific details such as the manufacturer or references. Some sections of the methods are more exhaustive than others. They should all be written with the same level of detail in my opinion.

      Thanks for your advice. We have described the experiments more sufficiently and added the manufacturer or references in the ‘materials and methods’ part in the revised manuscript. (p.41 line741-745, p.42 line 755-756, p.42 line 762-770, p.43 line 788 and p.45 line 812-813)

      Minor comments, typos and grammatical errors:

      p.3 line 33: "INTROUDUCTION" should be "INTRODUCTION".

      We have corrected it in the revised manuscript. (p.3 line 33) p.4 line 54: "S, cerevisiae", use dot instead of comma. R15: We have corrected it in the revised manuscript. (p.4 line 57)

      p.4 line 55: I believe TLC1 as the RNA moiety should be in (non-italicized) capital letters and not written as a protein.

      We have corrected it in the revised manuscript. (p.4 line 58)

      p.7 line 115: please indicate that pRS316 uses URA3 as a marker, otherwise the counterselection with 5'-FOA is not obvious.

      Thank this reviewer for the comment. We have added this statement in the revised manuscript. (p.7 line 121-122)

      p.12 line 206: tlc1Δ should be in italic.

      We have corrected it in the revised manuscript. (p.10 line 184)

      p.13 lines 227-229: "where only one hybridization signal", a verb seems to be missing.

      We thank the reviewer’s kind reminder and have corrected the mentioned errors in the revised manuscript. (p.14 line 254-255)

      Reviewer #3

      1) A weakness of the manuscript is the analysis of telomere transcriptional silencing. They state: "The results demonstrated a significant increase in the expression of the MPH3 and HSP32 upon Sir2 deletion, indicating that telomere silencing remains effective in the absence of X and Y'-elements". However, there are no statistical analyses performed as far as I can see. For some of the strains, the significance of the increased expression in sir2 (especially for MPH3) looks questionable. In addition, a striking observation is that the SY12 strain (with only three chromosomes) express much less of both MPH3 and HSP32 than the parental strain BY4742 (16 chromosomes), both in the presence and absence of Sir2. In fact, the expression of both MPH3 and HSP32 in the SY12 sir2 strain is lower than in the BY4742 SIR2+ strain. In addition, relating this work to previous studies of subtelomeric sequences in other organisms would make the discussion more interesting.

      First, I enjoyed reading your manuscript. It would be great if you performed the statistical analysis on the RT-qPCR data in figure 4B and addressed the issue of the difference of the BY4742 and SY12 strains. A model could be that this is a titration effect of silencing proteins due to fewer telomeres, which could be investigated by performing the analyses on more SY-strains with variable numbers of telomeres.

      We highly appreciate the reviewer’s valuable comments and suggestions, which included a point that has also left us confused. We conducted statistical analyses on the RT-qPCR data, and the t-test result revealed that upon the deletion of Sir2, SY12YΔ, SY12XYΔ and SY12XYΔ+Y strains exhibited a significant increase in MPH3 expression (located on the right arm of chr X) with a P value < 0.05. In the case of SY12, the deletion of Sir2 resulted in an increase in gene expression (P value < 0.1). Similar tendencies were observed in the BY4742 strain. The statistical analyses of RTqPCR results on XVI-L mirrored those of X-R.

      The results demonstrated a significant increase in MPH3 and HSP32 expression upon SIR2 deletion in SY12YΔ, SY12XYΔ and SY12XYΔ+Y strains, leading to the conclusion that telomere silencing remains effective in the absence of X-and Y’-elements. However, as the reviewer has pointed out, no statistically significant differences in MPH3 and HSP32 expression were observed between the SY12 and SY12 sir2Δ strain. For HSP32, this lack of significance may be attributed to the greater distance between HSP32 and telomere XVI-L in SY12 compared to SY12YΔ, SY12XYΔ or SY12XYΔ+Y strains, resulting in a weaker telomere position effect on HSP32 and a non-significant increase in gene expression in SY12. However, this explanation does not apply to MPH3, as SY12YΔ, with a same distance between MPH3 and telomere X-R as in SY12, still exhibits an effective telomere position effect on MPH3. We cannot provide a compelling explanation at this moment, and we suspect that the lack of statistically significant differences may be due to random clonal variation.

      Additionally, the SY12 strain (with three chromosomes) exhibited lower expression levels of both MPH3 and HSP32 compared to the parental strain BY4742 (with 16 chromosomes). Notably, it has been reported that the expression of genes coding silencing proteins in SY14 (with one chromosomes) were nearly identical to that of BY4742 (with 16 chromosomes)(Shao et al., 2018). Consequently, with respect to the reduced chromosome numbers, the silencing proteins appeared to be relatively overexpressed. Therefore, as pointed out by the reviewer, this observed phenomenon may be attributed to a titration effect of silencing proteins due to fewer telomeres. We have added the statistical analyses result in Figure 4B.

      We have related our work with previous studies of subtelomeric sequences in fission yeast in the discussion part. (p.37 line 655-676)

      Minor points are to correct the figure legend for Figure 6 supplement 1 (the strain designations) and line 55, RNAs are written with all caps, i.e. TLC1, and line 537 delete the "which" in the sentence.

      Thanks for your advice. We have corrected them in the revised manuscript.

      1) The strain has been replaced with SY12XYΔ+Y (p.35 line 617, 618 and 620)

      2) “Tlc1” has been replaced with “TLC1” (p.4 line 58).

      3) We have deleted the section of “Circular chromosome maintain stable when double knockout of yku70 and tlc1” according to the suggestions raised by reviewer 1 and 2, the deleted section contain the sentence in line 537 you mentioned.

      Kockler, Z.W., Comeron, J.M., and Malkova, A. (2021). A unified alternative telomerelengthening pathway in yeast survivor cells. Molecular Cell 81, 1816-1829.e1815. Krogh, B.O., and Symington, L.S. (2004). Recombination proteins in yeast. Annu Rev Genet 38, 233-271.

      Lewis, L.K., Storici, F., Van Komen, S., Calero, S., Sung, P., and Resnick, M.A. (2004). Role of the nuclease activity of Saccharomyces cerevisiae Mre11 in repair of DNA double-strand breaks in mitotic cells. Genetics 166, 1701-1713.

      Shao, Y., Lu, N., Wu, Z., Cai, C., Wang, S., Zhang, L.L., Zhou, F., Xiao, S., Liu, L., Zeng, X., et al. (2018). Creating a functional single-chromosome yeast. Nature 560, 331-335. Symington, L.S. (2002). Role of RAD52 epistasis group genes in homologous recombination and double-strand break repair. Microbiol Mol Biol Rev 66, 630-670, table of contents.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      The chosen classification scheme for aGPCRs may require reassessment and amendment by the authors in order to prevent confusion with previously issued classification attempts of this family. (…) Can the authors suggest another scheme (mind to avoid the subfamily IIX or the alternative ADGRA-G,L,V subfamily schemes of metazoan aGPCRs), and adapt their numbering throughout the text and all figures/supplementary figures/supplementary files?

      We appreciate the reviewer's comment and agree that a different nomenclature should be used for choanoflagellate aGPCRs to avoid possible confusion. We have now re-labeled the choanoflagellate aGPCR subfamilies, previously numbered from I to XIX, using alphabetical enumeration (from A to S). Changes have been made throughout the main text, in Figure 5, and in Supplementary Figures S6 and S7.

      line 10: The abbreviation 'GPCR-TKL/Ks' is not explained.

      Thank you for pointing this out. We have now revised the text to explain the abbreviation:

      “Adhesion GPCRs and a class of GPCRs fused to kinases (the GPCR-TKL/Ks) are the most abundant GPCRs in choanoflagellates.”

      line 30: "7TM domain is diagnostic for GPCRs": strange wording. Use an alternative expression.

      We changed the wording to: 

      “A conserved seven transmembrane (7TM) domain is a hallmark of GPCRs, while the wide spectrum of extracellular and intracellular domains in some GPCRs reflects the diversification of the gene family and its functions (Schiöth and Lagerström 2008).”

      line 33: In the case of rhodopsins, not the GPCR (i.e., the apoprotein) responds directly to photons, but the retinal, which isomerises upon illumination.

      We thank the reviewer for bringing this to our attention, and we have now removed mention of photons from the list of cues detected by GPCRs.

      “For example, the extracellular N-terminus and the three extracellular loops of the 7TM domain respond to a wide range of cues, including odorant molecules, peptides, amines, lipids, nucleotides, and other molecules (Yang et al. 2021).”

      line 111: What are "genome-enabled choanoflagellates"? Explain the term. As it stands, it doesn't make sense to me.

      We meant only to highlight that these two species have sequenced genomes. We have deleted the phrase “genome enabled.”

      “To assess the predictive power of our protein-detection pipeline, we then compared the new GPCR and cytosolic signaling component datasets from two choanoflagellates – Salpingoeca rosetta and Monosiga brevicollis – with previously published GPCR and downstream GPCR signaling component counts for these two species (Nordström et al. 2009a; Krishnan et al. 2012; De Mendoza et al. 2014; Krishnan et al. 2015; Lokits et al. 2018).”

      line 145: Please give a reasoning for the naming of each of the new families (e.g., RemiSens, Hidden Gold, GPCR-TLK/K, etc.) or at least the explanations of the acronyms/names early in the manuscript, even if they are discussed later in more detail.

      Thank you for identifying this as an area of confusion. While we feel that going into the rationale behind each of the names here would interrupt the flow of the manuscript, we have added a phrase encouraging readers to “hold that thought” with the hope that they can wait for the sections that specifically focus on each of these new GPCR families.

      “This left twelve new GPCR families that had not, to our knowledge, been previously detected in choanoflagellates: Rhodopsin, TMEM145, GPR180, TMEM87, GPR155, GPR157, and six additional GPCR families that appear to fall outside all previously characterized GPCR families in eukaryotes. For reasons that will be discussed further below, we have named these six new GPCR families “Rémi-Sans-Famille” (RSF), “Hidden Gold” (Hi-GOLD), GPCR-TKL/K, GPRch1, GPRch2, and GPRch3. (Fig. 1B; Table 1).”

      lines 297/298 and 2049: Rename tethered agonist "peptide" to "element". Synthetic peptides resembling the TA were used in experiments to test for the sufficiency of the TA for receptor activation, but because the naturally occurring TAs are part of the receptor protein, they are not peptides.

      Thank you for pointing this out. We have revised the text as suggested.

      line 2026: I think the letters in the acronym "CMR" are mixed up and were intended to read "CRM".

      Good catch! We have corrected the text.

      line 2048: "diagnostic" again. Change to "tell-tale", "hallmark", or another similar descriptor.

      We have corrected the text accordingly.

      2058: Strike "motif" in order to avoid confusion with the now obsolete term "GPS motif", which entailed the five most C-terminal β-strands of GAIN subdomain B (not thus neither the full GAIN domain nor the GPS).

      Thank you for pointing this out. We have corrected the text.

      Figure 5: Did the authors also find homologs placed in the aGPCR family based on their 7TM domain sequence but lacking a GAIN domain similar to vertebrate ADGRA/GPR123, the only aGPCR known to lack a GAIN domain (10.1016/j.tips.2013.06.002)? Irrespective of the authors' findings or non-finding on that matter, please insert a note on this in the results text.

      We thank the reviewer for bringing this interesting point to our attention. We have now added a new supplementary figure A in Fig. S9 to answer the reviewer's comment. We also modified the legend of Fig. S9  to take into account this change and uploaded a new supplementary data file 20 to support Fig. S9A. Finally, we revised the main text under the section “Adhesion GPCRs” as requested: 

      Lines 328-331: “ While the GAIN and aGPCR 7TM domains evolved before the origin of opisthokonts (Araç et al.2012; Krishnan et al. 2012; De Mendoza et al. 2014), we detected the fusion of these two domains into a single module (GAIN/7TM) in most, but not all, holozoan aGPCRs (Fig. 5D, Fig.S7B and S9A; Supplementary file 20; Prömel et al, 2013; Krishnan et al. 2014).

      Reviewer #2:

      While the study contributes several interesting observations, it does not radically revise the evolutionary history of the GPCR family. However, in an era increasingly concerned with the reproducibility of scientific findings, this is arguably a strength rather than a weakness. It is encouraging to see that previously established patterns largely hold, and that with expanded sampling and improved methods, new insights can be gained, especially at the level of specific GPCR subfamilies. Then, no functional follow-ups are provided in the model system Salpingoeca rosetta, but I am sure functional work on GPCRs in choanoflagellates is set to reveal very interesting molecular adaptations in the future.

      We agree with the reviewer and anticipate that this work will provide a useful resource to motivate the future functional characterization of GPCRs in choanoflagellates, other CRMs, as well as in metazoans.

      The GPCR-TKL fusion is a particularly interesting finding, especially given the presence of such sequences in sponges. This could potentially represent a synapomorphy shared between sponges and choanoflagellates, later lost in other animals. The authors mention that BLASTP searches using the kinase domain recover the sponge GPCR-TKLs, suggesting the fusion may be ancestral. It would be useful to include phylogenetic trees of both the GPCR and TKL domains to assess this possibility. The authors might also consider examining sponge genomes released by the DTOL project to increase representation from this group.

      We agree and thank the reviewer for this suggestion. We have now added the requested phylogenetic analyses to the new Figure S17, revised the supplementary files and Methods accordingly, and commented on these results in the main text under the section “GPCR-TKL/K and GPCR-TKs“.  

      Lines 579 – 589: “While no metazoan homologs were found when using the 7TM domain of choanoflagellate GPCR-TKs as queries, using the conserved tyrosine kinase domains as queries recovered GPCR-TKs in sponges but not in other metazoan lineages or other holozoans (Fig. S17E). To test whether GPCR-TKs in sponges and choanoflagellates are homologous, we performed phylogenetic analyses of their TK and 7TM domains (Fig. S17F and G). While the TK domains of GPCR-TKs from sponges and choanoflagellates formed a well-supported clade, their 7TM domains did not. These results point to a heterogeneous evolutionary history that may include domain swapping (i.e. ancestral GPCR-TKs in which the 7TM domain was replaced in either the sponge or choanoflagellate lineages) or convergent evolution, in which homologous 7TM domains fused with unrelated 7TM domains in the sponge and choanoflagellate lineages.”

      Added to the Method section “Sequence alignment and phylogenetic analyses”:

      Lines 913 – 933: “Phylogenetic analyses of holozoan aGPCRs, Glutamate Receptors, and Gα subunits, and the 7TM and Kinase domains from GPCR TK/TKL/Ks were performed in this study. (…) To construct the phylogenies of the Kinase domain and 7TM domain from the GPCR TK/TKL/Ks, we first built a dataset including all the GPCR TK/TKL/Ks sequences identified in choanoflagellates and in sponges, as well as the GPCR TKL/Ks previously published in oomycetes and amoebozoans (Van Den Hoogen et al. 2018). We extracted the 7TM domain and Kinase domain from each sequence by combining the transmembrane domain prediction tool TMHMM-2.0 and the protein domain prediction tool InterProScan with the alignment tool MAFFT (E-INS-I algorithm) on Geneious Prime v2024.07 (Supplementary Files 30 and 32). We then aligned the aGPCR, Glutamate and Glutamate GPCR TK/TKL/K Receptor 7TMs, the GPCR TK/TKL/Ks Kinase domain, or the full-length Gα sequences using MAFFT with the E-INS-I algorithm. The resulting alignments were then used for Maximum-likelihood and/or Bayesian inference of phylogenies (Fig. 3B, Fig. 5A, Fig. S3D, and Fig. S6A, and Fig. S17F and G; Supplementary Files 5, 9, 16,18, 31, and 33).”

      Rhodopsin-like receptors are proposed in the discussion to be potential cases of lateral gene transfer (LGT) between eukaryotes. To support or refute this hypothesis, it would be valuable to place the choanoflagellate and ichthyosporean Rhodopsins within a broader phylogeny of this family, including (a few) representatives from animals and other eukaryotes. Even if deep branching relationships remain unresolved, signs such as unusually short branches could point toward recent LGT events.

      Thank you for your suggestion. While we originally considered testing these alternative hypotheses in this manuscript by building a phylogeny, the rapid sequence evolution of the Rhodopsin family has stymied similar efforts in the past and instead motivated others to use clustering approaches like those used in our study (Hu et al. 2017; Thiel et al. 2023). Unfortunately, these types of analyses cannot be used to readily identify instances of LGT.

      Therefore, following the suggestion of the reviewer, we bit the bullet and performed phylogenetic analyses on the sequences in question. Unfortunately, these analyses were completely inconclusive, and we feel they do not warrant inclusion in the manuscript. The topologies of the sequence trees recovered were poorly supported and sensitive to most of the variables we tested – the set of rhodopsin sequences included, the multiple alignment algorithms used, and the probabilistic methods employed to infer the phylogenies. 

      Instead, we have revised the manuscript to highlight the challenge of differentiating between the different hypotheses that are consistent with the phylogenetic distribution of Rhodopsins:

      Lines 670 – 678: “Thus, while it is formally possible that Rhodopsins existed in stem choanoflagellates and were lost in most modern choanoflagellate lineages, either horizontal gene transfer or convergent evolution in the shared ancestor of S. macrocollata and S. punica are similarly plausible explanations for their presence in these species. Differentiating between these alternative evolutionary scenarios is challenging because of rapid rate of sequence evolution within the family and the resultant loss of phylogenetic signal. Our own preliminary investigations of Rhodopsin evolution in non-metazoans were inconclusive. Therefore, ambiguities about the provenance and function of CRM Rhodopsins currently obscure the ancestry of metazoan Rhodopsins and opsins.”

      While the study surveys most available holozoan genomes, it appears that the genomes of Amoebidium spp.-which are cited in the manuscript- were not included. It may not be necessary to repeat all analyses with these two species (A. appalachense and A. parasiticum), but a preliminary search indicates the presence of four candidate 7tm_1 (Rhodopsin-like) proteins in their proteomes. These may warrant closer inspection (e.g., via BLASTP against animal databases) to confirm whether they are genuine GPCRs or false positives.

      Author response image 1.

      We thank the reviewer for bringing these sequences to our attention. To be clear, we did not analyze the Amoebidium spp. genome and we can find no reference to it in our manuscript. If the reviewer had the impression that the genome was analyzed, we would be grateful to know the source of the confusion so that it can be corrected. (We did not intentionally exclude the genome; it simply was not available on the Multicell Genome database from which we retrieved the ichthyosporean genomes and transcriptomes used in this study.)

      Nevertheless, out of curiosity, we have now analyzed the sequences provided by the reviewer and summarize our findings here for the interest of the reviewer. Although the sequences were annotated as 7tm_1 (Rhodopsin-like) proteins in the original genome study, none of these sequences group with metazoan or choanoflagellate Rhodopsins in our clustering analysis; instead, we found that these putative GPCRs form a distinct cluster that only weakly resembles cAMP receptors, both on the basis of their sequence and predicted structures. 

      It is not surprising to find new GPCR clusters as new taxa are folded into the study, and these Amoebidium sequences do not add to our understanding of Rhodopsin evolution. Therefore, we have not added their analysis to the manuscript, but we hope the reviewer finds our quick analysis of interest.

      Author response image 2.

      In Figure 2, perhaps expanding the other holozoan clades would have been nice, as there are not too many species, but I understand if that's beyond the point of the manuscript, focused on choanoflagellates.

      Thank you for this comment. However, given the focus of this study, we feel that an expansion of the other holozoan clades would reduce the clarity of the figure.

      line 87 - "To this end, the 671 validated choanoflagellate GPCRs were sorted by sequence similarity, resulting in 18 clusters. "Some details in the results section would be nice, or at least clear references to where this is explained in more detail. How were the extra choanoflagellate GPCRs added if they failed to be identified with quite sensitive HMM profiles?

      We apologize for the possible confusion and thank the reviewer for the suggestion; we have now added specific references to the related sections from the material and methods for interested readers.

      We believe that the "extra choanoflagellate GPCRs" mentioned by the reviewer refer to the choanoflagellate GPCRs that failed to be detected when the choanoflagellate genomes and transcriptomes were searched with the predominantly metazoan-derived GPCRHMM and HMMs from the GPCR_A Pfam clan (CL0192). We were able to recover these extra choanoflagellate GPCRs by using custom choanoflagellate-specific GPCR HMMs and by blasting the choanoflagellate GPCRs previously identified as queries against the 23 choanoflagellate proteomes. We hope that the referencing of the Methods section "Recovering additional choanoflagellate GPCRs using choanoflagellate GPCR BLAST queries and custom choanoflagellate GPCR HMMs", in lines 91 and 93, will help clarify this point.

      line 108 - Well, from the figure it seems that most eukaryotes have an 'animal-like' G protein signalling, so that's perhaps more of an eukaryotic signature than something that puts choanoflagellates and animals together.

      Excellent point! We have revised the text.

      line 132 - It is unclear what the criteria are to include these taxa as helpers for choanoflagellate classification, and not adding the other unicellular holozoans. Just some text justification could help.

      Thank you for pointing this out. We have added an explanation of the rationale to the methods — section “Clustering of the 918 validated choanoflagellate GPCRs” — and referred to it in the main text.

      New text added to methods:

      “The non-choanoflagellate sequences added to the dataset were either top blast hits recovered after searching the entire Eukprot v3 dataset (993 species) with choanoflagellate GPCRs as queries, or previously published and well-documented GPCR sequences from metazoans.”

      line 145 - These families are listed, but perhaps it would be nice to explicitly mention that they will be covered in more detail later on in the manuscript. I found myself wondering about those exotic names, until I reached the sections in the manuscript where they are explained.

      Thank you for this suggestion. We have now modified our sentence to refer to the related sections.

      “For reasons that will be discussed further below, we have named these six new GPCR families “Rémi-Sans-Famille” (RSF), “Hidden Gold” (Hi-GOLD), GPCR-TKL/K, GPRch1, GPRch2, and GPRch3. (Fig. 1B; Table 1).”

      line 199 - perhaps would be nice to explain domain architecture of validated Dictyostelium GABA-like receptors (ANF domain?).

      Thank you for your suggestion. We have now modified the sentence to mention the protein domain composition of the validated GABA-like receptor, GrlE, in Dictyostelium.

      “The Glutamate Receptors from the amoebozan Dictyostelium discoideum, of which at least one, GrlE, binds both GABA and Glutamate presumably through its conserved ANF domain (Anjard and Loomis 2006; Taniura et al. 2006; Wu and Janetopoulos 2013), grouped separately from metazoan and CRM GPCRs in our analysis.”

      Figure S4 - Perhaps a stacked bar chart would be easier to browse than a bunch of pie charts, notoriously difficult to quantify.

      Thank you for this comment. Opinions differ on how best on whether pie charts or bar charts are more effective in this context (including between the authors of this manuscript). However, we think the point of Figure S4 a minor point, only to be appreciated by a tiny number of readers, and therefore have left the data presentation as it was in the original submission.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Li et al. investigate Ca2+ signaling in T. gondii and argue that Ca2+ tunnels through the ER to other organelles to fuel multiple aspects of T. gondii biology. They focus in particular on TgSERCA as the presumed primary mechanism for ER Ca2+ filling. Although, when TgSERCA was knocked out there was still a Ca2+ release in response to TG present.

      Note that we did not generate a complete SERCA knockout, as this gene is essential, and its complete loss would not permit the isolation of viable parasites. Instead, we created conditional mutants that downregulate the expression of SERCA. Importantly, some residual activity is present in the mutant after 24 h of ATc treatment as shown in Fig 4C. This is consistent with our Western blots, which demonstrate the presence of residual SERCA protein at 1, 1.5 and 2 days post ATc treatment (Fig. 3B). We have clarified this point in the revised manuscript (lines 232233). See also lines 97-102.

      Overall the Ca2+ signaling data do not support the conclusion of Ca2+ tunneling through the ER to other organelles in fact they argue for direct Ca2+ uptake from the cytosol. The authors show EM membrane contact sites between the ER and other organelles, so Ca2+ released by the ER could presumably be taken up by other organelles but that is not ER Ca2+ tunneling. They clearly show that SERCA is required for T. gondii function.

      Overall, the data presented to not fully support the conclusions reached

      We agree that the data does not support Ca<sup>2+</sup> tunneling as defined and characterized in mammalian cells. In response to this comment, we have modified the title and the text accordingly.

      However, we respectfully would like to emphasize that the study demonstrates more than just the role of SERCA in T. gondii “function”. Our findings reveal that the ER, through SERCA activity, sequesters calcium following influx through the PM (see reviewer 2 comment). The ER calcium pool is important for replenishing other intracellular compartments.

      The experiments support a model in which the ER actively takes up cytosolic Ca²⁺ as it enters the parasite and contributes to intracellular Ca²⁺ redistribution during transitions between distinct extracellular calcium environments. We believe that the role of the ER in modulating intracellular calcium dynamics is demonstrated in Figures 1H–K, 4G-H, and 5H–K. To highlight the relevance of these findings, we have included an expanded discussion in the revised manuscript. See lines 443-449 and 510-522.

      Data argue for direct Ca2+ uptake from the cytosol

      The ER most likely takes up calcium from the cytosol following its entry through the PM and redistributes it to the other organelles. We deleted any mention of the word “tunneling” and replaced it with transfer and re-distribution as they reflect our experimental findings more accurately.

      We interpret the experiments shown in Figure 1 H and I as re-distribution because the amount of calcium released after nigericin or GPN are greatly enhanced after TG addition. We first add calcium to allow intracellular stores to become filled, followed by the addition of TG, which allows calcium leakage from the ER. This leaked calcium can either enter the cytosol and be pumped out or be taken up by other organelles. Our interpretation is that this process leads to an increased calcium content in acidic compartments.

      We conducted an additional experiment in which SERCA was inhibited prior to calcium addition, allowing cytosolic calcium to be exported or taken up by acidic stores. We observed a change in the GPN response (Fig. S2A), possibly indicating that the PLVAC can sequester calcium when SERCA is inactive. While this may support the reviewer’s view, TG treatment does not reflect physiological conditions and may enhance calcium transfer to other compartments. Although the result is interesting, interpretation is complicated by the use of parasites in suspension and drug exposure in solution. Single-parasite measurements are not feasible due to weak signals, and adhered parasites are even less physiological than those in suspension.

      In support of our view, the experiments shown in Figs 4G and H show that down regulating SERCA reduces significantly the response to GPN indicating diminished acidic store loading. In Fig 5I we observe that mitochondrial calcium uptake is reduced in the iDSERCA (+ATc) mutant in response to GPN. Fig 2B demonstrates that TgSERCA can take up calcium at 55 nM, close to resting cytosolic calcium while in Figures 5E and S5B we show that the mitochondrion is not responsive to an increase of cytosolic calcium. Uptake by the mitochondria requires much higher concentrations (Fig 5B-C), which may be achieved within microdomains at MCS between the ER and mitochondrion. This is also consistent with findings reported by Li et al (Nat Commun. 2021) where similar microdomains mediated transfer of calcium to the apicoplast (Fig. 7 E and F of the mentioned reference) was observed.

      Reviewer 2 (Public review):

      The role of the endoplasmic reticulum (ER) calcium pump TgSERCA in sequestering and redistributing calcium to other intracellular organelles following influx at the plasma membrane.

      T. gondii transitions through life cycle stages within and exterior to the host cells, with very different exposures to calcium, adds significance to the current investigation of the role of the ER in redistributing calcium following exposure to physiological levels of extracellular calcium

      They also use a conditional knockout of TgSERCA to investigate its role in ER calcium store-filling and the ability of other subcellular organelles to sequester and release calcium. These knockout experiments provide important evidence that ER calcium uptake plays a significant role in maintaining the filling state of other intracellular compartments.

      We thank the reviewer.

      While it is clearly demonstrated, and not surprising, that the addition of 1.8 mM extracellular CaCl2 to intact T. gondii parasites preincubated with EGTA leads to an increase in cytosolic calcium and subsequent enhanced loading of the ER and other intracellular compartments, there is a caveat to the quantitation of these increases in calcium loading. The authors rely on the amplitude of cytosolic free calcium increases in response to thapsigargin, GPN, nigericin, and CCCP, all measured with fura2. This likely overestimates the changes in calcium pool sizes because the buffering of free calcium in the cytosol is nonlinear, and fura2 (with a Kd of 100-200 nM) is a substantial, if not predominant, cytosolic calcium buffer. Indeed, the increases in signal noise at higher cytosolic calcium levels (e.g. peak calcium in Figure 1C) are indicative of fura2 ratio calculations approaching saturation of the indicator dye.

      We acknowledge the limitations associated with using Fura-2 for cytosolic calcium measurements. However, according to the literature (Grynkiewicz, Get al. (1985). J. Biol. Chem. 260 (6): 3440–3450. PMID 3838314) Fura-2 is suited for measurements between 100 nM and 1 µM calcium. The responses in our experiments were within that range and the experiments with the SERCA mutant and mitochondrial GCaMPfs supports the conclusions of our work.

      However, we agree with the reviewer that the experiment shown in Fig 1C (now Fig 1D) presents a response that approaches the limit of the linear range of Fura-2. In response to this, we have replaced this panel with a more representative experiment that remains within the linear range of the indicator (revised Fig 1D). Additionally, we have included new experiments adding GPN along with corresponding quantifications, which further support our conclusions regarding calcium dynamics in the parasite.

      Another caveat, not addressed, is that loading of fura2/AM can result in compartmentalized fura2, which might modify free calcium levels and calcium storage capacity in intracellular organelles.

      We are aware of the potential issue of Fura-2 compartmentalization, and our protocol was designed to minimize this effect. We load cells with Fura-2 for 26 min at room temperature, then maintain them on ice, and restrict the use of loaded parasites to 2-3 hours. We have observed evidence of compartmentalization as this is reflected in increasing concentrations of resting calcium with time. We carry out experiments within a time frame in which the resting calcium stays within the 100 nM range. We have included a sentence in the Materials and Methods section. Lines 604-606.

      Additionally, following this reviewer’s suggestion, we performed further experiments to directly assess compartmentalization. See below the full response to reviewer 2.

      The finding that the SERCA inhibitor cyclopiazonic acid (CPA) only mobilizes a fraction of the thapsigargin-sensitive calcium stores in T. gondii coincides with previously published work in another apicomplexan parasite, P. falciparum, showing that thapsigargin mobilizes calcium from both CPA-sensitive and CPA-insensitive calcium pools (Borges-Pereira et al., 2020, DOI: 10.1074/jbc.RA120.014906). It would be valuable to determine whether this reflects the off-target effects of thapsigargin or the differential sensitivity of TgSERCA to the two inhibitors.

      This is an interesting observation, and we now include a discussion of this result considering the Plasmodium study and include the citation. Lines 436-442.

      Figure S1 suggests differential sensitivity, and it shows that thapsigargin mobilizes calcium from both CPA-sensitive and CPA-insensitive calcium pools in T. gondii. Also important is that we used 1 µM TG as we are aware that TG has shown off-target effects at higher concentrations. TG is a well-characterized, irreversible SERCA inhibitor that ensures complete and sustained inhibition of SERCA activity. In contrast, CPA is a reversible inhibitor whose effectiveness is influenced by ATP levels, and it may only partially inhibit SERCA or dissociate over time, allowing residual Ca²⁺ reuptake into the ER.

      Additionally, as suggested by the reviewer we performed experiments using the Mag-Fluo-4 protocol to compare the inhibitory effects of CPA and TG. These results are presented in Fig. S3 (Lines 217-223). Under the conditions of the Mag-Fluo-4 assay with digitonin-permeabilized cells, both TG and CPA showed similar rates of Ca<sup>2+</sup> leakage following the addition of the inhibitor. This may indicate that under the conditions of the Mag-Fluo-4 experiments the rate of Ca<sup>2+</sup> leak is mostly determined by the intrinsic leak mechanism and not by the nature of the inhibitor. By contrast, in intact Fura-2–loaded cells, CPA induces a smaller cytosolic Ca²⁺ increase than TG, consistent with less efficient SERCA inhibition likely due to its reversibility and possibly incomplete inhibition under cellular conditions.

      The authors interpret the residual calcium mobilization response to Zaprinast observed after ATc knockdown of TgSERCA (Figures 4E, 4F) as indicative of a target calcium pool in addition to the ER. While this may well be correct, it appears from the description of this experiment that it was carried out using the same conditions as Figure 4A where TgSERCA activity was only reduced by about 50%.

      We partially agree with the reviewer that 50% knockdown of TgSERCA means that the ER may still be targeted by zaprinast, and that there is no definitive evidence of the involvement of another calcium pool. The Mag-Fluo-4 experiment, while we acknowledge that the fluorescence of MagFluo-4 is not linear to calcium, indicates that SERCA activity is present even after 24 hr of ATc treatment. However, when Zaprinast is added after TG, we observed a significant calcium release in wild type cells. This result suggests the presence of another large calcium pool than the one mobilized by TG (PMID: 2693306).

      We recently published work describing the Golgi as a calcium store in Toxoplasma (PMID: 40043955) and we showed in Fig. S4 D-G of that work, that GPN treatment of tachyzoites loaded with Fura-2 diminished the Zaprinast response indicating that they could be impacting a similar store. In the present study we performed additional experiments in which TG was followed by GPN and Zaprinast showing a similar pattern. GPN significantly diminished the Zaprinast response. These results are shown now in Figure S2B. We address these possibilities in the discussion and interpretation of the result. Lines 451-460.

      The data in Figures 4A vs 4G and Figures 4B vs 4H indicate that the size of the response to GPN is similar to that with thapsigargin in both the presence and absence of extracellular calcium. This raises the question of whether GPN is only releasing calcium from acidic compartments or whether it acts on the ER calcium stores, as previously suggested by Atakpa et al. 2019 DOI: 10.1242/jcs.223883. Nonetheless, Figure 1H shows that there is a robust calcium response to GPN after the addition of thapsigargin.

      The results of the indicated experiments did not exclude the possibility that GPN can also mobilize some calcium from the ER besides acidic organelles. We don’t have any evidence to support that GPN can mobilize calcium from the ER either. Based on our unpublished work, we think GPN mainly release calcium from the PLVAC. We included the mentioned citation and discuss the result considering the possibility that GPN may be acting on more than one store. Lines 451-460.

      An important advance in the current work is the use of state-of-the-art approaches with targeted genetically encoded calcium indicators (GECIs) to monitor calcium in important subcellular compartments. The authors have previously done this with the apicoplast, but now add the mitochondria to their repertoire. Despite the absence of a canonical mitochondrial calcium uniporter (MCU) in the Toxoplasma genome, the authors demonstrate the ability of T. gondii mitochondrial to accumulate calcium, albeit at high calcium concentrations. Although the calcium concentrations here are higher than needed for mammalian mitochondrial calcium uptake, there too calcium uptake requires calcium levels higher than those typically attained in the bulk cytosolic compartment. And just like in mammalian mitochondria, the current work shows that ER calcium release can elicit mitochondrial calcium loading even when other sources of elevated cytosolic calcium are ineffective, suggesting a role for ER-mitochondrial membrane contact sites. With these new tools in hand, it will be of great value to elucidate the bioenergetics and transport pathways associated with mitochondrial calcium accumulation in T. gondii.

      We thank this reviewer praising our work. Studies of bioenergetics and transport pathways associated with mitochondrial calcium accumulation is part of our future plans mentioned in lines 520-522 and 545.

      The current studies of calcium pools and their interactions with the ER and dependence on SERCA activity in T. gondi are complemented by super-resolution microscopy and electron microscopy that do indeed demonstrate the presence of close appositions between the ER and other organelles (see also videos). Thus, the work presented provides good evidence for the ER acting as the orchestrating organelle delivering calcium to other subcellular compartments through contact sites in T. gondi, as has become increasingly clear from work in other organisms.

      Thank you

      Reviewer #3 (Public review):

      This manuscript describes an investigation of how intracellular calcium stores are regulated and provides evidence that is in line with the role of the SERCA-Ca2+ATPase in this important homeostasis pathway. Calcium uptake by mitochondria is further investigated and the authors suggest that ER-mitochondria membrane contact sites may be involved in mediating this, as demonstrated in other organisms.

      The significance of the findings is in shedding light on key elements within the mechanism of calcium storage and regulation/homeostasis in the medically important parasite Toxoplasma gondii whose ability to infect and cause disease critically relies on calcium signalling. An important strength is that despite its importance, calcium homeostasis in Toxoplasma is understudied and not well understood.

      We agree with the reviewer. Thank you

      A difficulty in the field, and a weakness of the work, is that following calcium in the cell is technically challenging and thus requires reliance on artificial conditions. In this context, the main weakness of the manuscript is the extrapolation of data. The language used could be more careful, especially considering that the way to measure the ER calcium is highly artificial - for example utilising permeabilization and over-loading the experiment with calcium. Measures are also indirect - for example, when the response to ionomycin treatment was not fully in line with the suggested model the authors hypothesise that the result is likely affected by other storage, but there is no direct support for that.

      The Mag-Fluo-4-based protocol for measuring intraluminal calcium is well established and has been extensively used in mammalian cells, DT40 cells and other cells for measuring intraluminal calcium, activity of SERCA and response to IP3 (Some examples: PMID: 32179239, PMID: 15963563, PMID: 19668195, PMID: 30185837, PMID: 19920131).

      Furthermore, we have successfully employed this protocol in previous work, including the characterization of the Trypanosoma brucei IP3R (PMID: 23319604) and the assessment of SERCA activity in Toxoplasma (PMID: 40043955 and 34608145). The citation PMID: 32179239 provides a detailed description of the protocol, including references to its prior use. In addition, the schematic at the top of Figure 2 summarizes the experimental workflow, reinforcing that the protocol follows established methodologies. We included more references and an expanded discussion, lines 425-435.

      We respectfully disagree with the concern regarding potential calcium overloading. The cells used in our assays were permeabilized, which is a critical step that allows to precisely control calcium concentrations. All experiments were conducted at 220 nM free calcium, a concentration within the physiological range of cytosolic calcium fluctuations. This concentration was consistently used across all studies described above. Importantly, permeabilization ensures that the dye present in the cytosol becomes diluted, and allows MgATP (which cannot cross intact membranes) to access the ER membrane, in addition to be able to expose the ER to precise calcium concentrations.

      The Mag-Fluo-4 loading conditions are designed to allow compartmentalization of the indicator to all intracellular compartments and the calcium uptake stimulated by MgATP exclusively occurs in the compartment occupied by SERCA as only SERCA is responsive to MgATP-dependent transport in this experimental setup

      Regarding the use of IO, we would like to clarify that its broad-spectrum activity is welldocumented. As a calcium ionophore, IO facilitates calcium release across multiple membranes, and not just the ER leading to a more substantial calcium release compared to the more selective effect of TG. The results observed with IO were consistent with this expected broader activity and support our interpretation.

      Lastly, we emphasize that the experiment in Figure 2 was designed specifically to assess SERCA activity in situ under defined conditions. It was not intended to provide a comprehensive characterization of the role of TgSERCA in the parasite. We now clarify this distinction in the revised Discussion lines 425-435.

      Below we provide some suggestions to improve controls, however, even with those included, we would still be in favour of revising the language and trying to avoid making strong and definitive conclusions. For example, in the discussion perhaps replace "showed" with "provide evidence that are consistent with..."; replace or remove words like "efficiently" and "impressive"; revise the definitive language used in the last few lines of the abstract (lines 13-17); etc. Importantly we recommend reconsidering whether the data is sufficiently direct and unambiguous to justify the model proposed in Figure 7 (we are in favour of removing this figure at this early point of our understanding of the calcium dynamic between organelles in Toxoplasma).

      We thank the reviewer for the suggestions and we modified the language as suggested. We limited the use of the word "showed" to references to previously published work. We deleted the other words

      Figure 7 is intended as a conceptual model to summarize our proposed pathways, and, like all models, it represents a working hypothesis that may not fully capture the complexity of calcium dynamics in the parasite. In light of the reviewer’s comments, we revised the figure and legend to clearly distinguish between pathways for which there is experimental evidence from those that are hypothetical.

      Another important weakness is poor referencing of previous work in the field. Lines 248250 read almost as if the authors originally hypothesised the idea that calcium is shuttled between ER and mitochondria via membrane contact sites (MCS) - but there is extensive literature on other eukaryotes which should be first cited and discussed in this context. Likewise, the discussion of MCS in Toxoplasma does not include the body of work already published on this parasite by several groups. It is informative to discuss observations in light of what is already known.

      The sentence in which we state the hypothesis about the calcium transfer refers specifically to Toxoplasma. To clarify this, we have now added the phrase “In mammalian cells” (Line 311) and included additional citations, as suggested by the reviewer. While only a few studies have described membrane contact sites (MCSs) in Toxoplasma, we do cite several pertinent articles (e.g., lines 479-486). We believe that we cited all articles mentioning MCS in T. gondii

      However, we must clarify to the reviewer that the primary focus of our study is not to characterize or confirm the presence of MCSs in T. gondii, but rather to demonstrate functional calcium transfer between the ER and mitochondria. Our data support the conclusion that this transfer requires close apposition of these organelles, consistent with the presence of MCSs.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Line 45: change influx to release as Ca2+ influx usually referred to Ca2+ entry from the extracellular space. Same for line 71.

      Corrected, line 47 and 73

      (2) Line 54: consider toning down the strong statement of 'widely' accepted as ER Ca2+ subdomain heterogeneity remains somewhat debated.

      Changed the sentence to “it has been proposed”, Line 56

      (3) Line 119-21: A lower release in response to TG is typical and does not reflect TG specific for SERCA. It is due to the slow kinetics of Ca2+ leak out of the ER allowing other buffering and transport mechanisms to act. Also, could be a reflection of the duration after TG treatment to allow complete store depletion. Figure S1A-B shows that there is still Ca2+ in the stores following TG but the TG signal does not go back to baseline arguing that the leak is still active. Hence the current data does not address the specificity of TG for TgSERCA. Please revise the statement accordingly.

      Thank for the suggestion, we changed the sentence to this: “This result could reflect the slow kinetics of Ca²⁺ leak from the ER, allowing other buffering and transport mechanisms to mitigate the phenomenon. Alternatively, it may indicate the duration after TG treatment allowing time to complete store depletion. As shown in Figure S1A-B, residual Ca²⁺ remains in the stores after TG treatment, and the TG-induced phenomenon does not return to baseline, suggesting that the leak remains active”. Lines 124-128

      (4) Figure 1C: the authors interpret the data 'This Ca2+ influx appeared to be immediately taken up by the ER as the response to TG was much greater in parasites previously exposed to extracellular Ca2+'. I don't understand this interpretation, in Ca2+-containing solution it would expected to have a larger signal as TG is likely to activate store-operated Ca2+ entry which would contribute to a larger cytosolic Ca2+ transient. Does T. gondii have SOCE? It cannot be uptake into the ER as SERCA is blocked. Unless the authors are arguing for another ER Ca2+ uptake pathway? But why are Ca2+ uptake in the ER would lower the signal whereas the data show an increased signal?

      We pre-incubated the suspension with calcium to allow filling of the stores, while SERCA is still active, and added thapsigargin (TG) at 400 seconds to measure calcium release. The experiment was designed to introduce the concept that the ER may have access to extracellular calcium, a phenomenon not yet clearly demonstrated in Toxoplasma. We did not expect to have less release by TG but if the ER is not efficient in filling after extracellular calcium entry it would be expected to have a similar response to TG. Yes, it is very possible that when we add TG we are also seeing more calcium entry through the PM as we previously proposed that the increased cytosolic Ca<sup>2+</sup> may regulate Ca<sup>2+</sup> entry. However, the evidence does not support that this increased entry would be triggered by store depletion. The experiments with the SERCA mutant (Fig. 4D) shows that in the conditional knockout mutant, the ER is partially depleted, yet this does not lead to enhanced calcium entry, suggesting that the depletion alone is not sufficient to trigger increased influx.

      There is no experimental evidence supporting the regulation of calcium entry by store depletion in Toxoplasma (PMID: 24867952). We revised the text to clarify this point and expanded the discussion on store-operated calcium entry (SOCE). While it is possible that a channel similar to Orai exists in Toxoplasma, it is highly unlikely to be regulated by store depletion, as there is no gene homologous to STIM. If store-regulated calcium entry does occur in Toxoplasma, it is likely mediated through a different, still unidentified, mechanism. Lines 461-467.

      (5) The choice of adding Ca2+ first followed by TG is curious as it is more difficult to interpret. Would be more informative to add TG, allow the leak to complete, and then add Ca2+ which would allow temporal separation between Ca2+ release from stores and Ca2+ influx from the extracellular space. Was this experiment done? If not would be useful to have the data.

      Yes, this experiment was already published: PMID: 24867952 and PMID: 38382669.

      It mainly highlighted that increased cytosolic calcium may regulate calcium entry most likely through a TRP channel. See our response to point 4 and the description of the new Fig. S2 in the response to point 7.

      (6) Line 136-39: these experiments as designed - partly because of the issues discussed above - do not address the ability of organelles to access extracellular Ca2+ or the state of refilling of intracellular Ca2+ stores. They can simply be interpreted as the different agents (TG, Nig, GPN, CCCP) inducing various levels of Ca2+ influx.

      Concerning TG, the experiment shown in Fig. 4D shows that depletion of the ER calcium does not result in stimulation of calcium entry, indicating the absence of classical SOCE activation in Toxoplasma.

      To our knowledge, neither mitochondria nor lysosomes (or other acidic compartments) are capable of triggering classical SOCE in mammalian cells.

      Given that the ER in Toxoplasma lacks the canonical components required to initiate SOCE, it is unclear why the mitochondria or acidic compartments would be able to do so. While it is possible that T. gondii utilizes an alternative mechanism for store-operated calcium entry, investigating such a pathway would require a comprehensive study. In mammalian systems, it took almost 15 years and the efforts of multiple research groups to identify the molecular components of SOCE. Expecting this complex question to be resolved within the scope of a single study is unrealistic.

      Our current data show that the mitochondrion is unable to access calcium from the cytosol, as shown in Figure 5E. Performing a similar experiment for the PLVAC would be ideal; however, expression of fluorescent calcium indicators in this organelle has not been successful. This is likely due to the presence of several proteases that degrade expressed proteins, as well as the acidic environment, which quenches fluorescence. These challenges have made studying calcium dynamics in the PLVAC particularly difficult.

      To address the reviewer’s comment, we performed an additional experiment presented in Fig. S2A. In this experiment, we first inhibited SERCA with thapsigargin (TG), preventing calcium uptake into the ER, and subsequently added calcium to the suspension. Under these conditions, calcium cannot be sequestered by the ER. We then applied GPN and quantified the response, comparing it to a similar experimental condition without TG. Indeed, under these conditions, we observed a significant but modest increase in the GPN-induced response, suggesting that the PLVAC may be capable of directly taking up calcium from the cytosol. However, this occurs under conditions of SERCA inhibition which creates nonphysiological conditions with elevated cytosolic calcium levels and the presence of TG may promote additional ER leakage, both of which could artificially enhance PLVAC uptake. Under physiological conditions, with functional SERCA activity, the ER would likely sequester cytosolic calcium more efficiently, thereby limiting calcium availability for PLVAC direct uptake. Thus, while the result is intriguing, it may not reflect calcium handling under normal cellular conditions. See lines 172-178.

      (7) Figure 1H-I: I disagree with the authors' interpretation of the results (lines 144-153). The data argue that by blocking ER Ca2+ uptake by TG, other organelles take up Ca2+ from the cytosol where it accumulates due to the leak and Ca2+ influx as is evident from the data allowing more release. The data does not argue for ER Ca2+ tunneling to other organelles. Tunneling would be reduced in the presence of TG (see PMID: 30046136, 24867608).

      We partially agree with this concern. In our experiments, TG was used to inhibit SERCA and block calcium uptake into the ER, allowing calcium to leak into the cytosol. We propose that this leaked calcium is subsequently taken up by other intracellular compartments. This effect is observed immediately upon TG addition. However, pre-incubation with TG or knockdown of SERCA reduces calcium storage in the ER, thereby diminishing the transfer of calcium to other stores.

      To further support our claim, we performed additional experiments in the absence of extracellular calcium, now presented in Figure 1J-K. We observed that calcium release triggered by GPN or nigericin was significantly enhanced when both agents were added after TG. These results suggest that calcium initially released from the ER can be sequestered by other compartments. As mentioned, we deleted any mention of “tunneling,” but we believe the data support the occurrence of calcium transfer. New results described in lines 166-171.

      The experiment in Fig S2A described in the response to (6) also addresses this concern. Under physiological conditions with functional SERCA, cytosolic calcium would likely be rapidly sequestered by the ER, limiting its availability to other compartments. See lines 172178.

      (8) Line 175: SERCA-dependent Ca2+ uptake is higher at 880 nM as would be expected yet the authors state that it's optimal at 220 nM Ca2+ ?

      Yes, it is true that the SERCA-dependent Ca<sup>2+</sup> uptake rate is higher at elevated Ca²⁺ concentrations. We chose to use 220 nM free calcium because of several reasons: 1) this concentration is close to physiological cytosolic levels fluctuations; 2) it is commonly used in studies of mammalian SERCA; and 3) calcium uptake is readily detectable at this level. While this may not represent the maximal activity conditions for SERCA, we believe it is a reasonable and physiologically relevant choice for assessing calcium transport activity SERCA-dependent. We added one sentence to the results explaining this reasoning (lines 204-207) and we deleted the word optimal.

      (9) Figure 3H: the saponin egress data support the conclusion that organelles Ca2+ take up cytosolic Ca2+ directly without the need for ER tunneling.

      The saponin concentration used permeabilizes the host cell membrane, allowing the intracellular tachyzoite to be surrounded with the added higher extracellular calcium concentration. The saponin concentration used does not affect the tachyzoite membrane as the parasite is still moving and calcium oscillations were clearly seen under similar conditions (PMID: 26374900 ). The resulting calcium increase in the tachyzoite cytosol is what stimulates parasite motility and egress. Since SERCA activity is reduced in the mutant, cytosolic calcium accumulates more rapidly, reaching the threshold for egress sooner and thereby accelerating parasite exit. The result does not support that the other stores contribute to this because of the Ionomycin response, which shows that egress is diminished in the mutant, likely because the calcium stores are depleted. We added an explanation in the results, lines 262-269 and the discussion, lines 532-539.

      (10) Figure S2: the HA and SERCA signals do not match perfectly? Could this reflect issues with HA tagging, potentially off-target effects? Was this tested?

      These are not off-target effects, as we did not observe them in the control cells lacking HA tagging. The HA signal also disappeared after treatment with ATc, further confirming that the IFA signal is specific. We agree with the reviewer that the signals do not align perfectly. This discrepancy could be due to differences in antibody accessibility or the fact that the two antibodies recognize different regions of the protein. We added a sentence about this in the result; lines 240-243.

      Reviewer #2 (Recommendations for the authors):

      The description of the data of Figures 1B and S1A starting on line 108 would be easier to follow if Figure S1A was actually incorporated into Figure 1. It is not clear why these two complementary experiments were separated since they are both equally important in understanding and interpreting the data.

      We re-arranged figure 1 and incorporated S1A now as Fig 1C.

      As noted in the public comments, loading of fura2/AM can result in compartmentalized fura2, which can contaminate the cytosolic calcium measurements and might modify free calcium levels and calcium storage capacity in intracellular organelles. This can be assessed using the digitonin permeabilization method used in the MagFluo4 measurements, but in this case, detecting the fura2 signal remaining after cell permeabilization.

      As suggested by the reviewer, we measured Fura-2 compartmentalization by permeabilizing cells with digitonin as we do for the Mag-Fluo-4 and the fluorescence was reduced almost completely and was unresponsive to any additions (see Author response image 1).

      Author response image 1.

      T. gondii tachyzoites in suspension exposed to Thapsigargin Calcium and GPN. The dashed lines shows and experiments using the same conditions but parasites were permeabilized with digitonin shows a similar experiment with parasites exposed to MgATP.to release the cytosolic Fura. Part B

      Following the public comment regarding the residual calcium mobilization response to Zaprinast observed after 24 h ATc knockdown of SERCA (Figsures 4E, 4F, as explained in the legend to Figure 4), was there still a response to Zaprinast after 48 h knockdown, where the thapsigargin response was apparently fully ablated?

      Unfortunately, we were unable to perform this experiment as it is not possible to obtain sufficient cells at 48 h with ATc. Due to the essential role of TgSERCA, parasites are unable to replicate after 24 h.

      As noted in the public comments, the data in Figure 4A vs 4G and Figure 4B vs 4H appear to show that the calcium responses to GPN are similar to that with thapsigargin, which seems unexpected if the acidic compartment is loaded from the ER. The results with GPN addition after thapsigargin (Figure 1H) argue against this, but the authors should still cite the work of Atakpa et al.

      We think that the reviewer is concerned that GPN may also be acting on the ER. This is a possibility that we considered, and we now included the suggested citation (line 457). However, we believe that it is difficult to directly compare the responses, as the kinetics of calcium release from the ER may differ from those of release from the PLVAC. This could be due to differences in the calcium buffering capacity between the two compartments. Additionally, it is possible that calcium leaked from the ER is more efficiently sequestered by other stores or extruded through the plasma membrane than calcium released from the PLVAC. Besides, GPN is known to have a more disruptive effect on membranes compared to TG, which may also influence their responses. As noted by the reviewer, Figure 1H also supports the idea that the acidic compartment is loaded from the ER.

      The abbreviation for the plant-like vacuolar compartment (PLVAC) only appears in a figure legend but should be defined in the main text on first use.

      Corrected, lanes 140-143

      The authors should cite the previous study of Borges-Pereira et al., 2020 (PMID: 32848018) that also demonstrates the incomplete overlap of the calcium pools mobilized by thapsigargin and CPA in P. falciparum. The ability to measure calcium in intracellular stores using MagFluo4 opens the possibility to further investigate this discrepancy between CPA and thapsigargin, but CPA does not appear to have been used in the permeabilized cell experiments with MagFluo4. I would suggest that this could be added to Figure 2 and/or Figure 4, or at least as a supplementary figure.

      In response to this reviewer’s critique we performed additional experiments with Mag-Fluo4 loaded parasites. These are presented in the new Figure S3. We added CPA and TG and combined them to inhibit SERCA and to allow calcium leak from the loaded organelle. Under these conditions, we observed a very similar leak rate after the addition of the inhibitors as measured by the slope of Ca<sup>2+</sup> leak. We believe that the leak rate is most likely determined by the intrinsic ER mechanism. See the discussion of this result in lines 436442 and the previous response to the same reviewer comment.

      Reviewer #3 (Recommendations for the authors):

      Suggestions for improved or additional experiments, data, or analyses

      (1) Figure 1A is not mentioned in the main text even though it is discussed.

      Corrected

      (2) Figure 1G: Values do not match, how can GPN be so high?

      These figures were replaced by new traces and individual quantification analyses for each experiment.

      (3) Figure 1H and I: Is this type of data/results also available for the mitochondrion?

      Unfortunately, we were not able to include this experiment because we were unable to accurately quantify the mitochondrial calcium release. Instead, we used mitochondrial GECIs and the results are shown in Figure 5 to study mitochondrial calcium uptake.

      (4) Figure 1H: where does the calcium go after GPN addition? Taken up by another calcium store?

      Most likely calcium is extruded through the plasma membrane by the activity of the Calcium ATPase TgA1.

      However, the reviewer’s suggestion is also possible, and calcium could be taken by another store like the mitochondrion. In this regard, we did observe a large mitochondrial calcium increase (parasites expressing SOD2-GCaMp6) after adding GPN (Fig 5I) suggesting that the mitochondrion may take calcium from the organelle targeted by GPN. However, the calcium affinity of the mitochondrion is very low, so the concentration of calcium needs to be very high to activate it and these concentrations are most likely achieved at the microdomains formed between the mitochondrion and other organelles.

      (5) Figure 2B-C: Further explanation of why these particular values were chosen for the follow-up experiments would be helpful for the reader.

      We tested a wide range of MgATP and free calcium concentrations to measure ER Ca<sup>2+</sup> uptake catalyzed by TgSERCA. The concentrations shown fall within the linear range.

      We followed the free calcium concentrations used by studies of mammalian SERCA (https://doi.org/10.1016/j.ceca.2020.102188 ). In this protocol they used 220 nM free calcium, which was close to cytosolic Ca<sup>2+</sup> levels. TgSERCA can take up calcium efficiently at this concentration, as shown in Fig 2. We used less MgATP than the mammalian cell protocols, since we did not observe a significant increase in SERCA activity beyond 0.5 mM MgATP. We added one more sentence explaining in the results, lines 204-207.

      (6) Figure 3E: Revise the error bar? (and note that colours do not match the graph legend).

      The colors do match; the problem visualizing it is because vacuoles containing a single parasite are virtually absent in the control group without ATc treatment.

      (7) Figure 3H: 'Interestingly, when testing egress after the addition of saponin in the presence of extracellular Ca2+, we observed that the tachyzoites egressed sooner (Figure 3H, saponin egress).' This is the only graph showing egress timing, and thus it is not clear what is the comparison. The egressed here is sooner compared to what condition? Egress in the absence of Ca2+? This requires clarification and might require the control data to be added.

      In the saponin experiment we compare time to egress of the mutant grown with or without ATc. The measurement is for time to egress after adding saponin. This experiment is in the presence of extracellular calcium. The protocol was previously used to measure time to egress: PMID: 40043955, PMID: 38382669, PMID: 26374900. See also response to question 9 of reviewer 1.

      (8) Figure 4C: There is a small peak appearing right after TG addition this should be discussed and explained.

      This trace was generated in a different fluorometer, F-4000. This was an artifact due to jumping of the signal when adding TG. Multiple repeats of the same experiment in the newer F7000 did not show the peak. We included in the MM the use of the F-4000 fluorometer for some experiments. We apologize for the omission. Lines 609-610

      (9) Figure 5A: An important control that is missing is co-localisation with a mitochondrial marker.

      The expression of the SOD2-GCaMP6 has been characterized: PMID: 31758454

      (10) Figure 5H: This line was made for this study however the line genetic verification is missing.

      In response to this concern we now include a new Figure S5 showing the fluorescence of GCaMP6 in the mitochondrion of the iDTgSERCA mutant (Fig. S5A). We include several parasites. In addition, we show fluorescence measurements after addition of Calcium showing that the cells are unresponsive indicating that the indicator is not in the cytosol. Lines 650-651 and 344-348.

      (11) Figure 6D: since the membranes are hard to see, it is not clear whether the arrows show structures that are in line with the definition of membrane contact sites. The authors should provide an in-depth analysis of the length of the interaction between the membranes where the distance is less than 30 nM, and discuss how many structures corresponding to the definition were analysed.

      All the requested details are now included in the legend to Figure S3.

      Minor corrections to the text and figures

      (1) Unify statistical labelling throughout the paper replacing *** with p values.

      Corrected. We changed the *** with the actual p value in some figures. For figure 2 and Fig S1, we still use the *** due to the space limitation.

      (2) Unify ATC vs ATc throughout the paper.

      Corrected

      (3) Unify capitalization of line name (iΔTgserca/i ΔTgSERCA) throughout the paper.

      Corrected

      (4) Unify capitalization of p value (p/P) throughout the paper.

      Corrected in figures

      (5) Unify Fig X vs Fig. X throughout the text.

      Corrected

      (6) Add values of scale bars to legends (eg Figure S2).

      Corrected

      (7) What is the time point for the data in Figures 4E-H, 5H, and S3? 24hrs? include in the legend.

      Added 24 h to the legends. Fig S3 is now S4.

      (8) Figure 3F: The second graph is NS thus perhaps no need for the p-value?

      Corrected

      (8) Figure 3G: Worth considering swapping the two around: first attachment and then invasion?

      Corrected. Invasion and attachment bars were swapped.

      (10) Figure 4A/B: Wrong colour match for Figure 4B.

      Corrected

      (11) Figure 4F: In the main text, the authors reference to Figure 1F, correct to 4F.

      Corrected

      (12) Figure 4H: In the main text, authors reference to Figure 1H, correct to 4H.

      Corrected

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary: The authors of this study sought to define a role for IgM in responses to house dust mites in the lung.

      Strengths:

      Unexpected observation about IgM biology

      Combination of experiments to elucidate function

      Weaknesses:

      Would love more connection to human disease

      We thank the reviewer for these comments. At the time of this publication, we have not made a concrete link with human disease. While there is some anecdotal evidence of diseases such as Autoimmune glomerulonephritis, Hashimoto’s thyroiditis, Bronchial polyp, SLE, Celiac disease and other diseases in people with low IgM. Allergic disorders are also common in people with IgM deficiency, other studies have reported as high as 33-47%. The mechanisms for the high incidence of allergic diseases are unclear as generally, these patients have normal IgG and IgE levels. IgM deficiency may represent a heterogeneous spectrum of genetic defects, which might explain the heterogeneous nature of disease presentations. 

      Reviewer #2 (Public Review):

      Summary:

      The manuscript by Hadebe and colleagues describes a striking reduction in airway hyperresponsiveness in Igm-deficient mice in response to HDM, OVA and papain across the B6 and BALB-c backgrounds. The authors suggest that the deficit is not due to improper type 2 immune responses, nor an aberrant B cell response, despite a lack of class switching in these mice. Through RNA-Seq approaches, the authors identify few differences between the lungs of WT and Igm-deficient mice, but see that two genes involved in actin regulation are greatly reduced in IgM-deficient mice. The authors target these genes by CRISPR-Cas9 in in vitro assays of smooth muscle cells to show that these may regulate cell contraction. While the study is conceptually interesting, there are a number of limitations, which stop us from drawing meaningful conclusions.

      Strengths:

      Fig. 1. The authors clearly show that IgMKO mice have striking reduced AHR in the HDM model, despite the presence of a good cellular B cell response.

      Weaknesses:

      Fig. 2. The authors characterize the cd4 t cell response to HDM in IGMKO mice.<br /> They have restimulated medLN cells with antiCD3 for 5 days to look for IL-4 and IL-13, and find no discernible difference between WT and KO mice. The absence of PBS-treated WT and KO mice in this analysis means it is unclear if HDM-challenged mice are showing IL-4 or IL-13 levels above that seen at baseline in this assay.

      We thank the Reviewer for this comment. We would like to mention that a very minimal level of IL-4 and IL-13 in PBS mice was detected. We have indicated with a dotted line on the Figure to show levels in unstimulated or naïve cytokines. Please see Author response image 1 below from anti-CD3 stimulated cytokine ELISA data. The levels of these cytokines are very low and are not changed between WT and IgM<sup>-/-</sup> mice, this is also true for PMA/ionomycin-stimulated cells.

      Author response image 1.

      The choice of 5 days is strange, given that the response the authors want to see is in already primed cells. A 1-2 day assay would have been better.

      We agree with the reviewer that a shorter stimulation period would work. Over the years we have settled for 5-day re-stimulation for both anti-CD3 and HDM. We have tried other time points, but we consistently get better secretion of cytokines after 5 days.

      It is concerning that the authors state that HDM restimulation did not induce cytokine production from medLN cells, since countless studies have shown that restimulation of medLN would induce IL-13, IL-5 and IL-10 production from medLN. This indicates that the sensitization and challenge model used by the authors is not working as it should.

      We thank the reviewer for this observation. In our recent paper showing how antigen load affects B cell function, we used very low levels of HDM to sensitise and challenge mice (1 ug and 3 ug respectively). See below article, Hadebe et al., 2021 JACI. This is because Labs that have used these low HDM levels also suggested that antigen load impacts B cell function, especially in their role in germinal centres. We believe the reason we see low or undetectable levels of cytokines is because of this low antigen load sensitisation and challenge. In other manuscripts we have published or about to publish, we have shown that normal HDM sensitisation load (1 ug or 100 ug) and challenge (10 ug) do induce cytokine release upon restimulation with HDM. See the below article by Khumalo et al, 2020 JCI Insight (Figure 4A).

      Sabelo Hadebe, Jermaine Khumalo, Sandisiwe Mangali, Nontobeko Mthembu, Hlumani Ndlovu, Amkele Ngomti, Martyna Scibiorek, Frank Kirstein, Frank Brombacher. Deletion of IL-4Ra signalling on B cells limits hyperresponsiveness depending on antigen load. doi.org/10.1016/j.jaci.2020.12.635).

      Jermaine Khumalo, Frank Kirstein, Sabelo Hadebe, Frank Brombacher. IL-4Rα signalling in regulatory T cells is required for dampening allergic airway inflammation through inhibition of IL-33 by type 2 innate lymphoid cells. JCI Insight. 2020 Oct 15;5(20):e136206. doi: 10.1172/jci.insight.136206

      The IL-13 staining shown in panel c is also not definitive. One should be able to optimize their assays to achieve a better level of staining, to my mind.

      We agree with the reviewer that much higher IL-13-producing CD4 T cells should be observed. We don’t think this is a technical glitch or non-optimal set-up as we see much higher levels of IL-13-producing CD4 T cells when using higher doses of HDM to sensitise and challenge, say between 7 -20% in WT mice (see Author response image 2, lung stimulated with PMA/ionomycin+Monensin, please note this is for illustration purposes only and it not linked to the current manuscript, its merely to demonstrate a point from other experiments we have conducted in the lab).

      Author response image 2.

      In d-f, the authors perform a serum transfer, but they only do this once. The half life of IgM is quite short. The authors should perform multiple naïve serum transfers to see if this is enough to induce FULL AHR.

      We thank the reviewer for this comment. We apologise if this was not clear enough on the Figure legend and method, we did transfer serum 3x, a day before sensitisation, on the day of sensitisation and a day before the challenge to circumvent the short life of IgM. In our subsequent experiments, we have now used busulfan to deplete all bone marrow in IgM-deficient mice and replace it with WT bone marrow and this method restores AHR (Figure 3).

      This now appears in line 165 to 169 and reads

      “Adoptive transfer of naïve serum

      Naïve wild-type mice were euthanised and blood was collected via cardiac puncture before being spun down (5500rpm, 10min, RT) to collect serum. Serum (200mL) was injected intraperitoneally into IgM-deficient mice. Serum was injected intraperitoneally at day -1, 0, and a day before the challenge with HDM (day 10).”

      The presence of negative values of total IgE in panel F would indicate some errors in calculation of serum IgE concentrations.

      We thank the reviewer for this observation. For better clarity, we have now indicated these values as undetected in Figure , as they were below our detection limit.

      Overall, it is hard to be convinced that IgM-deficiency does not lead to a reduction in Th2 inflammation, since the assays appear suboptimal.

      We disagree with the reviewer in this instance, because we have shown in 3 different models and in 2 different strains and 2 doses of HDM (high and low) that no matter what you do, Th2 remains intact. Our reason for choosing low dose HDM was based on our previous work and that of others, which showed that depending on antigen load, B cells can either be redundant or have functional roles. Since our interest was to tease out the role of B cells and specifically IgM, it was important that we look at a scenario where B cells are known to have a function (low antigen load). We did find similar findings at high dose of HDM load, but effects on AHR were not as strong, but Th2 was not changed, in fact in some instances Th2 was higher in IgM-deficient mice.

      Fig. 3. Gene expression differences between WT and KO mice in PBS and HDM challenged settings are shown. PCA analysis does not show clear differences between all four groups, but genes are certainly up and downregulated, in particular when comparing PBS to HDM challenged mice. In both PBS and HDM challenged settings, three genes stand out as being upregulated in WT v KO mice. these are Baiap2l1, erdr1 and Chil1.

      Noted

      Fig. 4. The authors attempt to quantify BAIAP2L1 in mouse lungs. It is difficult to know if the antibody used really detects the correct protein. A BAIAP2L1-KO is not used as a control for staining, and I am not sure if competitive assays for BAIAP2L1 can be set up. The flow data is not convincing. The immunohistochemistry shows BAIAP2L1 (in red) in many, many cells, essentially throughout the section. There is also no discernible difference between WT and KO mice, which one might have expected based on the RNA-Seq data. So, from my perspective, it is hard to say if/where this protein is located, and whether there truly exists a difference in expression between wt and ko mice.

      We thank the reviewer for this comment. We are certain that the antibody does detect BAIAP2L1, we have used it in 3 assays, which we admit may show varying specificities since it’s a Polyclonal antibody. However, in our western blot, the antibody detects 1 band at 56.7kDa and no other bands, apart from what we think are isoforms. We agree that BAIAP2L1 is expressed by many cell types, including CD45+ cells and alpha smooth muscle negative cells and we show this in our supplementary Figure 9. Where we think there is a difference in expression between WT and IgM-deficient mice is in alpha-smooth muscle-positive cells. We have tested antibodies from different companies, and we find similar findings. We do not have access to BAIAP2L1 KO mice and to test specificity, we have also used single stain controls with or without secondary antibody and isotype control which show no binding in western blot and Immunofluorescence assays and Fluorescence minus one antibody in Flow cytometry, so that way we are convinced that the signal we are seeing is specific to BAIAP2L1.

      Fig. 5 and 6. The authors use a single cell contractility assay to measure whether BAIAP2L1 and ERDR1 impact on bronchial smooth muscle cell contractility. I am not familiar with the assay, but it looks like an interesting way of analysing contractility at the single cell level.

      The authors state that targeting these two genes with Cas9gRNA reduces smooth muscle cell contractility, and the data presented for contractility supports this observation. However, the efficiency of Cas9-mediated deletion is very unclear. The authors present a PCR in supp fig 9c as evidence of gene deletion, but it is entirely unclear with what efficiency the gene has been deleted. One should use sequencing to confirm deletion. Moreover, if the antibody was truly working, one should be able to use the antibody used in Fig 4 to detect BAIAP2L1 levels in these cells. The authors do not appear to have tried this.

      We thank the reviewer for these observations. We are in a process to optimise this using new polyclonal BAIAP2L1 antibodies from other companies, since the one we have tried doesn’t seem to work well on human cells via western blot. So hopefully in our new version, we will be able to demonstrate this by immunofluorescence or western blot.

      Other impressions:

      The paper is lacking a link between the deficiency of IgM and the effects on smooth muscle cell contraction.

      The levels of IL-13 and TNF in lavage of WT and IGMKO mice could be analysed.

      We have measured Th2 cytokine IL-13 in BAL fluid and found no differences between IgM-deficient mice and WT mice challenged with HDM (Author response image 1). We could not detected TNF-alpha in the BAL fluid, it was below detection limit.

      Author response image 3.

      IL-13 levels are not changed in IgM-deficient mice in the lung. Bronchoalveolar lavage fluid in WT or IgM-deficient mice sensitised and challenged with HDM. TNF-a levels were below the detection limit.

      Moreover, what is the impact of IgM itself on smooth muscle cells? In the Fig. 7 schematic, are the authors proposing a direct role for IgM on smooth muscle cells? Does IgM in cell culture media induce contraction of SMC? This could be tested and would be interesting, to my mind.

      We thank the Reviewer for these comments. We are still trying to test this, unfortunately, we have experienced delays in getting reagents such as human IgM to South Africa. We hope that we will be able to add this in our subsequent versions of the article. We agree it is an interesting experiment to do even if not for this manuscript but for our general understanding of this interaction at least in an in vitro system.

      Reviewer #3 (Public Review):

      Summary:

      This paper by Sabelo et al. describes a new pathway by which lack of IgM in the mouse lowers bronchial hyperresponsiveness (BHR) in response to metacholine in several mouse models of allergic airway inflammation in Balb/c mice and C57/Bl6 mice. Strikingly, loss of IgM does not lead to less eosinophilic airway inflammation, Th2 cytokine production or mucus metaplasia, but to a selective loss of BHR. This occurs irrespective of the dose of allergen used. This was important to address since several prior models of HDM allergy have shown that the contribution of B cells to airway inflammation and BHR is dose dependent.

      After a description of the phenotype, the authors try to elucidate the mechanisms. There is no loss of B cells in these mice. However, there is a lack of class switching to IgE and IgG1, with a concomitant increase in IgD. Restoring immunoglobulins with transfer of naïve serum in IgM deficient mice leads to restoration of allergen-specific IgE and IgG1 responses, which is not really explained in the paper how this might work. There is also no restoration of IgM responses, and concomitantly, the phenotype of reduced BHR still holds when serum is given, leading authors to conclude that the mechanism is IgE and IgG1 independent. Wild type B cell transfer also does not restore IgM responses, due to lack of engraftment of the B cells. Next authors do whole lung RNA sequencing and pinpoint reduced BAIAP2L1 mRNA as the culprit of the phenotype of IgM<sup>-/-</sup> mice. However, this cannot be validated fully on protein levels and immunohistology since differences between WT and IgM KO are not statistically significant, and B cell and IgM restoration are impossible. The histology and flow cytometry seems to suggest that expression is mainly found in alpha smooth muscle positive cells, which could still be smooth muscle cells or myofibroblasts. Next therefore, the authors move to CRISPR knock down of BAIAP2L1 in a human smooth muscle cell line, and show that loss leads to less contraction of these cells in vitro in a microscopic FLECS assay, in which smooth muscle cells bind to elastomeric contractible surfaces.

      Strengths:

      (1) There is a strong reduction in BHR in IgM-deficient mice, without alterations in B cell number, disconnected from effects on eosinophilia or Th2 cytokine production

      (2) BAIAP2L1 has never been linked to asthma in mice or humans

      Weaknesses:

      (1) While the observations of reduced BHR in IgM deficient mice are strong, there is insufficient mechanistic underpinning on how loss of IgM could lead to reduced expression of BAIAP2L1. Since it is impossible to restore IgM levels by either serum or B cell transfer and since protein levels of BAIAP2L1 are not significantly reduced, there is a lack of a causal relationship that this is the explanation for the lack of BHR in IgM-deficient mice. The reader is unclear if there is a fundamental (maybe developmental) difference in non-hematopoietic cells in these IgM-deficient mice (which might have accumulated another genetic mutation over the years). In this regard, it would be important to know if littermates were newly generated, or historically bred along with the KO line.

      We thank the reviewer for asking this question and getting us to think of this in a different way. This prompted us to use a different method to try and restore IgM function and since our animal facility no longer allows irradiation, we opted for busulfan. We present this data as new data in Figure 3. We had to go back and breed this strain and then generated bone marrow chimeras. What we have shown now with chimeras is that if we can deplete bone marrow from IgM-deficient mice and replace it with congenic WT bone marrow when we allow these mice to rest for 2 months before challenge with HDM (new Supplementary Figure 6 a-c) We also show that AHR (resistance and elastance) is partially restored in this way (Figure 3 a and b) as mice that receive congenic WT bone marrow after chemical irradiation can mount AHR and those that receive IgM-deficient bone marrow, can’t mount AHR upon challenge with HDM. If the mice had accumulated an unknown genetic mutation in non-hematopoietic cells, the transfer of WT bone marrow would not make a difference. So, we don’t believe the colony could have gained a mutation that we are unaware of. We have also shipped these mice to other groups and in their hands, this strains still only behaves as an IgM only knockout mice. See their publication below.

      Mark Noviski, James L Mueller, Anne Satterthwaite, Lee Ann Garrett-Sinha, Frank Brombacher, Julie Zikherman 2018. IgM and IgD B cell receptors differentially respond to endogenous antigens and control B cell fate. eLife 2018;7:e35074. DOI: https://doi.org/10.7554/eLife.35074 we have also added methods for bone marrow chimaeras and added results sections and new Figures related to this methods.

      Methods (line 171-182).

      “Busulfan Bone marrow chimeras

      WT (CD45.2) and IgM<sup>-/-</sup> (CD45.2) congenic mice were treated with 25 mg/kg busulfan (Sigma-Aldrich, Aston Manor, South Africa) per day for 3 consecutive days (75 mg/kg in total) dissolved in 10% DMSO and Phosphate buffered saline (0.2mL, intraperitoneally) to ablate bone marrow cells. Twenty-four hours after last administration of busulfan, mice were injected intravenously with fresh bone marrow (10x10<sup>6</sup> cells, 100mL) isolated from hind leg femurs of either WT (CD45.1) or IgM<sup>-/-</sup> mice(33). Animals were then allowed to complement their haematopoietic cells for 8 weeks. In some experiments the level of bone marrow ablation was assessed 4 days post-busulfan treatment in mice that did not receive donor cells. At the end of experiment level of complemented cells were also assessed in WT and IgM<sup>-/-</sup> mice that received WT (CD45.1) bone marrow.”

      Results (line 491-521)

      “Replacement of IgM-deficient mice with functional hematopoietic cells in busulfan mice chimeric mice restores airway hyperresponsiveness.

      We then generated bone marrow chimeras by chemical radiation using busulfan(33). We treated mice three times with busulfan for 3 consecutive days and after 24 hrs transferred naïve bone marrow from congenic CD45.1 WT mice or CD45.2 IgM<sup>-/-</sup> mice (Fig. 3a and Supplementary Fig. 5a). We showed that recipient mice that did not receive donor bone marrow after 4 days post-treatment have significantly reduced lineage markers (CD45+Sca-1+) or lineage negative (Lin-) cells in the bone marrow when compared to untreated or vehicle (10% DMSO) treated mice (Supplementary Figure 5b-c). We allowed mice to reconstitute bone marrow for 8 weeks before sensitisation and challenge with low dose HDM (Figure 3a). We showed that WT (CD45.2) recipient mice that received WT (CD45.1) donor bone marrow had higher airway resistance and elastance and this was comparable to IgM<sup>-/-</sup> (CD45.2) recipient mice that received donor WT (CD45.1) bone marrow (Figure 3b). As expected, IgM<sup>-/-</sup> (CD45.2) recipient mice that received donor IgM<sup>-/-</sup> (CD45.2) bone marrow had significantly lower AHR compared to WT (CD45.2) or IgM<sup>-/-</sup> (CD45.2) recipient mice that received WT (CD45.1) bone marrow (Figure 3b). We confirmed that the differences observed were not due to differences in bone marrow reconstitution as we saw similar frequencies of CD45.1 cells within the lymphocyte populations in the lungs and other tissues (Supplementary Fig. 5d). We observed no significant changes in the lung neutrophils, eosinophils, inflammatory macrophages, CD4 T cells or B cells in WT or IgM<sup>-/-</sup> (CD45.2) recipient mice that received donor WT (CD45.1/CD45.2) or IgM<sup>-/-</sup> (CD45.2) bone marrow when sensitised and challenged with low dose HDM (Fig. 3c)

      Restoring IgM function through adoptive reconstitution with congenic CD45.1 bone marrow in non-chemically irradiated recipient mice or sorted B cells into IgM<sup>-/-</sup> mice (Supplementary Fig.  6a) did not replenish IgM B cells to levels observed in WT mice and as a result did not restore AHR, total IgE and IgM in these mice (Supplementary Fig.  6b-c).”

      The 2 new figures are

      Figure 3 which moved the rest of the Figures down and Supplementary Figure 5, which also moved the rest of the supplementary figures down.

      Discussion appears in line 757-766 of the untracked version of the article.

      To resolve other endogenous factors that could have potentially influenced reduced AHR in IgM-deficient mice, we resorted to busulfan chemical irradiation to deplete bone marrow cells in IgM-deficient mice and replace bone marrow with WT bone marrow. While it is well accepted that busulfan chemical irradiation partially depletes bone marrow cells, in our case it was not possible to pursue other irradiation methods due to changes in ethical regulations and that fact that mice are slow to recover after gamma rays irradiation. Busulfan chemical irradiation allowed us to show that we could mostly restore AHR in IgM-deficient recipient mice that received donor WT bone marrow when challenged with low dose HDM.

      (2) There is no mention of the potential role of complement in activation of AHR, which might be altered in IgM-deficient mice 

      We thank the reviewer for this comment. We have not directly looked at complement in this instance, however, from our previous work on C3-/- mice, there have been comparable AHR to WT mice under the HDM challenge.

      (3) What is the contribution of elevated IgD in the phenotype of the IgM-deficient mice. It has been described by this group that IgD levels are clearly elevated

      We thank the reviewer for this question. We believe that IgD is essentially what drives partial class switching to IgG, we certainly have shown that in the case of VSV virus and Trypanosoma congolense and Trypanosoma brucei brucei that elevated IgD drive delayed but effective IgG in the absence of IgM (Lutz et al, 2001, Nature). This is also confirmed by Noviski studies where they show that both IgM and IgD do share some endogenous antigens, so its likely that external antigens can activate IgD in a similar manner to prompt class switching.

      (4) How can transfer of naïve serum in class switching deficient IgM KO mice lead to restoration of allergen specific IgE and IgG1?

      We thank the Reviewer for these comments, we believe that naïve sera transferred to IgM deficient mice is able to bind to the surface of B cells via IgM receptors (FcμR / Fcα/μR), which are still present on B cells and this is sufficient to facilitate class switching. Our IgM<sup>-/-</sup> mouse lacks both membrane-bound and secreted IgM, and transferred serum contains at least secreted IgM which can bind to surfaces via its Fc portion. We measured HDM-specific IgE and we found very low levels, but these were not different between WT and IgM<sup>-/-</sup> adoptively transferred with WT serum. We also detected HDM-specific IgG1 in IgM<sup>-/-</sup> transferred with WT sera to the same level as WT, confirming a possible class switching, of course, we can’t rule out that transferred sera also contains some IgG1. We also can’t rule out that elevated IgD levels can partially be responsible for class switched IgG1 as discussed above.

      In the discussion line 804-812, we also added the following

      “We speculate that IgM can directly activate smooth muscle cells by binding a number of its surface receptors including FcμR, Fcα/μR and pIgR(52-54). IgM binds to FcμR strictly, but shares Fcα/μR and pIgR with IgA(5,52,54). Both Fcα/μR and pIgR can be expressed by non-structural cells at mucosal sites(54,55). We would not rule out that the mechanisms of muscle contraction might be through one of these IgM receptors, especially the ones expressed on smooth muscle cells(54,55). Certainly, our future studies will be directed towards characterizing the mechanism by which IgM potentially activates the smooth muscle.”

      We have discussed this section under Discussion section, line 731 to 757. In addition, since we have now performed bone marrow chimaeras we have further added the following in our discussion in line 757-766.

      To resolve other endogenous factors that could have potentially influenced reduced AHR in IgM-deficient mice, we resorted to busulfan chemical irradiation to deplete bone marrow cells in IgM-deficient mice and replace bone marrow with WT bone marrow. While it is well accepted that busulfan chemical irradiation partially depletes bone marrow cells, in our case it was not possible to pursue other irradiation methods due to changes in ethical regulations and that fact that mice are slow to recover after gamma rays irradiation. Busulfan chemical irradiation allowed us to show that we could mostly restore AHR in IgM-deficient recipient mice that received donor WT bone marrow when challenged with low dose HDM.

      We removed the following lines, after performing bone marrow chimaeras since this changed some aspects.

      Our efforts to adoptively transfer wild-type bone marrow or sorted B cells into IgM-deficient mice were also largely unsuccessful partly due to poor engraftment of wild-type B cells into secondary lymphoid tissues. Natural secreted IgM is mainly produced by B1 cells in the peritoneal cavity, and it is likely that any transfer of B cells via bone marrow transfer would not be sufficient to restore soluble levels of IgM(3,10).

      (5) Alpha smooth muscle antigen is also expressed by myofibroblasts. This is insufficiently worked out. The histology mentions "expression in cells in close contact with smooth muscle". This needs more detail since it is a very vague term. Is it in smooth muscle or in myofibroblasts.

      Response: We appreciate that alpha-smooth muscle actin-positive cells are a small fraction in the lung and even within CD45 negative cells, but their contribution to airway hyperresponsiveness is major. We also concede that by immunofluorescence BAIAP2L1 seems to be expressed by cells adjacent to alpha-smooth muscle actin (Fig. 5b), however, we know that cells close to smooth muscle (such as extracellular matrix and myofibroblasts) contribute to its hypertrophy in allergic asthma.

      James AL, Elliot JG, Jones RL, Carroll ML, Mauad T, Bai TR, et al. Airway Smooth Muscle Hypertrophy and Hyperplasia in Asthma. Am J Respir Crit Care Med [Internet]. 2012;185:1058–64. Available from: https://doi.org/10.1164/rccm.201110-1849OC

      (6) Have polymorphisms in BAIAP2L1 ever been linked to human asthma?

      No, we have looked in asthma GWAS studies, at least summary statics and we have not seen any SNPs can could be associated with human asthma.

      (7) IgM deficient patients are at increased risk for asthma. This paper suggests the opposite. So the translational potential is unclear

      We thank the reviewer for these comments. At the time of this publication, we have not made a concrete link with human disease. While there is some anecdotal evidence of diseases such as Autoimmune glomerulonephritis, Hashimoto’s thyroiditis, Bronchial polyp, SLE, Celiac disease and other diseases in people with low IgM. Allergic disorders are also common in people with IgM deficiency as the reviewer correctly points out, other studies have reported as high as 33-47%. The mechanisms for the high incidence of allergic diseases are unclear as generally, these patients have normal or higher IgG and IgE levels. IgM deficiency may represent a heterogeneous spectrum of genetic defects, which might explain the heterogeneous nature of disease presentations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We are thankful to the reviewers and the editor for their detailed feedback, insightful suggestions, and thoughtful assessment of our work. The revised manuscript has taken into account all the comments of the three reviewers. We have also undertaken additional analyses and added materials in response to reviewer suggestions. In brief:

      (1) We have conducted a more in-depth analysis of frequency domain HRV metrics to better depict the change of autonomic tone.

      (2) We have revised the manuscript to provide justifications for the chosen taVNS protocol and to clearly articulate the objectives of the current study.

      (3) In response to comments from reviewer #2, we have included two new tables that present the absolute changes in cardiovascular metrics, clinical characteristics for the two trial arms, and effects of taVNS adjusted for age.

      Other significant amendments include:

      (1) An expanded discussion linking our findings to the existing literature on the effects of taVNS on cardiovascular function, biomarkers for taVNS response, the safety of taVNS, and the dose-response relationship of taVNS.

      (2) Revision to the Method section to provide details of QT interval calculation.

      Reviewer #1 (Public Review):

      The authors report the results of a randomized clinical trial of taVNS as a neuromodulation technique in SAH patients. They found that taVNS appears to be safe without inducing bradycardia or QT prolongation. taVNS also increased parasympathetic activity, as assessed by heart rate variability measures. Acute elevation in heart rate might be a biomarker to identify SAH patients who are likely to respond favorably to taVNS treatment. The latter is very important in light of the need for acute biomarkers of response to neuromodulation treatments.

      Comments:

      (1) Frequency domain heart rate variability measures should be analyzed and reported. Given the short duration of the ECG recording, the frequency domain may more accurately reflect autonomic tone.

      We sincerely appreciate this encouraging summary of our paper. We have analyzed and reported frequency-domain heart rate variability measures, including the relative power of the high-frequency band (0.15–0.4 Hz) and the relative power of the low-frequency band (0.04 – 0.15). We showed the distribution of the two frequency-domain HRV measures in supplementary Figure 2C-D. For 24-hour ECG recording, we found that the change in the relative high-frequency power from Day 1 was not significantly different between the treatment groups. As both high-frequency band and low-frequency band power are relative to the total power, the comparison of the relative power of the low-frequency band between groups would be the opposite of the relative power of the high-frequency band. As both time-domain and frequency-domain HRV measures can reflect the autonomic tone, we performed factor analysis to identify the parasympathetic activity component (Figure 2D). Comparing the change in parasympathetic activity component and relative high-frequency power, we observed similarities and discrepancies. Specifically, both the change in parasympathetic activity component and the change in relative high-frequency power were higher in the taVNS group at the early phase (Day 2 - 4).

      We also observed higher high-frequency power in the Sham group at the later phase. If the factor analysis successfully isolates the parasympathetic activity, there should be other factors than the parasympathetic activity affecting the relative power of the high-frequency band. One such factor is the respiration rate. The high-frequency range is between 0.15 to 0.4 Hz, corresponding to respiration's frequency range of approximately 9 to 24 breaths per minute. If the respiration rate increases and exceeds 24 breaths per minute, the respiratory-driven HRV might occur at a frequency higher than the typical high-frequency band. Given that the respiration rate was higher in the taVNS treatment group, a compensatory mechanism to ensure oxygen delivery (Figure 4E), we hypothesized that observed lower high-frequency power in the taVNS treatment group compared to sham at later phases is a result of increased respiration rate in the taVNS treatment group. Indeed, we found the normalized high-frequency power is higher when RR is less than 25 bpm compared to when RR > 25 bpm (Cohen’s d = 0.85, Supplementary Figure 3A). Moreover, an increase in RR in the taVNS treatment group is associated with a decrease in high-frequency power (Supplementary Figure 3B). These control analyses underscored the necessity of performing factor analysis to robustly measure parasympathetic activities and confirm that taVNS treatment mitigated the sympathetic overactivation during the early phase.

      We have now discussed the results of frequency-domain HRV measures in the Discussion section: taVNS and autonomic system (p23): “A key metric that reflects this restored sympathovagal balance is the increase in heart rate variability (Figure 3F). Specifically, the factor analysis showed that the parasympathetic activity was significantly higher in the taVNS treatment group. This difference was most pronounced during the early phase, particularly between Days 2 and 4 following SAH. In addition to analyzing the correlation between the parasympathetic activity factor and established HRV measures that reflect parasympathetic activity such as RMSSD and pNNI_50 (Figure 3C), we also examined changes in a frequency-domain HRV measure—the relative power of the high-frequency band (0.15–0.4 Hz)—to validate the accuracy of the factor analysis. the relative power of the high-frequency band is widely used to indicate respiratory sinus arrhythmia, a process primarily driven by the parasympathetic nervous system (Supplementary Figure 2). We found that both the change in parasympathetic activity factor and relative high-frequency power were higher in the taVNS group at the early phase (Day 2 - 4). Conversely, we observed higher high-frequency power in the Sham group during the later phase. If the factor analysis successfully isolates the parasympathetic activity, there should be other factors than the parasympathetic activity affecting the relative power of the high-frequency band. One such factor is the respiration rate. The high-frequency range is between 0.15 to 0.4 Hz, corresponding to respiration's frequency range of approximately 9 to 24 breaths per minute. If the respiration rate increases and exceeds 24 breaths per minute, the respiratory-driven HRV might occur at a frequency higher than the typical high-frequency band. Given that the respiration rate was higher in the taVNS treatment group, a compensatory mechanism to ensure oxygen delivery (Figure 4E), we hypothesized that observed lower high-frequency power in the taVNS treatment group compared to sham at later phases is a result of increased respiration rate in the taVNS treatment group. Indeed, we found the normalized high-frequency power is higher when RR is less than 25 bpm compared to when RR > 25 bpm (Cohen’s d = 0.85, Supplementary Figure 3A). Moreover, an increase in RR in the taVNS treatment group is associated with a decrease in high-frequency power (Supplementary Figure 3B). These control analyses underscored the necessity of performing factor analysis to robustly measure parasympathetic activities and confirm that taVNS treatment mitigated the sympathetic overactivation during the early phase.”

      We have also reported the changes in the relative power of the high-frequency band between the two treatment groups in Supplementary Figure 6. We did not find a significant change in relative high-frequency band power between the treatment groups (Treatment – pre-treatment difference: p = 0.74, Cohen’s d = -0.08, N(Sham) = 199, N(taVNS) = 188, Mann-Whitney U test). We reported these results in the Results section: Acute effects of taVNS on cardiovascular function (p18): “There were no significant differences in changes in corrected QT interval or heart rate variability, as measured by RMSSD, SDNN, and relative power of high-frequency band between treatment groups (Figure 5D and E and Supplementary Figure 6).”

      How was the "dose" chosen (20 minutes twice daily)?

      The choice of a 20-minute taVNS session twice daily was informed by findings from Addorisio et al. (2019), where the authors administered 5-minute taVNS twice daily to patients with rheumatoid arthritis for two days. They found that the circulating c-reactive protein (CRP) levels significantly reduced after two days of treatment but returned to baseline at the second clinical assessment by day 7. Given the high inflammatory state associated with subarachnoid hemorrhage (SAH) and our intention to maintain a steady reduction in inflammation, we extended the duration of taVNS to 20 minutes per session. We have clarified this stimulation schedule's rationale in the Results section (p5-6): “This treatment schedule was informed by findings from Addorisio et al., where a 5-minute taVNS protocol was administered twice daily to patients with rheumatoid arthritis for two days.29 Their study found that circulating c-reactive protein (CRP) levels significantly reduced after 2 days of treatment but returned to baseline at the second clinical assessment by day 7. Given the high inflammatory state associated with SAH and our intention to maintain a steady reduction in inflammation, we decided to extend the treatment duration to 20 minutes per session.”

      Addorisio, Meghan E., et al. "Investigational treatment of rheumatoid arthritis with a vibrotactile device applied to the external ear." Bioelectronic Medicine 5 (2019): 1-11.

      The use of an acute biomarker of response is very important. A bimodal response to taVNS has been previously shown in patients with atrial fibrillation (Kulkarni et al. JAHA 2021).

      Thank you for this valuable insight and for bringing the study by Kulkarni et al. to our attention. Their study showed that the response to Low-Level Tragus Stimulation (LLTS) varied among patients with atrial fibrillation, which can be predicted by acute P-wave alternans (PWA) to some degree. We have discussed the implication of the bimodal response to taVNS in the Discussion section (p26-27): “Kulkarni et al. showed that the response to low-level tragus stimulation (LLTS) varied among patients with atrial fibrillation.49 Similarly, in our study, not all patients in the taVNS treatment group showed a reduction in mRS scores (improved degree of disability or dependence). This differential response may be inherent to taVNS and potentially influenced by factors such as anatomical variations in the distribution of the vagus nerve at the outer ear. These findings underscore the importance of using acute biomarkers to guide patient selection and optimize stimulation parameters. Furthermore, we found that increased heart rate was a potential acute biomarker for identifying SAH patients who are most likely to respond favorably to taVNS treatment. Translating this finding into clinical practice will require further research to elucidate the mechanisms by which an acute increase in heart rate may predict the outcomes of patients receiving taVNS, including its relationship with neurological evaluations, vasospasm, echocardiography, and inflammatory markers.”

      Reviewer #2 (Public Review):

      Summary:

      This study investigated the effects of transcutaneous auricular vagus nerve stimulation (taVNS) on cardiovascular dynamics in subarachnoid hemorrhage (SAH) patients. The researchers conducted a randomized clinical trial with 24 SAH patients, comparing taVNS treatment to a Sham treatment group (20 minutes per day twice a day during the ICU stay). They monitored electrocardiogram (ECG) readings and vital signs to assess acute as well as middle-term changes in heart rate, heart rate variability, QT interval, and blood pressure between the two groups. The results showed that repetitive taVNS did not significantly alter heart rate, corrected QT interval, blood pressure, or intracranial pressure. However, it increased overall heart rate variability and parasympathetic activity after 5-10 days of treatment compared to the sham treatment. Acute taVNS led to an increase in heart rate, blood pressure, and peripheral perfusion index without affecting corrected QT interval, intracranial pressure, or heart rate variability. The acute post-treatment elevation in heart rate was more pronounced in patients who showed clinical improvement. In conclusion, the study found that taVNS treatment did not cause adverse cardiovascular effects, suggesting it is a safe immunomodulatory treatment for SAH patients. The mild acute increase in heart rate post-treatment could potentially serve as a biomarker for identifying SAH patients who may benefit more from taVNS therapy.

      Strengths:

      The paper is overall well written, and the topic is of great interest. The methods are solid and the presented data are convincing.

      Weaknesses:

      (1) It should be clearly pointed out that the current paper is part of the NAVSaH trial (NCT04557618) and presents one of the secondary outcomes of that study while the declared first outcomes (change in the inflammatory cytokine TNF-α in plasma and cerebrospinal fluid between day 1 and day 13, rate of radiographic vasospasm, and rate of requirement for long-term CSF diversion via a ventricular shunt) are available as a pre-print and currently under review (doi: 10.1101/2024.04.29.24306598.). The authors should better stress this point as well as the potential association of the primary with the secondary outcomes.

      Thank you for this valuable suggestion. The current study indeed focuses on the trial’s secondary outcomes. The main objective is to evaluate the cardiovascular safety of the taVNS protocol and to provide insights that will inform the application of taVNS in SAH patients. Following your comments, we have clarified this in the Introduction section (p6): “The current study is part of the NAVSaH trial (NCT04557618) and focuses on the trial’s secondary outcomes, including heart rate, QT interval, HRV, and blood pressure.32 This interim analysis aims to evaluate the cardiovascular safety of the taVNS protocol and to provide insights that will inform the application of taVNS in SAH patients. The primary outcomes of this trial, including change in the inflammatory cytokine TNF-α and rate of radiographic vasospasm, are available as a pre-print and currently under review.26”

      The negative association between HRV and inflammatory cytokines has been reported in numerous studies such as (Williams et al., Brain, Behavior, and Immunity, 2019; Haensel et al., Psychoneuroendocrinology. 2008). There are some studies suggesting that increased sympathetic tone following SAH is associated with vasospasm (Bjerkne Wenneberg, S. et al., Acta Anaesthesiologica Scandinavica. 2020; Megjhani et al., Neurocrit Care. 2020). Based on the literature, we compared the effects of taVNS on primary and secondary outcomes. The findings from the two parallel analyses are consistent: taVNS treatment reduced pro-inflammatory cytokines and increased HRV. Furthermore, the analyses of the primary outcomes revealed a reduction in the presence of any radiographic vasospasm in the taVNS treatment group compared to the sham. We have now integrated these findings and discussed them in the Discussion section (p25-26): “Given the negative association between pro-inflammatory markers and HRV, our finding that HRV was higher in the taVNS treatment group aligns with the findings of primary outcomes of this clinical trial, which showed that taVNS treatment reduced pro-inflammatory cytokines, including tumor necrosis factor-alpha (TNF-α) and interleukin-6.26,52 The consistency between these findings strengthens the evidence supporting the anti-inflammatory effects of taVNS. In addition, the sympathetic predominance following SAH is implicated in an increased risk of delayed cerebral vasospasm, which is most commonly detected 5-7 days after SAH.12 Given that taVNS treatment mitigated the sympathetic overactivation before the typical onset of cerebral vasospasm, it could potentially reduce the severity of this complication.”

      (2) The references should be implemented particularly concerning other relevant papers (including reviews and meta-analysis) of taVNS safety, particularly from a cardiovascular standpoint, such as doi: 10.1038/s41598-022-25864-1 and doi: 10.3389/fnins.2023.1227858).

      Thank you for providing the relevant papers. We have provided these references in the Introduction section to provide a more comprehensive background of our study (p6): “While some animal studies have reported a potential risk of bradycardia and decreased blood pressure associated with vagus nerve stimulation, two reviews of human studies have considered the cardiovascular effects of taVNS generally safe, with adverse effects reported only in patients with pre-existing heart diseases. 21,22,23

      (3) The dose-response issue that affects both VNS and taVNS applications in different settings should be mentioned (doi: 10.1093/eurheartjsupp/suac036.) as well as the need for more dose-finding preclinical as well as clinical studies in different settings (the best stimulation protocol is likely to be disease-specific).

      Overall, the present work has the important potential to further promote the usage of taVNS even on critically ill patients and might set the basis for future randomized studies in this setting

      Thank you for this valuable insight. Scientific understanding of the dose-response relationship and determining optimal parameters tailored to specific disease contexts has been recognized as an important part of taVNS research and, more generally, in the electrical neuromodulation field. Studies in this direction are often complex and time-intensive due to the multitude of possible parameter combinations. As such, most taVNS studies opted to use parameters that have been established in previous studies. For example, 20 Hz taVNS is extensively used as a therapeutic intervention in stroke (Matyas Jelinek ,2024, https://www.sciencedirect.com/science/article/pii/S0014488623003138). As we pioneer the application of taVNS as an immunomodulation technique in SAH patients, we also adopt parameters reported in similar studies, aiming to provide a basis for future preclinical and clinical studies of taVNS in this patient population. As you noted, the effects of taVNS are dose-dependent, necessitating systematic exploration of the parameter space, including frequency, intensity, and duration. Our findings of the acute biomarker (heart rate) hold the promise of close-loop taVNS. We have now emphasized the importance of investigating how parameters/dose affect taVNS’s effects on immune function and cardiovascular function in SAH patients (p28): “As we pioneer the application of taVNS as an immunomodulation technique in SAH patients, we adopt parameters (20 Hz, 0.4 mA) reported in similar studies.55 The current study provides a basis for future preclinical and clinical studies of taVNS in this patient population. To build on our findings, a systematic evaluation of the relationship between parameters such as frequency, intensity, and duration and taVNS’s effects on the immune system and cardiovascular function is necessary to establish taVNS as an effective therapeutic option for SAH patients.56”

      Reviewer #2 (Recommendations For The Authors):

      The paper is overall well written, and the topic is of great interest. The reviewer has some major comments:

      (1) It should be clearly pointed out that the current paper is part of the NAVSaH trial and presents one of the secondary outcomes of that study while the declared first outcomes (change in the inflammatory cytokine TNF-α in plasma and cerebrospinal fluid between day 1 and day 13, rate of radiographic vasospasm, and rate of the requirement for long-term CSF diversion via a ventricular shunt) are available as a pre-print and currently under review (doi: 10.1101/2024.04.29.24306598.).

      We have revised the manuscript following your comment. Please see comment Reviewer 2 Public Review and our response.

      The authors should assess the relationship between the impact of taVNS on inflammatory markers in plasma and in cerebrospinal fluid and the autonomic responses. The association between inflammatory markers and noninvasive autonomic markers as well as sympathovagal balance should also be assessed. Specifically, the authors should try to assess whether the acute post-treatment elevation in heart rate was more pronounced in patients who experienced a more pronounced reduction in inflammatory biomarkers. Indeed, since all patients in the current study received the same dose of taVNS (20 Hz frequency, 250 μs pulse width, and 0.4 mA intensity), while in several cardiovascular studies (doi: 10.1016/j.jacep.2019.11.008, doi: 10.1007/s10286-023-00997-z) the intensity (amplitude) of taVNS was differentially set based on the subjective pain/sensory threshold, that might be a marker of acute afferent neuronal engagement.

      We agree that analyzing the change in cardiovascular metrics and changes in inflammatory markers is an important next step. In particular, testing whether the acute elevation in heart rate correlates with changes in inflammatory markers could further establish heart rate as a biomarker to guide patient selection and optimize stimulation parameters. (Please refer to comment 1.3 and our responses). However, in this paper, the primary objective is the cardiovascular safety of the current taVNS protocol in SAH patients. This association between inflammatory markers and autonomic responses extends beyond the scope of the current manuscript and would be more appropriately addressed in a separate publication.

      Previous literature has shown a negative association between HRV and inflammatory markers in SAH patients (for example, Adam, J., 2023). It is reasonable to postulate that taVNS modulates the immune system and the autonomic system synergistically. We found that parasympathetic tone was higher in the taVNS treatment group, with the most notable differences observed between Days 2 and 4 following SAH (Figure 3F). In a separate study of the primary outcomes of this trial (Huguenard et al., 2024), serum levels of IL-6 (pro-inflammation cytokine) were also significantly lower in the taVNS treatment group on Day 4 (Figure 3A, in our preprint, https://doi.org/10.1101/2024.04.29.24306598).

      We appreciate your input regarding the potential mechanism behind acute heart rate changes. In this trial, all patients who were able to engage in verbal communication were asked if they felt any prickling or pain during all sessions. We confirmed that the current stimulation setting was sub-perception in all trialed patients, making it unlikely that the observed heart rate increase was due to pain or sensory perception. Our current hypothesis is that successful activation of the afferent vagal pathway by taVNS increased arousal, resulting in increased heart rate. We have revised the Discussion section based on your insight (p29): “All patients who were capable of verbal communication were asked if they felt any prickling or pain during all sessions. We confirmed that the current taVNS protocol is below the perception threshold for all trialed patients. Altogether, successful activation of the afferent vagal pathway by taVNS increased arousal, resulting in increased heart rate.50,51”

      Huguenard, A. L. et al. Auricular Vagus Nerve Stimulation Mitigates Inflammation and Vasospasm in Subarachnoid Hemorrhage: A Randomized Trial. (2024) doi:10.1101/2024.04.29.24306598.

      Adam, J., Rupprecht, S., Künstler, E. C. S. & Hoyer, D. Heart rate variability as a marker and predictor of inflammation, nosocomial infection, and sepsis – A systematic review. Autonomic Neuroscience vol. 249 103116 (2023).

      A new table should be provided with the mean (or median) values of the two arms of the population (taVNS and sham) including baseline clinical characteristics, comorbidities (mean age, % of female, % with known hypertension, diabetes, etc), ongoing medications (% on beta-betablockers, etc), and pre, during and post-treatment absolute values (expressed as mean or median depending on the distribution) of the studied parameters (QT and QTc absolute values, heart rate, SDNN, etc) in order for the reader to have a better understanding of how SAH affects these parameters. Absolute changes in the abovementioned parameters should also be presented in the table. For instance, the reported absolute increase in heart rate, based on Figure 5, panel C, seems very modest, below 2 bpm. This is very important to underlying for several reasons, including the fact that the evaluation of the impact of treatment on heart rate variability as assessed in the time domain might be influenced by concomitant changes in heart rate due to the nonlinearity of neural modulation of sinus node cycle length. Indeed, time-domain indexes of HRV intrinsically increase when heart rate decreases in a nonlinear way, while frequency domain indexes (e.g. the low frequency/high frequency (LF/HF) ratio), appear to be devoid of intrinsic rate-dependency (doi: 10.1016/s0008-6363(01)00240-1).

      Thank you for your suggestion. We have added the new table to the manuscript. In this table, we include clinical characteristics, the median of absolute values of cardiovascular metrics from 24-hour ECG recording, and the median absolute changes in these metrics for both arms. We believe that absolute values of cardiovascular metrics from 24-hour ECG recording are more informative about how SAH affects these parameters than metrics for the pre-, during-, and post-treatment periods.

      In Result (p7), we have added: “Supplementary Table 3 shows the clinical characteristics of the two treatment groups.” In Result, Acute effect of taVNS on cardiovascular function (p20), we have added: “Supplementary Table 3 summarizes the absolute changes in cardiovascular metrics for the treatment groups.”

      Thank you for raising the concern about HRV and providing the reference. We have now reported frequency domain indexes in our results: relative power of high-frequency power, which is negatively correlated with the LF/HF ratio. The high-frequency power is used to capture sinus arrhythmia, reflecting the parasympathetic modulation of the heart. Although the frequency domain metrics might be less susceptible to the rate-dependency (doi: 10.1016/s0008-6363(01)00240-1), there are circumstances when the frequency domain metrics might not accurately reflect the autonomic tone (Please see Reviewer 1 Publice Review and our responses).

      An attempt to correct the effect of taVNS on the evaluated autonomic parameters according to age should be provided, considering that there were no age limits and parasympathetic indexes, particularly at the sinus node level, are known to decrease with age, particularly for those older than 65 years.

      Thank you for the suggestion. We were aware of the influence of age on cardiac heart rate and heart rate variability. In our initial analysis, we compared the change in autonomic parameters from day 1 within each subject across the two treatment groups. This approach controls for individual differences, including those due to age. In addition to your comment, age is a risk factor for subarachnoid hemorrhage. Older individuals often face an increased risk of poor outcomes. To further verify if age influences autonomic changes following SAH, we performed ANCOVA on autonomic function parameters with age included as a covariate. This analysis showed that age was negatively correlated with changes in heart rate, SDNN, and RMSSD from Day 1, but not with changes in QT intervals. After adjusting for age, we found that RMSSD changes and SDNN changes were significantly higher in the taVNS treatment group, while QTc changes were significantly lower in this group. These results align with the main findings (Figures 2 and 3). In addition, autonomic changes following SAH may be influenced by age. Specifically, lower RMSSD and SDNN in older individuals suggest a greater shift toward sympathetic predominance following SAH. We have now reported these results in Supplementary Table 4 and discussed their implication in the Discussion section (p28): “To control for individual differences, including those due to age, our study compared the change in cardiovascular parameters from Day 1 within each subject across treatment groups. To further verify if age influences autonomic changes following SAH, we performed ANCOVA on autonomic function parameters with age included as a covariate. This analysis showed that age was negatively correlated with changes in heart rate, SDNN, and RMSSD from Day 1 but not with changes in QT intervals. After adjusting for age, we found that RMSSD changes and SDNN changes were significantly higher, while QTc changes were significantly lower in the taVNS treatment group (Supplementary Table 4). These results align with the conclusion that repetitive taVNS treatment increased HRV and was unlikely to cause bradycardia or QT prolongation. In addition, autonomic changes following SAH may be influenced by age. Specifically, lower RMSSD and SDNN in older individuals suggest a greater shift toward sympathetic predominance following SAH (Supplementary Table 4).”

      The results of the current study should be discussed considering what was previously demonstrated concerning the cardiovascular effects of taVNS (doi: 10.3389/fnins.2023.1227858).

      We appreciate the suggestion to consider previous findings on the cardiovascular effects of taVNS. However, it is important to note that most studies investigating the cardiovascular effects of taVNS involve healthy individuals, whereas our study focuses on SAH patients who are critically ill. Given the influence of SAH on cardiovascular parameters, we should be cautious when generalizing our findings to the broader population. Previous studies involving stroke populations have reported cardiovascular parameters descriptively as part of their safety assessments (doi: 10.1155/2020/8841752). Our study is currently the only one systematically investigating the cardiovascular safety of taVNS in SAH patients. Furthermore, the review paper (doi: 10.3389/fnins.2023.1227858) includes a highly heterogeneous mix of studies, such as auricular acupressure, auricular acupuncture, and electrical stimulation applied to different parts of the ear. For the subset of studies involving electrical stimulation, there is considerable variation in the parameters used, with frequencies ranging from 0.5 Hz to 100 Hz, currents from 0.1 mA to 45 mA, and durations spanning from 20 minutes to 168 days. These variations make direct comparisons with our findings challenging.

      It looks like QT measurements were performed automatically. It should be specified which method was used for the measurements (threshold, tangent, or superimposed method?).

      In our study, QT intervals were measured based on thresholding after wavelet transforming the ECG signals (Martínez, J. P., IEEE Transactions on Biomedical Engineering, 2004, doi: 10.1109/TBME.2003.821031). The local maxima of the wavelet transform correspond to significant changes in the ECG signal, such as the rapid upward or downward deflections associated with the QRS complex. The algorithm searches modulus maxima, that is, peaks of wavelet transform coefficients that exceed specific thresholds, to identify the QRS complex. R peaks are found as the zeros crossing between the positive-negative modulus maxima pair. After localizing the R peak, the Q onset is detected as the beginning of the first modulus maximum before the modulus maximum pair created by the R wave. To identify the T wave, the algorithm searches for local maxima in the absolute wavelet transform in a search window defined relative to the QRS complex. Thresholding is used to identify the offset of the T wave. Please refer to comments 3.4 and 3.5 and our responses for details. We have clarified the method for measuring QT in the Method section (p35): “This algorithm identifies the QRS complex by searching for modulus maxima, which are peaks in the wavelet transform coefficients that exceed specific thresholds. The onset of the QRS complex is determined as the beginning of the first modulus maximum before the modulus maximum pair created by the R wave. To identify the T wave, the algorithm searches for local maxima in the absolute wavelet transform in a search window defined relative to the QRS complex. Thresholding is used to identify the offset of the T wave.”

      QTc dispersion was not evaluated, and this should be listed as a limitation of the current study.

      We have added this limitation in the Discussion section: Limitations and outlook (p31): “The current study did not explore the effects of taVNS on less commonly used cardiovascular metrics, such as QTc dispersion.”

      It has been recently suggested (doi: 10.1016/j.brs.2018.12.510) that QTc, as a potential indirect marker of HRV, might be used as a biomarker for VNS response in the treatment of resistant depression. The author should try to assess whether in the current study baseline QTc before taVNS is associated with outcome and with taVNS response.

      Thank you for the suggestion. The conference abstract in the provided doi stated that QTc as an indirect marker of HRV before implantation was correlated with changes in the depression rating scale. The mechanism seems to be that QTc has information about the pathophysiology of the depression (10.1097/YCT.0000000000000684). The current study focused on the comparison between taVNS treatment and sham treatment. Our future study will further test if SAH patients’ response to taVNS can be predicted by baseline QTc.

      The dose-response issue that affects both VNS and taVNS in different settings should be mentioned (doi: 10.1093/eurheartjsupp/suac036.) as well as the need for more dose-finding preclinical as well as clinical studies in different settings (the best stimulation protocol is likely to be disease-specific).

      Please refer to our responses to comment 3.

      Minor Comments

      Some typos or commas instead of affirmative points and vice versa.

      Thank you for pointing this out. We have carefully proofread the manuscript and made the necessary corrections to ensure proper punctuation and grammar throughout.

      Table 1: why age is expressed as a range for each person?

      MedRxiv asks authors to remove all identifying information. Precise ages are direct identifiers, as opposed to age ranges. We have now revised the age column to ‘decade of life’ in the updated table. We believe this modification reduces confusion while adhering to MedRxiv’s guidelines.

      Although already reported in the study protocol (doi: 10.1101/2024.03.18.24304239), the heart rate limits for inclusion should be reported (sustained bradycardia on arrival with a heart rate < 50 beats per minute for > 5 minutes, implanted pacemaker or another electrical device).

      We have now added the specific inclusion and exclusion criteria in the Method details section (p33): “Inclusion criteria were: (1) Patients with SAH confirmed by CT scan; (2) Age > 18; (3) Patients or their legally authorized representative are able to give consent. Exclusion criteria were: (1) Age < 18; (2) Use of immunosuppressive medications; (3) Receiving ongoing cancer therapy; (4) Implanted electrical device; (5) Sustained bradycardia on admission with a heart rate < 50 beats per minute for > 5 minutes; (6) Considered moribund/at risk of imminent death.”

      Why did the authors choose a taVNS schedule of two times per day of 30 minutes each as compared for instance to one hour per day? Please comment on that also referring to other taVNS studies in the acute setting such as the one by Dasari T et al (doi: 10.1007/s10286-023-00997-z.) where taVNS was applied for 4 hours twice daily. For instance, Yum Kim et al (doi: 10.1038/s41598-022-25864-1) recently reported in a systematic review and meta-analysis of taVNS, safety, that repeated sessions and sessions lasting 60 min or more were shown to be more likely to lead to adverse events.

      The International Consensus-Based Review and Recommendations for Minimum Reporting Standards in Research on Transcutaneous Vagus Nerve Stimulation should be referred to and contextualized (doi: 10.3389/fnhum.2020.568051).

      Thank you for raising this question and providing relevant references. We have reviewed the proposed checklist for minimum reporting items in taVNS research (10.3389/fnhum.2020.568051) and have ensured that our manuscript complies with the recommended reporting items.

      The current taVNS schedule was based on findings from Addorisio et al. (2019). We have revised the manuscript to clarify the rationale behind the current taVNS protocol. Please refer to our response to comment 1.2. The two studies mentioned in the comments were published after our trial was designed and initiated (https://clinicaltrials.gov/study/NCT04557618). Based on the meta-analysis by Yum Kim et al., the short duration of treatment sessions might explain the cardiovascular safety of the current taVNS protocol. We are also currently assessing the effects of our taVNS protocol on inflammatory markers.

      Reviewer #3 (Public Review):

      Summary:

      The authors aimed to characterize the cardiovascular effects of acute and repetitive taVNS as an index of safety. The authors concluded that taVNS treatment did not induce adverse cardiovascular effects, such as bradycardia or QT prolongation.

      Strengths:

      This study has the potential to contribute important information about the clinical utility of taVNS as a safe immunomodulatory treatment approach for SAH patients.

      Weaknesses:

      A number of limitations were identified:

      (1) A primary hypothesis should be clearly stated. Even though the authors state the design is a randomized clinical trial, several aspects of the study appear to be exploratory. The method of randomization was not stated. I am assuming it is a forced randomization given the small sample size and approximately equal numbers in each arm.

      Thank you for the suggestion. The current study is part of the NAVSaH trial (NCT04557618), aiming to define the effects of taVNS on inflammatory markers, vasospasm, hydrocephalus, and continuous physiology data. This study focuses on the effects of repetitive and acute taVNS on continuous physiology data to evaluate the cardiovascular safety of the current taVNS protocol. The primary hypothesis tested in our study is that repetitive taVNS increased HRV but did not cause bradycardia and QT prolongation. Following your comments, we have clarified this in the Introduction section (p6): “This interim analysis aims to evaluate the cardiovascular safety of the taVNS protocol and to provide insights that will inform the application of taVNS in SAH patients. The primary outcomes of this trial, including change in the inflammatory cytokine TNF-α and rate of radiographic vasospasm, are available as a pre-print and currently under review.26 Based on a meta-analysis, repeated sessions lasting 60 min or more are likely to lead to aversive effects; therefore, we hypothesized that repetitive taVNS increased HRV but did not cause bradycardia and QT prolongation.23”

      (2) The authors "first investigated whether taVNS treatment induced bradycardia or QT prolongation, both potential adverse effects of vagus nerve stimulation. This analysis showed no significant differences in heart rate calculated from 24-hour ECG recording between groups." A justification should be provided for why a difference is expected from 20 minutes of taVNS over a period of 24 hours. Acute ECG changes are a concern for increasing arrhythmic risk, for example, due to cardiac electrical restitution properties.

      A human study (Clancy, L. A. et al., Brain Stimulation, 2017, https://doi.org/10.1016/j.brs.2014.07.031) has found that 15-min taVNS led to reduced sympathetic activity measured by low-frequency/high-frequency (LF/HF) ratio. The sympathetic activity remained lower than baseline levels during the recovery period, suggesting potential long-term effects of taVNS on cardiovascular function. In addition, the repetitive taVNS treatment in this clinical trial was intended to maintain a steady low-inflammatory state. Given the potential life-threatening implications of bradycardia and QT prolongation in these critically ill patients, we deemed it crucial to evaluate heart rate and QT interval both acutely and from 24-hour ECG monitoring. We have now provided the justification in the Result section (p11): “A study has shown that 15 minutes of taVNS reduced sympathetic activity in healthy individuals, with effects that persist during the recovery period.33 This finding suggests that taVNS may exert long-term effects on cardiovascular function. Therefore, we investigated whether repetitive taVNS treatment affects heart rate and QT interval, key indicators of bradycardia or QT prolongation, using 24-hour ECG recording.”

      An additional value of analyzing 24-hour ECG recording is that we can detect bradycardia or QT prolongation that happen outside the period of the stimulation, which could caused by repetitive taVNS. To this end, we reanalyzed the data and calculated the percentage of prolonged QT intervals using 500ms criterion (Giudicessi, J. R., Noseworthy, P. A. & Ackerman, M. J. The QT Interval. Circulation, 2019). When comparing the percentage of prolonged QT intervals between the treatment groups, we found that changes in prolonged QT intervals percentage from Day 1 were higher in the Sham group (Figure 3F, Mann–Whitney U test, N(taVNS) = 94, N(Sham)=95, p-value < 0.001, Cohen’s d = -0.72). We have now reported the results in the Result section (p11): “To ensure that repetitive taVNS did not lead to QT prolongation happening outside the period of stimulation, we calculated the percentage of prolonged QT intervals. Prolonged QT intervals were defined as corrected QT interval >= 500 ms. We found that changes in prolonged QT intervals percentage from Day 1 were higher in the Sham group (Figure 3F, Mann–Whitney U test, N(taVNS) = 94, N(Sham)=95, p-value < 0.001, Cohen’s d = -0.72).

      The concern regarding acute ECG changes related to increased arrhythmic risk is valid. We have improved the reasoning behind analyzing acute ECG change, which now reads (p20): “Assessing the acute effect of taVNS on cardiovascular is crucial for its safe translation into clinical practice. We compared the acute change of heart rate, corrected QT interval, and heart rate variability between treatment groups, as abrupt changes in the pacing cycle may increase the risk of arrhythmias.”

      (3) More rigorous evaluation is necessary to support the conclusion that taVNS did not change heart rate, HRV, QTc, etc. For example, shifts in peak frequencies of the high-frequency vs. low-frequency power may be effective at distinguishing the effects of taVNS. Further, compensatory sympathetic responses due to taVNS should be explored by quantifying the changes in the trajectory of these metrics during and following taVNS.

      We appreciate your concerns regarding the potential effects on the autonomic system associated with taVNS treatment. We would like to clarify that the primary objective of our study was to evaluate the cardiovascular safety of the taVNS protocol in SAH, with a specific focus on detecting any acute changes in heart rate and QT interval. As you highlighted, such acute ECG changes are a concern for increasing arrhythmic risk. By directly studying the trend of heart rate, HRV, and QT over the acute treatment periods, we found no significant change in these metrics between treatment groups. In addition, these metrics remained within 0.5 standard deviations of their daily fluctuations during and following taVNS treatment (Figure 5 and Supplementary Figure 6). These findings support the conclusion that the current protocol is unlikely to cause cardiac complications.

      In response to your suggestion to conduct a more rigorous analysis, particularly concerning peak frequencies within the high-frequency (HF) and low-frequency (LF) bands, we pursued this analysis to explore more nuanced effects of taVNS on the autonomic system. We compared the shifts in peak frequencies within these bands between the treatment groups and found no significant changes that would suggest a sympathetic or parasympathetic shift following acute taVNS.

      In detail, we have made the following revisions following your comments:

      (1) We have clarified the motivation behind studying the acute change of cardiac metrics following taVNS treatment – monitoring the cardiovascular safety of current taVNS protocol in SAH patients (p18): please refer to response to comment 3.2.

      (2) We compared the peak frequencies of the high-frequency and low-frequency bands following taVNS. added the results in the supplementary materials:

      We note that neurophysiology underlying peak frequencies has not been thoroughly studied in the literature compared to the LF-band power or HF-band power. Therefore, we report this result as an exploratory analysis.

      (3) We have added the changes of QTc during and following taVNS in Figure 5 and showed that they were within 0.5 standard deviations of their daily fluctuations during and following taVNS treatment. We have now shown the changes of HRV during and following taVNS in Supplementary Figure 6 A-D. We added the change of high-frequency power following Reviewer #1’s comment 1.1. Overall, our results suggest that repetitive taVNS increased parasympathetic activities, while there is no evidence that acute taVNS significantly affected heart rate or QT.

      (4) The authors do not state how the QT was corrected and at what range of heart rates. Because all forms of corrections are approximations, the actual QT data should be reported along with the corrected QT.

      The corrected QT interval (QTc) estimates the QT interval at a standard heart rate of 60 bpm. In practice, we removed RR intervals outside of the 300 – 2000 ms range. Further, we removed ectopic beats, defined as RR intervals differing by more than 20% from the one proceeding. We used the Bazett formula to correct the QT intervals: . We have now clarified how QT was corrected in the Method section – Data processing (p35-36): “R-peaks were detected as local maxima in the QRS complexes. P-waves, T-waves, and QRS waves were delineated based on the wavelet transform (Figure 2A-C).34  RR intervals were preprocessed to exclude outliers, defined as RR intervals greater than 2 s or less than 300 ms. RR intervals with > 20% relative difference to the previous interval were considered ectopic beats and excluded from analyses. After preprocessing, RR intervals were used to calculate heart rate, heart rate variability, and corrected QT (QTc) based on Bazett's formula: .44 The corrected QT interval (QTc) estimates the QT interval at a standard heart rate of 60 bpm.”

      We have reported the actual QT data in the Result section (p10 and p 19):” Moreover, changes in corrected QT interval from Day 1 were significantly higher in the Sham group compared to the taVNS group (Figure 3B, Mann–Whitney U test, N(taVNS) = 94, N(Sham)=95, p-value < 0.001, Cohen’s d = -0.57). Similarly, uncorrected QT intervals from Day 1 were higher in the Sham group (Supplementary Figure 10A, Cohen’s d = -0.42).”

      “Supplementary Figure 10B-C shows the acute changes in uncorrected QT interval.”

      (5) The QT extraction method needs to be more robust. For example, in Figure 2C, the baseline voltage of the ECG is shifting while the threshold appears to be fixed. If indeed the threshold is not dynamic and does not account for baseline fluctuations (e.g., due to impedance changes from respiration), then the measures of the QT intervals were likely inaccurate.

      A robust method to estimate the QT interval is essential in our study. To this end, we used the state-of-the-art method to calculate QT intervals. We first applied a 0.5 Hz fifth-order high-pass Butterworth filter and a 60 Hz powerline filter on the ECG recording. The high-pass filtering is used to correct potential baseline fluctuations. Subsequently, a wavelet-based algorithm was used to delineate the QRS complex and T wave (Martínez, J. P., IEEE Transactions on Biomedical Engineering, 2004). In short, this algorithm identifies QRS based on modulus maxima of the wavelet transform of ECG signals. After localizing the R peak, the Q onset is detected as the beginning of the first modulus maximum before the modulus maximum pair created by the R wave. The detection is performed on wavelet transform at a small scale rather than on the original signal, minimizing the effect of baseline shift (see III Detection methods, (5), Cuiwei Li et al., IEEE TBME, 1995, Detection of ECG Characteristic Points Using Wavelet Transforms). T wave is detected similarly based on wavelet transform. Please refer to our response to comment 2.9.

      Martínez, J. P., Almeida, R., Olmos, S., Rocha, A. P., & Laguna, P. (2004). A wavelet-based ECG delineator: evaluation on standard databases. IEEE Transactions on Biomedical Engineering, 51(4), 570-581.

      In Figure 2C, the purple and green lines take the value of 1 at the QRS onset or the T wave offset; otherwise, 0, which might appear to be a threshold. We have now used verticle lines to denote the detected QRS onsets and T wave offsets. Please see below for a comparison of the annotation:

      We have clarified the details of extracting QT intervals from ECG recordings in the Method section (p31): “To calculate cardiac metrics, we first applied a 0.5 Hz fifth-order high-pass Butterworth filter and a 60 Hz powerline filter on ECG data to reduce artifacts. 35 We detected QRS complexes based on the steepness of the absolute gradient of the ECG signal using the Neurokit2 software package.35 R-peaks were detected as local maxima in the QRS complexes. P waves, T waves, and QRS complexes were delineated based on the wavelet transform of the ECG signals proposed by Martinez J. P. et al. (Figure 2A-C).36 This algorithm identifies the QRS complex by searching for modulus maxima, which are peaks in the wavelet transform coefficients that exceed specific thresholds. The onset of the QRS complex is determined as the beginning of the first modulus maximum before the modulus maximum pair created by the R wave. To identify the T wave, the algorithm searches for local maxima in the absolute wavelet transform in a search window defined relative to the QRS complex. Thresholding is used to identify the offset of the T wave.”

      We have modified Figure 2C for better clarity:

      More statistical rigor is needed. For example, in Figure 2D, the change in heart rate for days 5-7, 8-10, and 11-13 is clearly a bimodal distribution and as such, should not be analyzed as a single distribution. Similarly, Figure 2E also shows a bimodal distribution. Without the QT data, it is unclear whether this is due to the application of the heart rate correction method.

      Thank you for raising this concern. Several factors could contribute to the observed distribution of changes in heart rate for days 5-7, 8-10, and 11-13, as shown in Figure 2D. One such factor is the smaller sample size in the later days. The mean duration of hospitalization for the 24 subjects included in this study was 11.29 days, with a standard deviation of 6.43, respectively. Other factors include variations in medical history, SAH pathology, and clinical outcomes during hospitalization. Further analysis revealed that heart rate was lower in patients with improved mRS scores (Supplementary Figure 4B), suggesting that clinical outcomes might impact changes in heart rate. Understanding the association between cardiovascular metrics and clinical assessments, such as vasospasm and inflammation, could help decide whether future taVNS trials should control for these factors when evaluating the effects of taVNS on cardiovascular function. We are currently continuing to recruit SAH patients in this clinical trial, and we plan to perform such analyses in future studies.

      In the manuscript, we reported the effect size between the treatment groups for days 5-7, 8-10, and 11-13. This should be interpreted in conjunction with the characteristics of the distribution. To provide a rigorous interpretation of our results, we have now discussed these considerations in the discussion section (p28): “We noticed a high variance of change in heart rate for days 5 – 7, 8 – 10, and 11 – 13 for both treatment groups (Figure 2D). This may be due to the small sample size in the later days, given that the mean duration of hospitalization for the 24 subjects included in this study was 11.3 days with a standard deviation of 6.4. Differences in medical history and clinical outcomes during hospitalization may also explain the variance of change in heart rate for the later days. For example. heart rate was lower in patients with improved mRS scores (Supplementary Figure 4B). Understanding the association between cardiovascular metrics and clinical assessments, such as vasospasm and inflammation, could help decide whether future taVNS trials should control for these factors when evaluating the effects of taVNS on cardiovascular function.”

      To test our hypothesis that repetitive taVNS does not induce significant heart rate change, we performed a two-tailed equivalence test of heart rate change between the two treatment groups, including data from days 2-13 (Figure 2D, left panel). To verify the validity of this approach, we calculated the Bimodality Coefficient (BC) and performed the Dip Test for unimodality for the distribution of heart rate change for the two treatment groups. The Bimodality Coefficient (BC) is a measure that combines skewness and kurtosis to assess whether a distribution is bimodal or unimodal. A BC value greater than 0.555 typically indicates a bimodal distribution, whereas a BC value less than or equal to 0.555 suggests an unimodal distribution. The Dip Test is a statistical test that assesses the unimodality of a distribution. A non-significant p-value (p-value ≥ 0.05) indicates that the distribution is likely unimodal. This analysis suggests that the distributions of heart rate changes in both treatment groups (days 2 - 13) are unimodal (BC = 0.457 and p = 0.374 for the taVNS treatment group; BC = 0.421 and p = 0.656 for the sham treatment group). This finding provides justification for our statistical approaches.

      Figure 3A shows a number of outliers. A SDNN range of 200 msec should raise concern for a non-sinus rhythm such as arrhythmia or artifact, instead of sinus arrhythmia. Moreover, Figure 3B shows that the Sham RMSSD data distribution is substantially skewed by the presence of at least 3 outliers, resulting in lower RMSSD values compared to taVNS. What types of artifact or arrhythmia discrimination did the authors employ to ensure the reported analysis is on sinus rhythm? The overall results seem to be driven by outliers.

      Mild cardiac abnormalities are common in SAH patients. Therefore, change in cardiovascular metrics was expected to differ from healthy individuals, which makes studying the cardiovascular effect on taVNS extremely important in this context. Following your comment, we investigated whether the large SDNN change was due to arrhythmia or artifacts. Except for a single instance where one subject exhibited an SDNN change of 200 ms on a particular day, all other SDNN changes were less than 150 msec. We identified the subject and day associated with the largest SDNN change, which is Day 7. As shown in Author response image 1A and B, SDNN of this subject increased on day 7 while the heart rate (HR) of this subject decreased. Changes in HRV were inversely related to HR changes, suggesting shifts in sympathetic and parasympathetic tone. We checked the ECG recording and the extracted NN intervals (processed RR intervals) on that day. The NN intervals are more variate on day 7 compared to day 1 (Author response image 1C and D). To determine whether the significant variance observed between 5:01 am and 5:02 am was due to arrhythmia or artifacts, we closely examined the corresponding ECG signals (Author response image 1E and F). Based on our analysis, the elevated SDNN is unlikely to be attributed to artifacts.

      Author response image 1.

      Similarly, we identified the subjects and days corresponding to the most prominent RMSSD decrease in the sham treatment group. We verified the ECG quality for this subject and the accuracy of RR interval identification, and that there was no significant cardiovascular event during the subject’s stay in the ICU. Based on the inclusion and exclusion criteria defined in our protocol (Huguenard A et al.m PLOS ONE, 2024), we did not exclude these data from our analysis.

      Huguenard A, Tan G, Johnson G, Adamek M, Coxon A, et al. (2024) Non-invasive Auricular Vagus nerve stimulation for Subarachnoid Hemorrhage (NAVSaH): Protocol for a prospective, triple-blinded, randomized controlled trial. PLOS ONE 19(8): e0301154. https://doi.org/10.1371/journal.pone.0301154

      To ensure accurate inferences about sympathetic and parasympathetic tone from these cardiovascular metrics, we have rigorously refined our methodologies, including correcting RR intervals outliers, correcting ectopic peaks, using state-of-art algorithms to identify QRS complex, P wave, and T wave (please refer to response to comment 3.5), and performing factor analysis. In addition, no significant cardiac complications have been reported by the attending physicians for the subjects included in this study. Nonetheless, it is important to note that ECG patterns in patients with SAH differ from those in healthy individuals, potentially impacting the accuracy of R peak identification. For example, one identified R peak (out of 73) was Q peak (F in the above figure). The pathology associated with SAH complicates the precise calculation of cardiovascular metrics and the interpretation of the results. We are committed to continually improving our methodologies for assessing autonomic function in SAH patients. We have now discussed these limitations in the Discussion section (p31-32): “Mild cardiac abnormalities are common in SAH patients5, complicating the precise calculation of cardiovascular metrics from ECG signals and the interpretation of the results. Systematic verification of methods for calculating cardiovascular metrics to ensure their applicability in SAH patients is crucial.”

      The above concern will also affect the power analysis, which was reported by authors to have been performed based on the t-test assuming the medium effect size, but the details of sample size calculations were not reported, e.g., X% power, t-test assumed Bonferroni correction in the power analysis, etc.

      Thank you for raising this concern. The current study is part of the NAVSaH trial (NCT04557618), focusing on the trial’s secondary outcomes (Please refer to comment 2.1 and our responses). The main objective of this interim analysis is to evaluate the cardiovascular safety of the current taVNS protocol. Goal enrollment for the pilot NAVSaH trial is 50 patients, based on power calculations to detect significant differences in inflammatory cytokines, radiographic vasospasm, and chronic hydrocephalus. The detailed power analysis is described in the protocol (Huguenard A et al.m PLOS ONE, 2024):

      “Under a 2-by-2 repeated measures design consisting of two groups of patients, each measured at two time points, our goal is to compare the change across time in the taVNS group to the change across time in the Sham group. Based upon previous work from Koopman et al. [67], we assume our study will observe 1.1 standardized inflammatory cytokines mean change difference between the two groups. Using a two-sided, two-sample t-test, assuming both time points have equal variance and there is a weak correlation (i.e., 0.15) between measurement pairs, a sample size of 25 in each group achieves at least 80% power to detect a standardized difference of 1.1 in mean changes, with a significance level (alpha) of 0.05 [68].

      Based on our preliminary data, we assume this study will observe 25% and 55% severe vasospasm in the taVNS and Sham groups, respectively. Under a design with 2 repeated measurements (i.e., 2 raters), assuming a compound symmetry covariance structure with a Rho of 0.2, at a significance level (alpha) of 0.05, a sample size of 25 in each group achieves at least 80% power when the null proportion is 0.55, and the alternative proportion is 0.25 [69–71].

      As previously described, LV et al. [8] studied the relationship between cytokine levels and clinical endpoints in SAH, including hydrocephalus. From their outcomes, we predict a needed enrollment of approximately 50 to detect these endpoints. From our own preliminary data, with an incidence of chronic hydrocephalus 0% in treated patients and 28.6% in control (despite grade of hemorrhage), alpha = 0.05 and power = 0.80, the projected sample size to capture that change is approximately 44 patients.”

      In this study, we used power analysis to report the achieved power of insignificant findings. For example, a Mann-Whitney U test on heart rate change between the treatment groups revealed no significant differences. We then used power analysis to calculate the achieved power. We have added the details of power analysis in the Method section (p34): “We calculated the achieved power of tests on heart rate change between the treatment groups assuming a medium effect size (Cohen’s d of 0.5) and a Type I error probability (a) of 0.05. Given that the Mann-Whitney U test is a non-parametric counterpart to the t-test and that the asymptotic relative efficiency of the U test relative to the t-test is 0.95 with normal distributions, we estimated the achieved power based on the power of a two-sample t-test, which is 0.93. We have clarified this in the introduction section and in the method section (p6 and p38):

      “The current study is part of the NAVSaH trial (NCT04557618) and focuses on the trial’s secondary outcomes, including heart rate, QT interval, HRV, and blood pressure.30 This interim analysis aims to evaluate the cardiovascular safety of the taVNS protocol and to provide insights that will inform the application of taVNS in SAH patients. The primary outcomes of this trial, including change in the inflammatory cytokine TNF-α and rate of radiographic vasospasm, are available as a pre-print and currently under review.24”

      “In this study, we reported the statistical power achieved for tests that yielded non-significant results. The achieved power is calculated based on a two-sample t-test assuming a medium effect size (Cohen’s d of 0.5) and a Type I error probability (a) of 0.05.”

      If the study was designed to show a cardiovascular effect, I am surprised that N=10 per group was considered to be sufficiently powered given the extensive reports in the literature on how HRV measures (except when pathologically low) vary within individuals. Moreover, HRV measures are especially susceptible to noise, artifacts, and outliers.

      If the study was designed to show a lack of cardiovascular effect (as the conclusions and introduction seem to suggest), then a several-fold larger sample size is warranted.

      The primary goal of this study is to assess the cardiovascular safety of the current taVNS protocol in SAH patients (please refer to comments 2.1 and 3.8 and our responses). More specifically, we want to assess whether the current taVNS protocol is associated with bradycardia or QT prolongation. The data in this study included ECG signals and vital signals from 24 subjects recruited between 2021 and 2024. The total number of days in the ICU is 271 days, which corresponds to 542 taVNS/sham treatment sessions. These data allow us to detect significant cardiovascular effects of acute taVNS with high power. For example, the comparison of heart rate from pre- to post-treatment sessions between treatment groups had power > 99% (N1 = 188, N2 = 199, assuming 0.05 type I error probability, medium effect size two sample t-test).

      To safely conclude that there is no significant cardiovascular effect of repetitive taVNS on any given day following SAH, we would need to perform statistical tests between treatment groups on Day 1, Day 2, and Day N. In this context, 64 subjects per treatment group are required to achieve 80% power assuming medium effect size and 0.05 type I error probability (two-sample t-test). We have acknowledged this limitation in the Discussion section. Thank you for raising this concern!

      The results reported in this study treat each day as an independent sample for several reasons. First, heart rate and HRV metrics exhibited great daily variations (Figure in comment 3.7, for example). Their value on one day was not predictive of the metrics on another day, which could be due to medications, interventions, or individualized SAH recovery process during the patient’s stay in the ICU. Second, SAH patients in the ICU often experience rapid/daily changes in clinical status, including fluctuations in intracranial pressure, blood pressure, neurological status, and other vital signs. Also, the recovery process from SAH is highly individualized, with different patients exhibiting distinct trajectories of recovery or complications. Day-to-day cardiovascular function changes varied as the patient recovered or encountered setbacks. Moreover, we verified ECG signal quality, corrected outliers and artifacts in ECG processing, and employed a state-of-the-art QRS delineation method (Please refer to comment 3.5). All these ensure the accuracy of our reported results.

      The revised Discussion section now reads (31): ” Our study considers each day as an independent sample for the following considerations: 1. heart rate and HRV metrics exhibited great daily variations. Their value on one day was not predictive of the metrics on another day, which could be due to medications, interventions, or individualized SAH recovery process during the patient’s stay in the ICU. 2. SAH patients in the ICU often experience daily changes in clinical status, including fluctuations in intracranial pressure, blood pressure, neurological status, and other vital signs. 3. Day-to-day cardiovascular function changes varied as the patient recovered or encountered setbacks. To conclusively establish that there is no significant cardiovascular effect of repetitive taVNS on any given day following SAH, we would need to perform statistical tests between treatment groups for each day. In this context, 64 subjects per treatment group are required to achieve 80% power assuming medium effect size and 0.05 type I error probability (two-sample t-test).”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      (1) It is a nice study but lacks some functional data required to determine how useful these alleles will be in practice, especially in comparison with the figure line that stimulated their creation.

      We are grateful for this comment. For the usefulness of these alleles, figure 3 shows that specific and efficient genetic manipulation of one cell subpopulation can be achieved by mating across the DreER mouse strain to the rox-Cre mouse strain. In addition, figure 6 shows that R26-loxCre-tdT can effectively ensure Cre-loxP recombination on some gene alleles and for genetic manipulation. The expression of the tdT protein is aligned with the expression of the Cre protein (Alb roxCre-tdT and R26-loxCre-tdT, figure 2 and figure 5), which ensures the accuracy of the tracing experiments. We believe more functional data can be shown in future articles that use mice lines mentioned in this manuscript.

      (2) The data in Figure 5 show strong activity at the Confetti locus, but the design of the newly reported R26-loxCre line lacks a WPRE sequence that was included in the iSure-Cre line to drive very robust protein expression.

      Thank you for bringing up this point in the manuscript. In the R26-loxCre-tdT mice knock-in strategy, the WPRE sequence is added behind the loxCre-P2A-tdT sequence, as shown in Supplementary Figure 9.

      (3) The most valuable experiment for such a new tool would be a head-to-head comparison with iSure (or the latest iSure version from the Benedito lab) using the same CreER and target foxed allele. At the very least a comparison of Cre protein expression between the two lines using identical CreER activators is needed.

      Thank you for your valuable and insightful comment. The comparison results of R26-loxCre-tdT with iSuRe-Cre using Alb-CreER and targeting R26-Confetti can be found in Supplementary Figure 7 C-E, according to the reviewer’s suggestion.

      (4) Why did the authors not use the same driver to compare mCre 1, 4, 7, and 10? The study in Figure 2 uses Alb-roxCre for 1 and 7 and Cdh5-roxCre for 4 and 10, with clearly different levels of activity driven by the two alleles in vivo. Thus whether mCre1 is really better than mCre4 or 10 is not clear.

      Thank you for raising this concern. After screening out four robust versions of mCre, we generated these four roxCre knock-in mice. It is unpredictable for us which is the most robust mCre in vivo. It might be one or two mCre versions that work efficiently. For example, if Alb-mCre1 was competitive with Cdh5-mCre10, we can use them for targeting genes in different cell types, broadening the potential utility of these mice.

      (5) Technical details are lacking. The authors provide little specific information regarding the precise way that the new alleles were generated, i.e. exactly what nucleotide sites were used and what the sequence of the introduced transgenes is. Such valuable information must be gleaned from schematic diagrams that are insufficient to fully explain the approach.

      We appreciate your thoughtful suggestions. The schematic figures, along with the nucleotide sequences for the generation of mice, can be found in the revised Supplementary Figure 9.

      Reviewer #2 (Public Review):

      (1) The scenario where the lines would demonstrate their full potential compared to existing models has not been tested.

      Thank you for your thoughtful and constructive comment. The comparative analysis of R26-loxCre-tdT with iSuRe-Cre, employing Alb-CreER to target R26-Confetti, is provided in Supplementary Figure 7 C-E.

      (2) The challenge lies in performing such experiments, as low doses of tamoxifen needed for inducing mosaic gene deletion may not be sufficient to efficiently recombine multiple alleles in individual cells while at the same time accurately reporting gene deletion. Therefore, a demonstration of the efficient deletion of multiple floxed alleles in a mosaic fashion would be a valuable addition.

      Thank you for your constructive comments. Mosaic analysis using sparse labeling and efficient gene deletion would be our future direction using roxCre and loxCre strategies.

      (3) When combined with the confetti line, the reporter cassette will continue flipping, potentially leading to misleading lineage tracing results.

      Thank you for your professional comments. Indeed, the confetti used in this study can continue flipping, which would lead to potentially misleading lineage tracing results. Our use of R26-Confetti is to demonstrate the robustness of mCre for recombination. Some multiple-color mice lines that don’t flip have been published, for example, R26-Confetti2(10.1038/s41588-019-0346-6) and Rainbow (10.1161/CIRCULATIONAHA.120.045750). These reporters could be used for tracing Cre-expressing cells, without concerns of flipping of reporter cassettes.

      (4) Constitutive expression of Cre is also associated with toxicity, as discussed by the authors in the introduction.

      Thank you for your professional comments. The toxicity of constitutive expression of Cre and the toxicity associated with tamoxifen treatment in CreER mice line (10.1038/s44161-022-00125-6) are known to the field. This study can’t solve the toxicity of the constitutive expression of Cre in this work. Many mouse lines with constitutive Cre driven by different promoters are present across various fields, representing similar toxicity. To solve this issue, it would be possible to construct a new strategy that enables the removal of Cre after its expression.

      Reviewer #3 (Public Review):

      (1) Although leakiness is rather minor according to the original publication and the senior author of the study wrote in a review a few years ago that there is no leakiness(https://doi.org/10.1016/j.jbc.2021.100509).

      Thank you so much for your careful check. In this review (https://doi.org/10.1016/j.jbc.2021.100509), the writer’s comments on iSuRe-Cre are on the reader's side, and all summary words are based on the original published paper (10.1038/s41467-019-10239-4). Currently, we have tested iSuRe-Cre in our hands. We did detect some leakiness in the heart and muscle, but hardly in other tissues as shown in Author response image 1.

      Author response image 1.

      Leakiness in Alb CreER;iSuRe-Cre mouse line Pictures are representative results for 5 mice. Scale bars, white 100 µm.

      (2) I would have preferred to see a study, which uses the wonderful new tools to address a major biological question, rather than a primarily technical report, which describes the ongoing efforts to further improve Cre and Dre recombinase-mediated recombination.

      We gratefully appreciate your valuable comment. The roxCre and loxCre mice mentioned in this study provide more effective methods for inducible genetic manipulation in studying gene function. We hope that the application of our new genetic tools could help address some major biological questions in different biomedical fields in the future.

      (3) Very high levels of Cre expression may cause toxic effects as previously reported for the hearts of Myh6-Cre mice. Thus, it seems sensible to test for unspecific toxic effects, which may be done by bulk RNA-seq analysis, cell viability, and cell proliferation assays. It should also be analyzed whether the combination of R26-roxCre-tdT with the Tnni3-Dre allele causes cardiac dysfunction, although such dysfunctions should be apparent from potential changes in gene expression.

      We are sorry that we mistakenly spelled R26-loxCre-tdT into R26-roxCre-tdT in our manuscript. We have not generated the R26-roxCre-tdT mouse line. We also thank the reviewer for concerns about the toxicity of high Cre expression. The toxicity of constitutive expression of Cre and the toxicity of tamoxifen treatment of CreER mice line (10.1038/s44161-022-00125-6) are known to the field. This study can’t solve the toxicity of the constitutive expression of Cre in this work. Many mouse lines with constitutive Cre driven by different promoters are present across various fields, representing similar toxicity. To solve this issue, it would be possible to construct a new strategy that enables the removal of Cre after its expression.

      (4) Is there any leakiness when the inducible DreER allele is introduced but no tamoxifen treatment is applied? This should be documented. The same also applies to loxCre mice.

      In this study, we come up with new mice tool lines, including Alb roxCre1-tdT, Cdh5 roxCre4-tdT, Alb roxCre7-GFP, Cdh5 roxCre10-GFP and R26-loxCre-tdT. As the data shown in supplementary figure 1, supplementary figure 2, and figure 4D, Alb roxCre1-tdT, Cdh5 roxCre4-tdT, Alb roxCre7-GFP, Cdh5 roxCre10-GFP and R26-loxCre-tdT are not leaky. Therefore, if there is any leakiness driven by the inducible DreER or CreER allele, the leakiness is derived from the DreER or CreER. Additional pertinent experimental data can be referenced in Figure S4C, Figure S7A-B, and Figure S8A.

      (5) It would be very helpful to include a dose-response curve for determining the minimum dosage required in Alb-CreER; R26-loxCre-tdT; Ctnnb1flox/flox mice for efficient recombination.

      Thank you for your suggestion. We value your feedback and have incorporated your suggestion to strengthen our study. Relevant experimental data can be referenced in Figure S8E-G.

      (6) In the liver panel of Figure 4F, tdT signals do not seem to colocalize with the VE-cad signals, which is odd. Is there any compelling explanation?

      The staining in Figure 4F in the revision is intended to deliver optimized and high-resolution images.

      (7) The authors claim that "virtually all tdT+ endothelial cells simultaneously expressed YFP/mCFP" (right panel of Figure 5D). Well, it seems that the abundance of tdT is much lower compared to YFP/mCFP. If the recombination of R26-Confetti was mainly triggered by R26-loxCre-tdT, the expression of tdT and YFP/mCFP should be comparable. This should be clarified.

      Thank you so much for your careful check. We checked these signals carefully and didn't find the “much lower” tdT signal. As the file-loading website has a file size limitation, the compressed image results in some signal unclear. We attached clear high-resolution images here. Author response image 2 shows how we split the tdT signal and compared it with YFP/mCFP.

      Author response image 2.

      (8) In several cases, the authors seem to have mixed up "R26-roxCre-tdT" with "R26-loxCre-tdT". There are errors in #251 and #256.Furthermore, in the passage from line #278 to #301. In the lines #297 and #300 it should probably read "Alb-CreER; R26-loxCretdT;Ctnnb1flox/flox"" rather than "Alb-CreER;R26-tdT2;Ctnnb1flox/flox".

      We are grateful for these careful observations. We have corrected these typos accordingly.

      Recommendations for the authors:

      Reviewer #1:

      (1) However, for it to be useful to investigators a more direct comparison with the Benedito iSure line (or the latest version) is required as that is the crux of the study.

      Thank you for emphasizing this point, which we have now addressed in the revised manuscript and in Figure S7D-G.

      (2) I would like to know how the authors will make these new lines available to outside investigators.

      Please contact the lead author by email to consult about the availability of new mouse lines developed in this study.

      (3) The discussion is overly long and fails to address potential weaknesses. Much of it reiterates what was already said in the results section.

      We are thankful for your critical evaluation, which has helped us improve our discussion.

      Reviewer #2:

      (1) Assessing the efficiency and accuracy of the lines in mosaic deletions of multiple alleles and reporting them in single cells after low-dose tamoxifen exposure would be highly beneficial to demonstrate the full potential of the models.

      We appreciate your careful consideration of this issue. Our future endeavors will focus on mosaic analysis utilizing sparse labeling and efficient gene deletion, employing both roxCre and loxCre strategies.

      (2) Performing FACS analysis to confirm that all targeted (Cre reporter-positive) cells are also tdT-positive would provide more precise data and avoid vague statements like 'virtually all' or 'almost complete' in the results section:

      Line 166: Although mCre efficiently labeled virtually all targeted cells (Figure S3A-E)…

      Line 293: ... and not a single tdT+ hepatocyte 293 expressed Cyp2e1 (Figure 6D)... However, the authors do not provide any quantification. FACS would be ideal here.

      Line 244: ...expression of beta-catenin and GS almost disappeared in the 4W mutant sample... The resolution in the provided PDF is not adequate for assessment.

      Line 296: ... revealed almost complete deletion of Ctnnb1 in the Alb-CreER;R26-tdT2;Ctnnb1flox/flox mice...

      Thank you for suggesting these improvements, which have strengthened the robustness of our conclusions. In the revised version, we have incorporated FACS results that correspond to related sections. Additionally, a quantification statement has been included in the statistical analysis section. We appreciate your meticulous review and comments, which have significantly improved the clarity of our manuscript.

      (3) In the beginning of the results section, it is not clear which results are from this study and which are known background information (like Figure 1A). For example, it is not clear if Figure 1C presents data from R26-iSuRe-Cre. Please revise the text to more clearly present the experimental details and new findings.

      Thank you for this observation. Figure 1C belongs to this study, and the revised version has been modified to the related statement for improved clarity.

      (4) Experimental details regarding the genetic constructs and genotyping of the new knock-in lines are missing. Are R26 constructs driven by the endogenous R26 promoter or were additional enhancers used?

      Thank you for emphasizing this point. The schematic figures and nucleotide sequences for the generation of mice can be found in the revised Supplementary Figure 9, which can help to address this issue.

      (5) The method used to quantify mCre activity in terms of reporter+ target cells is not specified. From images or by FACS?

      Additionally, if images were used for quantification, it would be important to provide details on the number of images analyzed, the number of cells counted per image, and how individual cells were identified.

      Thank you for your comment. We have included the quantification statement in the statistical analysis section. Analyzing R26-Confetti+ target cells using FACS is challenging due to the limitations of the sorting instrument. Consequently, we quantified the related data by images. Each dot on the chart represents one sample, and the quantification for each mouse was conducted by averaging the data from five 10x fields taken from different sections.

      (6) Line 160: These data demonstrate that roxCre was functionally efficient yet non-leaky. Functional efficiency in vivo was not shown in the preceding experiments.

      Functional efficiency in vivo can be referred to in Figures S1-S2 and S4C.

      (7) It would be useful to provide a reference for easy vs low-efficiency recombination of different reporter alleles (lines 56-58).

      We are grateful for this comment, as it has allowed us to improve the clarity of our explanation. Consequently, we have made the necessary modifications.

      (8) Discussion on the potential drawbacks and limitations of the lines would be useful.

      We are thankful for your evaluation, which has significantly contributed to the enhancement of our discourse.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      TMC7 knockout mice were generated by the authors and the phenotype was analyzed. They found that Tmc7 is localized to Golgi and is needed for acrosome biogenesis.

      Strengths:

      The phenotype of infertility is clear, and the results of TMC7 localization and the failed acrosome formation are highly reliable. In this respect, they made a significant discovery regarding spermatogenesis.

      Weaknesses:

      There are also some concerns, which are mainly related to the molecular function of TMC7 and Figure 5.

      (1) It is understandable that TMC7 exhibits some channel activity in the Golgi and somehow affects luminal pH or Ca2+, leading to the failure of acrosome formation. On the other hand, since they are conducting the pH and calcium imaging from the cytoplasm, I do not think that the effect of TMC7 channel function in Golgi is detectable with their methods.

      We agree with the reviewer that there are no direct evidences showing the effect of TMC7 channel function in Golgi. We have changed the description in the revised manuscript.

      (2) Rather, it is more likely that they are detecting apoptotic cells that have no longer normal ion homeostasis.

      We thank the reviewer for raising this concern. We apologize for not labeling the postnatal stage in original Figure 5. We measured intracellular Ca2+, pH and ROS in PD30 testes (revised Fig. S6a-c), no apoptotic cells were observed at this stage (revised Fig. S6e, f). Apoptotic cells were found in the seminiferous tubules and cauda epididymis of 9-week-old Tmc7–/– mice (revised Fig. 5e-f). We have included TUNEL data in testis of PD21, PD30 and 9-week-old mice (revised Fig. 5e, f and Fig. S6e, f). In accordance with our findings, Tmc1 mutation has also been shown to result in reduced Ca2+ permeability, thus triggering hair cell apoptosis (Fettiplace, R, PNAS. 2022) [1].

      (3) Another concern is that n is only 3 for these imaging experiments.

      As suggested by the reviewer, more replicates were included in imaging experiments.

      Reviewer #2 (Public Review):

      Summary:

      This study presents a significant finding that enhances our understanding of spermatogenesis. TMC7 belongs to a family of transmembrane channel-like proteins (TMC1-8), primarily known for their role in the ear. Mutations to TMC1/2 are linked to deafness in humans and mice and were originally characterized as auditory mechanosensitive ion channels. However, the function of the other TMC family members remains poorly characterized. In this study, the authors begin to elucidate the function of TMC7 in acrosome biogenesis during spermatogenesis. Through analysis of transcriptomics datasets, they identify TMC7 as a transmembrane channel-like protein with elevated transcript levels in round spermatids in both mouse and human testis. They then generate Tmc7-/- mice and find that male mice exhibit smaller testes and complete infertility. Examination of different developmental stages reveals spermatogenesis defects, including reduced sperm count, elongated spermatids, and large vacuoles. Additionally, abnormal acrosome morphology is observed beginning at the early-stage Golgi phase, indicating TMC7's involvement in proacrosomal vesicle trafficking and fusion. They observed localization of TMC7 in the cis-Golgi and suggest that its presence is required for maintaining Golgi integrity, with Tmc7-/- leading to reduced intracellular Ca2+, elevated pH, and increased ROS levels, likely resulting in spermatid apoptosis. Overall, the work delineates a new function of TMC7 in spermatogenesis and the authors suggest that its ion channel activity is likely important for Golgi homeostasis. This work is of significant interest to the community and is of high quality.

      Strengths:

      The biggest strength of the paper is the phenotypic characterization of the TMC7-/- mouse model, which has clear acrosome biogenesis/spermatogenesis defects. This is the main claim of the paper and it is supported by the data that are presented.

      Weaknesses:

      The claim is that TMC7 functions as an ion channel. It is reasonable to assume this given what has been previously published on the more well-characterized TMCs (TMC1/2), but the data supporting this is preliminary here, and more needs to be done to solidify this hypothesis. The authors are careful in their interpretation and present this merely as a hypothesis supporting this idea.

      We appreciate the insightful comment. It is indeed a limitation of our study that we lack strong evidences to support that TMC7 functions as an ion channel. We have planned to conduct cellular electrophysiology in GC-1 cells heterologous expression of TMC7. However, TMC7 was trapped in the endoplasmic reticulum like TMC1 and TMC2 (Yu X, PNAS. 2020)[2], and failed to localize to the Golgi. According to the reviewer’s suggestion, we have made careful and more detailed interpretation the molecular function of TMC7 in the revised manuscript.

      Reviewer #3 (Public Review):

      Summary:

      In this study, Wang et al. have demonstrated that TMC7, a testis-enriched multipass transmembrane protein, is essential for male reproduction in mice. Tmc7 KO male mice are sterile due to reduced sperm count and abnormal sperm morphology. TMC7 co-localizes with GM130, a cis-Golgi marker, in round spermatids. The absence of TMC7 results in reduced levels of Golgi proteins, elevated abundance of ER stress markers, as well as changes of Ca2+ and pH levels in the KO testis. However, further confirmation is required because the analyses were performed with whole testis samples in spite of the differences in the germ cell composition in WT and KO testis. In addition, the causal relationships between the reported anomalies await thorough interrogation.

      Strengths:

      The microscopic images are of great quality, all figures are properly arranged, and the entire manuscript is very easy to follow.

      Weaknesses:

      (1) Tmc7 KO male mice show multiple anomalies in sperm production and morphogenesis, such as reduced sperm count, abnormal sperm head, and deformed midpiece. Thus, it is confusing that the authors focused solely on impaired acrosome biogenesis.

      We are grateful to your comments and suggestions. We agree and have added these defects in spermiogenesis of Tmc7–/– mice in the abstract and discussion sections of revised manuscript.

      (2) Further investigations are warranted to determine whether the abnormalities reported in this manuscript (e.g., changes in protein, Ca2+, and pH levels) are directly associated with the molecular function of TMC7 or are the byproducts of partially arrested spermiogenesis. Please find additional comments in "Recommendations for the authors".

      Thank you for raising this concern. Per your comments, we have included data of intracellular Ca2+, pH and ROS in PD21 testes. The intracellular homeostasis was impaired as early as PD21, indicating TMC7 depletion impairs cellular homeostasis which in turn results in arrested spermiogenesis.

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      As noted by all three reviewers, current flow cytometry data does not necessarily support the 'ion channel' hypothesis, thus the phenotypic analysis is compelling but the molecular mechanism of how TMC7 facilitates acrosome biogenesis remains incomplete. It is highly recommended for the authors to at least discuss or test alternative hypotheses (as reviewer #2 suggested) such as the possibility of acting as 'lipid scramblase'. Also, the authors need to provide further explanation for other morphological defects if TMC7 is truly a functional ion channel in Golgi (and thus later at acrosome), which is also related to the key question of whether TMC7 is a functional ion channel.

      We thank the reviewing editor for the comments and suggestions. We agree that our study lack strong evidences to support that TMC7 functions as an ion channel. We have discussed the possibility of TMC7 acting as 'lipid scramblase' as suggested. We have also included data of intracellular Ca2+, pH and ROS in PD21, PD30 testes.

      Indeed, Tmc7–/– mice exhibits other defects including abnormal head morphology and disorganized mitochondrial sheaths. As TMC7 is localized to the cis-Golgi apparatus and is required for maintaining Golgi integrity. Previous studies on Golgi localized proteins including GOPC (Yao R, PNAS. 2002)[2], HRB (Kang-Decker N. Science. 2001)[3] and PICK1(Xiao N, JCI. 2009)[4] exhibit similar defects in spermiogenesis with Tmc7–/– mice. It is possible that defects morphologies in Tmc7–/– mice might be due to impaired function of Golgi.

      Reviewer #1 (Recommendations For The Authors):

      (1) The authors should provide more details about the imaging experiments using FACS. Since they only describe catalog numbers (Beyotime, S1056, S1006, S0033S) for imaging reagents, it is not immediately clear what reagents they actually used. Since they used Fluo3, BCECF, and DCFH, it would be better to mention their names.

      Thanks. We have provided more detailed antibody information as suggested.

      (2) I am also concerned that in the FACS there is no information at all about laser wavelength and filter properties. This is especially important for BCECF because the wavelength spectrum changes with pH. Also, if there are any positive controls for these imaging reagents, such as ionophores, it would be more convincing to include them.

      Thank you for your comment. Excitation wavelength is 488nm for detecting Ca2+, pH and ROS in FACS. BCECF is the most popular pH probe to monitor cellular pH and the reagent from Beyotime (S1006) has been used by other studies (Chen S, Blood. 2016)[5], (Liu H, Cell Death Dis. 2022)[6]. To make the results more reliable, we have repeated these experiments in PD21 testes (revised Figure 5a-c). No positive controls for these reagents were used in our experiments.

      (3) As noted above, it is better to avoid directly linking the cell's abnormal ion homeostasis to TMC7 ion channel function in the text. The discussion should be changed to emphasize that the TMC7-deficient cells are apoptotic and that these physiological phenomena are occurring as a side effect of this apoptosis.

      Thank you for raising this concern. We agree with the reviewer that there are no direct evidences showing the effect of TMC7 channel function in Golgi and we have changed the description in the revised manuscript.

      We performed new experiment to measure apoptosis and intracellular Ca2+, pH and ROS in PD21 testes. No apoptotic cells were observed at this stage. However, impaired cellular homeostasis was still found in testis of PD21 Tmc7-/- mice. These data suggest that TMC7 depletion impairs cellular homeostasis and hence induces spermatid apoptosis.

      (4) While I understand that it appears to be difficult to experimentally verify the ion channel function of TMC7, it may be supportive to compare its amino acid sequence and/or 3D predicted structure with that of TMC1/2. Including a supplemental figure for this purpose would emphasize the possibility that TMC7 functions as an ion channel.

      We thank the reviewer for making this great suggestion. We compared the amino acid sequence and structure of TMC1, TMC2 with TMC7 respectively. TMC1 had 81% sequence similarity with TMC7 and the RMSD (Root Mean Square Deviation) was 3.079. TMC2 had 82% sequence similarity with TMC7, the RMSD was 2.176. These data suggest that TMC7 has similar amino acid sequence and predicted structure with TMC1/2 and might functions as an ion channel. We have included the predicted structures in revised Fig. S7.

      Author response image 1.

      Reviewer #2 (Recommendations For The Authors):

      I do not have any experimental comments or concerns to address, but I do ask that the authors consider an alternative hypothesis. Based on prior data demonstrating that TMC1 is a mechanosensitive ion channel, the authors reasonably assume that TMC7 may also function as an ion channel. Although the authors observe alterations in cytosolic Ca2+ and pH upon loss of TMC7 by flow cytometry, which begins to support this hypothesis, these data do not directly demonstrate ion channel activity.

      I was wondering if the authors had considered whether TMC7 could also function as a lipid scramblase. TMC1 has also been proposed to function as a Ca2+-inhibited scramblase, where knockout of TMC1 leads to a loss of phosphatidylserine (PS) exposure and membrane blebbing at the apical region of hair cells (Ballesteros, A. and Swartz, K., Science Advances, 2022). Furthermore, TMC proteins are structurally related to the Anoctamin/TMEM16 family of chloride channels and lipid scramblases, where TMEM16A-B are bona fide Ca2+-activated chloride channels, and TMEM16C-H are characterized as Ca2+-dependent scramblases. Based on their structural similarity and the observation that TMC1 may also exhibit lipid scrambling properties based on the PS exposure, I wonder if the authors may have data that support a TMC7 scramblase hypothesis. I was intrigued by this idea, especially given the authors' observations of large vacuoles in the seminiferous tubules and cauda epididymis and the vesicle accumulation phenotype in their TEM data. Incorporating this hypothesis into the discussion section, at minimum, could provide a valuable perspective, and this line of thought may lead to interesting data interpretation throughout the paper.

      We thank the reviewer for the valuable suggestion. We have discussed the possibility of TMC7 acting as 'lipid scramblase' as suggested.

      Reviewer #3 (Recommendations For The Authors):

      (1) Gene symbols should be italicized, and protein symbols should be capitalized.

      Thanks. We have made changes to the manuscript as recommended.

      (2) Tmc7 KO males show reduced sperm count, which alters the germ cell composition in the testis (Figure 2g). Thus, it is inappropriate to compare protein levels using whole testis lysates (Figure 3e, 4h, 5d, 5f). Instead, the same immunoblotting analyses could be done with purified round spermatids or 3-wk-old testis. Likewise, the significance of the intracellular Ca2+ and pH measurements is potentially diminished by the differences in the germ cell composition in WT and KO mice.

      We appreciate this constructive suggestion. We agree with the reviewer that whole testis lysates diminished the differences between WT and _Tmc7-/-_mice. However, we are unable purify round spermatids due to the lack of specific markers.

      (3) Figures 2i, 2j: How sperm motility was measured should be specified in the Methods.

      We thank you for your significant reminding and have added sperm motility assessment in Methods section.

      (4) Figure 4g: It does not make sense to compare the fluorescence intensity of these proteins without making sure that the seminiferous tubules are in the same stage. As shown in Figures S5a and S5b, TMC7 exhibits varied abundance in spermatids at different steps.

      We thank the reviewer for the insightful comment. We have replaced images in the same stage seminiferous tubules and compared the fluorescence intensity of new images as suggested.

      (5) Figure 4h: How were the band intensities measured? The third band from the left is visually stronger than the first one, but it does not seem to be so according to the column graph. The reviewer measured the intensity of GRASP65 bands relative to alpha-tubulin by ImageJ and obtained relative intensities of 0.35, 0.87, 0.6, and 0.08 for the bands from left to right. Additional replicates of the western blots should be included in the supplementary figures.

      Thank you for this insightful comment. The density and size of the blots were quantified by Image J. We have checked the first band from the left of GRASP65 and it seems that the protein was not fully transferred onto the PVDF membrane. We have performed new experiments and replaced the original bands (Revised Fig. 4h). Additional replicates of the western blots have been included in revised Fig. S8.

      (6) Figures 5a, 5b: Based on the observation of abnormal intracellular Ca2+ and pH levels in the KO germ cells, the authors concluded that TMC7 maintains the homeostasis of Golgi pH and ion (Lines 223-224, 263-264). However, intracellular Ca2+ and pH levels do not directly reflect those in the Golgi apparatus.

      We thank the reviewer for this important comment. We agree and have changed “Golgi” to “intracellular” as suggested.

      (7) Figure 5c: ROS is produced during apoptosis. Thus, it is not appropriate to conclude that the increased ROS levels in Tmc7 KO germ cells lead to apoptosis.

      According to the reviewer’s comment, we measured ROS and apoptosis in testis of PD21 and PD30 mice. ROS levels were increased, but no apoptotic cells were observed in testis of PD21 and PD30 Tmc7–/– mice. Apoptotic cells were observed in testis of 9-week-old Tmc7–/– mice (Revised Fig. 5e-f). These data suggest that TMC7 depletion results in the accumulation of ROS, thereby leads to apoptosis.

      (1) Fettiplace, R., D.N. Furness, and M. Beurg, The conductance and organization of the TMC1-containing mechanotransducer channel complex in auditory hair cells. Proc Natl Acad Sci U S A, 2022. 119(41): p. e2210849119.

      (2) Yu, X., et al., Deafness mutation D572N of TMC1 destabilizes TMC1 expression by disrupting LHFPL5 binding. Proc Natl Acad Sci U S A, 2020. 117(47): p. 29894-29903.

      (3) Kang-Decker, N., et al., Lack of acrosome formation in Hrb-deficient mice. Science, 2001. 294(5546): p. 1531-3.

      (4) Xiao, N., et al., PICK1 deficiency causes male infertility in mice by disrupting acrosome formation. J Clin Invest, 2009. 119(4): p. 802-12.

      (5) Chen, S., et al., Sympathetic stimulation facilitates thrombopoiesis by promoting megakaryocyte adhesion, migration, and proplatelet formation. Blood, 2016. 127(8): p. 1024-35.

      (6) Liu, H., et al., PRMT5 critically mediates TMAO-induced inflammatory response in vascular smooth muscle cells. Cell Death Dis, 2022. 13(4): p. 299.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This manuscript presents useful findings on several phage from deep sea isolates of Lentisphaerae strains WC36 and zth2 that further our understanding of deep sea microbial life. The manuscript's primary claim is that phage isolates augment polysaccharide use in Pseudomonas bacteria via auxiliary metabolic genes (AMGs). However, the strength of the evidence is incomplete and does not support the primary claims. Namely, there are not data presented to rule out phage contamination in the polysaccharide stock solution, AMGs are potentially misidentified, and there is missing evidence of successful infection.

      Thanks for the Editor’s and Reviewers’ positive and constructive comments, which help us improve the quality of our manuscript entitled “Deep-sea bacteriophages facilitate host utilization of polysaccharides” (paper#eLife-RP-RA-2023-92345). The comments are valuable, and we have studied the comments carefully and have made corresponding revisions according to the suggestions. We removed some uncertain results and strengthened other parts of the manuscript, which evidently improved the accuracy and impact of the revised version. Revised portions are marked in blue in the modified manuscript. Please find the detailed responses as following.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary: This manuscript describes the identification and isolation of several phage from deep sea isolates of Lentisphaerae strains WC36 and zth2. The authors observe induction of several putative chronic phages with the introduction of additional polysaccharides to the media. The authors suggest that two of the recovered phage genomes encode AMGs associated with polysaccharide use. The authors also suggest that adding the purified phage to cultures of Pseudomonas stutzeri 273 increased the growth of this bacterium due to augmented polysaccharide use genes from the phage. While the findings were of interest and relevance to the field, it is my opinion that several of the analysis fall short of supporting the key assertions presented.

      Thanks for your comments. We removed some uncertain results and strengthened other parts of the manuscript, which evidently improved the accuracy and impact of the revised version. Please find the detailed responses as following.

      Strengths: Interesting isolate of deep sea Lentisphaerae strains which will undoubtedly further our understanding of deep sea microbial life.

      Thanks for your positive comments.  

      Weaknesses:

      (1) Many of the findings are consistent with a phage contamination in the polysaccharide stock solution. 

      Thanks for your comments. We are very sure that the phages are specifically derived from the Lentisphaerae strain WC36 but not the polysaccharide stock solution. The reasons are as following: (1) the polysaccharide stock solution was strictly sterilized to remove any phage contamination; (2) we have performed multiple TEM checks of the rich medium supplemented with 10 g/L laminarin alone (Supplementary Fig. 1A) or in 10 g/L starch alone (Supplementary Fig. 1B), and there were not any phage-like structures, which confirmed that the polysaccharides (laminarin/starch) we used were not contaminated with any phage-like structures; in addition, we also observed the polysaccharides (laminarin/starch) directly by TEM and did not find any phage-like structures (Supplementary Fig. 2); (3) the polysaccharide (starch) alone could not promote the growth of Pseudomonas stutzeri 273, however, the supplement of starch together with the extracted Phages-WC36 could effectively facilitate the growth of Pseudomonas stutzeri 273 (Author response image 1). The above results clearly indicated the phages were derived from the Lentisphaerae strain WC36 but not the polysaccharide stock solution. 

      Author response image 1.

      Growth curve and status of Pseudomonas stutzeri 273 cultivated in basal medium, basal medium supplemented with 20 μl/mL Phages-WC36, basal medium supplemented with 5 g/L starch, basal medium supplemented with 5 g/L starch and 20 μl/mL Phages-WC36. 

       

      (2) The genes presented as AMGs are largely well known and studied phage genes which play a role in infection cycles.

      Thanks for your comments. Indeed, these AMGs may be only common in virulent phages, while have never been reported in chronic phages. In virulent phages, these genes typically act as lysozymes, facilitating the release of virions from the host cell upon lysis, or injection of viral DNA upon infection. However, the chronic phages do not lyse the host. Therefore, the persistence of these genes in chronic phages may be due to their ability to assist the host in metabolizing polysaccharides. Finally, according to your suggestions, we have weakened the role of AMGs and added “potential” in front of it. The detailed information is shown below.

      (3) The evidence that the isolated phage can infect Pseudomonas stutzeri 273 is lacking, putting into question the dependent results.

      Thanks for your comments. Actually, we selected many marine strains (Pseudomonadota, Planctomycetes, Verrucomicrobia, Fusobacteria, and Tenericutes isolates) to investigate whether Phages-WC36 could assist them in degradation and utilization of polysaccharides, and found that Phages-WC36 could only promote the growth of strain 273. It is reported that filamentous phages could recognize and bind to the host pili, which causes the pili to shrink and brings the filamentous phages closer to and possibly through the outer membrane of host cells. The possible mechanism of other chronic phages release without breaking the host might be that it was enclosed in lipid membrane and released from the host cells by a nonlytic manner. Thus, these chronic phages may have a wider host range. However, we were unable to further reveal the infection mechanism due to some techniques absence. Therefore, according to your suggestions, we have deleted this section in the revised manuscript.

      Reviewer #1 (Recommendations For The Authors):

      I have previously reviewed this manuscript as a submission to another journal in 2022. My recommendations here mirror those of my prior suggestions, now with further added details.

      Thanks for your great efforts for reviewing our manuscript and valuable suggestions for last and this versions.

      Specific comments:

      Comment 1: Line 32. Rephrase to "polysaccharides cause the induction of multiple temperate phages infecting two strains of Lentisphaerae (WC36 and zth2) from the deep sea."

      Thanks for your positive suggestion. We have modified this description as “Here, we found for the first time that polysaccharides induced the production of multiple temperate phages infecting two deep-sea Lentisphaerae strains (WC36 and zth2).” in the revised manuscript (Lines 31-33). 

      Comment 2: Line 66. "Chronic" infections are not "lysogenic" as described here, suggesting the former is a subcategory of the latter. If you are going to introduce lifecycles you need a brief sentence distinguishing "chronic" from "lysogenic"

      Thanks for your positive suggestion. We added this sentence as “Currently, more and more attention has been paid to chronic life cycles where bacterial growth continues despite phage reproduction (Hoffmann Berling and Maze, 1964), which was different from the lysogenic life cycle that could possibly lyse the host under some specific conditions.” in the revised manuscript (Lines 66-69).

      Comment 3: Line 72. Please avoid generalized statements like "a hand-full" (or "plenty" line 85). Try to be at least somewhat quantitative regarding how many chronic phages are known. This is a fairly common strategy among archaeal viruses. 

      Thanks for your suggestion. Given that some filamentous phages also have a chronic life cycle that is not explicitly reported, we cannot accurately estimate their numbers. According to your suggestions, we have modified these descriptions as “however, to our best knowledge, only few phages have been described for prokaryotes in the pure isolates up to date (Roux et al., 2019; Alarcón-Schumacher et al., 2022; Liu et al., 2022).” in the revised manuscript (Lines 73-75). In addition, the number of chronic phages in the biosphere cannot be accurately estimated, according to the latest report (Chevallereau et al., 2022), which showed that “a large fraction of phages in the biosphere are produced through chronic life cycles”. Therefore, we have modified this description as “Therefore, a large percentage of phages in nature are proposed to replicate through chronic life cycles” in the revised manuscript (Lines 87-88). 

      Comment 4: Line 93. While Breitbart 2012 is a good paper to cite here, there have been several, much more advanced analysis of the oceans virome. https://doi.org/10.1016/j.cell.2019.03.040 is one example, but there are several others. A deeper literature review is required in this section.  

      Thanks for your valuable suggestions. We have added some literatures and modified this description as “A majority of these viruses are bacteriophages, which exist widely in oceans and affect the life activities of microbes (Breitbart, 2012; Roux et al., 2016; Gregory et al., 2019; Dominguez-Huerta et al., 2022).” in the revised manuscript (Lines 94-97). 

      References related to this response:

      Roux, S., Brum, J.R., Dutilh, B.E., Sunagawa, S., Duhaime, M.B., Loy, A., Poulos, B.T., Solonenko, N., Lara, E., Poulain, J., et al. (2016) Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature 537:689-693. 

      Gregory, A.C., Zayed, A.A., Conceição-Neto, N., Temperton, B., Bolduc, B., Alberti, A., Ardyna, M., Arkhipova, K., Carmichael, M., Cruaud, C., et al. (2019) Marine DNA Viral Macro- and Microdiversity from Pole to Pole. Cell 177:1109-1123.e1114. 

      Dominguez-Huerta, G., Zayed, A.A., Wainaina, J.M., Guo, J., Tian, F., Pratama, A.A., Bolduc, B., Mohssen, M., Zablocki, O., Pelletier, E., et al. (2022) Diversity and ecological footprint of Global Ocean RNA viruses. Science 376:1202-1208.

      Comment 5: Line 137. I see the phage upregulation in Figure 1, however in the text and figure it would be good to also elaborate on what the background expression generally looks like. Perhaps a transcriptomic read normalization and recruitment to the genome with a display of the coverage map, highlighting the prophage would be helpful. Are the polysacharides directly influencing phage induction or is there some potential for another cascading effect?  

      Thanks for your comments. We have elaborated all expressions of phage-associated genes under different conditions in the Supplementary Table 1, which showed that the background expressions were very low. The numbers in Fig. 1C were the gene expressions (by taking log2 values) of strain WC36 cultured in rich medium supplemented with 10 g/L laminarin compared with the rich medium alone.

      In addition, our RT-qPCR results (Fig. 1D) also confirmed that these genes encoding phage-associated proteins were significantly upregulated when 10 g/L laminarin was added in the rich medium. According to your suggestions, we have modified this description as “In addition to the up-regulation of genes related to glycan transport and degradation, when 10 g/L laminarin was added in the rich medium, the most upregulated genes were phage-associated (e. g. phage integrase, phage portal protein) (Fig. 1C and Supplementary Table 1), which were expressed at the background level in the rich medium alone.” in the revised manuscript (Lines 136-140). Based on the present results, we speculate that polysaccharides might directly induce phage production, which needs to be verified by a large number of experiments in the future.

      Comment 6: Line 179. We need some assurance that phage was not introduced by your laminarin or starch supplement. Perhaps a check on the TEM/sequencing check of supplement itself would be helpful? This may be what is meant on Line 188 "without culturing bacterial cells" however this is not clearly worded if that is the case. Additional note, further reading reinforces this as a key concern. Many of the subsequent results are consistent with a contaminated starch stock. 

      Thanks for your comments. We are very sure that the phages are specifically derived from the Lentisphaerae strain WC36 but not the polysaccharide stock solution. The reasons are as following: (1) we have performed multiple TEM checks of the rich medium supplemented with 10 g/L laminarin alone (Supplementary Fig. 1A) or in 10 g/L starch alone (Supplementary Fig. 1B), and there were not any phage-like structures, which confirmed that the polysaccharides (laminarin/starch) we used are not contaminated with any phage-like structures. In addition, we also observed the polysaccharides (laminarin/starch) directly by TEM and did not find any phage-like structures (Supplementary Fig. 2). According to your suggestions, we have modified this description as “We also tested and confirmed that there were not any phage-like structures in rich medium supplemented with 10 g/L laminarin alone (Supplementary Fig. 1A) or in 10 g/L starch alone (Supplementary Fig. 1B), ruling out the possibility of phage contamination from the polysaccharides (laminarin/ starch).” in the revised manuscript (Lines 158-162) and “Meanwhile, we also checked the polysaccharides (laminarin/ starch) in rich medium directly by TEM and did not find any phage-like structures (Supplementary Fig. 2).” in the revised manuscript (Lines 178-180). (2) the polysaccharide stock solution was strictly sterilized to remove any phage contamination. (3) the polysaccharide (starch) alone could not promote the growth of Pseudomonas stutzeri 273, however, the supplement of starch together with the extracted Phages-WC36 could effectively facilitate the growth of Pseudomonas stutzeri 273 (Response Figure 1). The above results clearly indicated the phage was derived from the Lentisphaerae strain WC36 but not the polysaccharide stock solution. 

      In addition, given that polysaccharide was a kind of critical energy source for most microorganisms, we sought to ask whether polysaccharide also induces the production of bacteriophages in other deep-sea bacteria. To this end, we cultured deep-sea representatives from other four other phyla (including Chloroflexi, Tenericutes, Proteobacteria, and Actinobacteria) in the medium supplemented with laminarin/starch, and checked the supernatant of cells suspension through TEM as described above. We could not find any phage-like structures in these cells suspension (Author reaponse image 2), which also confirmed that there was no phage contamination in the polysaccharides.

      Author response image 2.

      Growth curve and status of Pseudomonas stutzeri 273 cultivated in basal medium, basal medium supplemented with 20 μl/mL Phages-WC36, basal medium supplemented with 5 g/L starch, basal medium supplemented with 5 g/L starch and 20 μl/mL Phages-WC36.   

      Author response image 3.

      TEM observation of the supernatant of cells suspension of a Chloroflexi strain, a Tenericutes strain, a Proteobacteria strain and an Actinobacteria strain that cultivated in the rich medium supplemented with 10 g/L laminarin and 10 g/L starch. No phage-like particles could be observed.  

      Comment 7: Line 223. Correct generalized wording "long time". 

      Thanks for your comments. We have changed “after for a long time” to “after 30 days” in the revised manuscript (Line 197).

      Comment 8: Line 229. Please more explicitly describe what these numbers are (counts of virion like structures - filamentous and hexagonal respectively?), the units (per µL?), and how these were derived. The word "around" should be replaced with mean and standard deviation values for each count from replicates, without which these are not meaningful.

      Thanks for your comments. The average numbers per microliter (µL) of filamentous and hexagonal phages in each condition were respectively calculated by randomly choosing ten TEM images. According to your suggestions, we have modified this description as “Specifically, the average number per microliter of filamentous phages (9.7, 29 or 65.3) extracted from the supernatant of strain WC36 cultured in rich medium supplemented with 10 g/L laminarin for 5, 10 or 30 days was higher than that cultured in rich medium supplemented with 5 g/L laminarin (4.3, 13.7 or 35.3) (Fig. 3B). The average number per microliter of hexagonal phages (9, 30, 46.7) extracted from the supernatant of strain WC36 cultured in rich medium supplemented with 10 g/L laminarin for 5, 10 or 30 days was higher than that cultured in rich medium supplemented with 5 g/L laminarin (4, 11.3 or 17.7) (Fig. 3C).” in the revised manuscript (Lines 203-210).

      Comment 9: Line 242. This section should be included in the discussion of Figure 2 - around line 194.

      Thanks. According to your suggestion, we have moved this section to the discussion corresponding to Figure 2 (Lines 183-191).

      Comment 10: Figure 3. Stay consistent in the types of figures generated per strain. Figure 3A should be a growth curve.

      Thanks for your comments. Actually, figure 3A was a growth curve, the corresponding description “(A) Growth curve of strain WC36 cultivated in either rich medium alone or rich medium supplemented with 5 g/L or 10 g/L laminarin for 30 days.” was shown in the Figure 3A legend in this manuscript.

      Comment 11: Line 312. Move the discussion of AMGs to after the discussion of the phage genome identification.

      Thanks for your valuable comments. According to your suggestions, we have moved the discussion of AMGs to after the discussion of the phage genome identification.

      Comment 12: Line 312. It would be informative to sequence in-bulk each of your treatments as opposed to just sequencing the viral isolates (starch and no host included) to see what viruses can be identified in each. ABySS is also not a common assembler for viral analysis. Is there literature to support it as a sufficient tool in assembling viral genomes? What sequencing depths were obtained in your samples?

      Thanks for your comments. In previous studies, we did sequence the starch or laminarin alone (no host included) and did not detect any phage-related sequences. The introduction of ABySS software was shown in these literatures (Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, Birol I. ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res. 2017 May;27(5):768-777; Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009 Jun;19(6):1117-23.), which were also used to assemble viral genomes in these literatures (Guo Y, Jiang T. First Report of Sugarcane Mosaic Virus Infecting Goose Grass in Shandong Province, China. Plant Dis. 2024 Mar 21. doi: 10.1094/PDIS-11-23-2514-PDN; Tang M, Chen Z, Grover CE, Wang Y, Li S, Liu G, Ma Z, Wendel JF, Hua J. Rapid evolutionary divergence of Gossypium barbadense and G. hirsutum mitochondrial genomes. BMC Genomics. 2015 Oct 12;16:770.). The sequencing depth of the phages of strain WC36 and zth2 were 350x and 365x, respectively.

      Comment 13: Line 323. Replace "eventually" with more detail about what was done to derive the genomes. Were these the only four sequences identified as viral?

      Thanks for your comments. We have used the ABySS software (http://www.bcgsc.ca/platform/bioinfo/software/abyss) to perform genome assembly with multiple-Kmer parameters. VIBRANT v1.2.1 (Kieft et al., 2020), DRAM-v (Shaffer et al., 2020), VirSorter v1.0.5 (with categories 1 (“pretty sure”) and 2 (“quite sure”)) (Roux et al., 2015) and VirFinder v1.1 (with statistically significant viral prediction: score > 0.9 and P-value < 0.05) (Ren et al., 2017) with default parameters were used to identify viral genomes from these assembly sequences by searching against the both cultured and non-cultured viral NCBI-RefSeq database (http://blast.ncbi.nlm.nih.gov/) and IMG/VR database (Camargo et al., 2023). The GapCloser software (https://sourceforge.net/projects/soapdenovo2/files/GapCloser/) was subsequently applied to fill up the remaining local inner gaps and correct the single base polymorphism for the final assembly results. All the detailed processes were described in the supplementary information. The virus sequences with higher scores are only these four, but they are not complete genomes. Some virus sequences with shorter sequences and lower scores were excluded.

      Comment 14: Line 328. We need some details about the host genomes here. How were these derived? What is their completeness/contamination? What is their size? If the bins are poor, these would not serve as a reliable comparison to identify integrated phage.

      Thanks for your comments. For genomic sequencing, strains WC36 and zth2 were grown in the liquid rich medium supplemented with 5 g/L laminarin and starch and harvested after one week of incubation at 28 °C. Genomic DNA was isolated by using the PowerSoil DNA isolation kit (Mo Bio Laboratories Inc., Carlsbad, CA). Thereafter, the genome sequencing was carried out with both the Illumina NovaSeq PE150 (San Diego, USA) and Nanopore PromethION platform (Oxford, UK) at the Beijing Novogene Bioinformatics Technology Co., Ltd. A complete description of the library construction, sequencing, and assembly was performed as previously described (Zheng et al., 2021). We used seven databases to predict gene functions, including Pfam (Protein Families Database, http://pfam.xfam.org/), GO (Gene Ontology, http://geneontology.org/) (Ashburner et al., 2000), KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/) (Kanehisa et al., 2004), COG (Clusters of Orthologous Groups, http://www.ncbi.nlm.nih.gov/COG/) (Galperin et al., 2015), NR (Non-Redundant Protein Database databases), TCDB (Transporter Classification Database), and Swiss-Prot (http://www.ebi.ac.uk/uniprot/) (Bairoch and Apweiler, 2000). A whole genome Blast search (E-value less than 1e-5, minimal alignment length percentage larger than 40%) was performed against above seven databases.

      The completeness of the genomes of strains WC36 and zth2 were 100%, which were checked by the CheckM v1.2.2. The size of the genome of strains WC36 and zth2 were 3,660,783 bp and 3,198,720bp, respectively. The complete genome sequences of strains WC36 and zth2 presented in this study have been deposited in the GenBank database with accession numbers CP085689 and CP071032, respectively. 

      Moreover, to verify whether the absence of microbial contamination in phage sequencing results, we used the new alignment algorithm BWA-MEM (version 0.7.15) to perform reads mapping of host WGS to these phages. We found that all the raw reads of host strains (WC36 and zth2) were not mapping to these phages sequences (Author response image 3, shown as below). In addition, we also performed the evaluation of the assembly graph underlying the host consensus assemblies. Clean reads were mapped to the bacterial complete genome sequences by the Bowtie 2 (version 2.5.0), BWA (version 0.7.8) and SAMTOOLS (version 0.1.18). The results showed that the total mismatch rate of strains WC36 and zth2 were almost 0% and 0.03%, respectively (Author response table 1, shown as below). In addition, we also collected the cells of strains WC36 and zth2, and then sent them to another company for whole genome sequencing (named WC36G and ZTH, GenBank accession numbers CP151801 and CP119760, respectively). The completeness of the genomes of strains WC36G and ZTH were also 100%. The size of the genome of strains WC36G and ZTH were 3,660,783bp and 3,198,714bp, respectively. The raw reads of strains WC36G and zth2 were also not mapping to the phages sequences. Therefore, we can confirm that these bacteriophage genomes were completely outside of the host chromosomes. 

      Author response image 4.

      The read mapping from WGS to phage sequences.

      Author response table 1.

      Sequencing depth and coverage statistics.

      References related to this response:

      Zheng, R., Liu, R., Shan, Y., Cai, R., Liu, G., and Sun, C. (2021b) Characterization of the first cultured free-living representative of Candidatus Izemoplasma uncovers its unique biology ISME J 15:2676-2691. 

      Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium Nat Genet 25:25-29. 

      Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., and Hattori, M. (2004) The KEGG resource for deciphering the genome Nucleic Acids Res 32:D277-280. 

      Galperin, M.Y., Makarova, K.S., Wolf, Y.I., and Koonin, E.V. (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database Nucleic Acids Res 43:D261-269. 

      Bairoch, A., and Apweiler, R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 Nucleic Acids Res 28:45-48.

      Comment 15: Line 333. This also needs some details. What evidence do you have that these are not chromosomal? If not chromosomal where can they be found? Sequencing efforts should also be able to yield extrachromosomal elements such as plasmids etc... If you were to sequence your purified isolate cultures from the rich media alone and include all assemblies (not just those binned for example) as a reference, would you be able to recruit viral reads? The way this reads suggests that Chevallereau et al., worked specifically with these phage, which is not the case - please rephrase.

      Thanks for your comments. We carefully compared the bacteriophage genomes with those of the corresponding hosts (strains WC36 and zth2) using Galaxy Version 2.6.0 (https://galaxy.pasteur.fr/) (Afgan et al., 2018) with the NCBI BLASTN method and used BWA-mem software for read mapping from host whole genome sequencing (WGS) to these bacteriophages. These analyses both showed that the bacteriophage genomes are completely outside of the host chromosomes. Therefore, we hypothesized that the phage genomes might exist in the host in the form similar to that of plasmid.

      Comment 16: Line 335. More to the point here that we need confirmation that these phages were not introduced in the polysaccharide treatment

      Thanks for your comments. Please find our answers for this concern in the responses for comment 1 of “weakness” part and comment 6 of “Recommendations For The Authors” part.

      Comment 17: Line 342. Lacking significant detail here. Phylogeny based on what gene(s), how were the alignments computed/refined, what model used etc..?

      Thanks for your comments. According to your suggestions, all the related information was shown in this section “Materials and methods” of this manuscript. The maximum likelihood phylogenetic tree of Phage-WC36-2 and Phage-zth2-2 was constructed based on the terminase large subunit protein (terL). These proteins used to construct the phylogenetic trees were all obtained from the NCBI databases. All the sequences were aligned by MAFFT version 7 (Katoh et al., 2019) and manually corrected. The phylogenetic trees were constructed using the W-IQ-TREE web server (http://iqtree.cibiv.univie.ac.at) with the “GTR+F+I+G4” model (Trifinopoulos et al., 2016). Finally, we used the online tool Interactive Tree of Life (iTOL v5) (Letunic and Bork, 2021) to edit the tree. 

      Comment 18: Line 346. How are you specifically defining AMGs in this study? Most of these are well-known and studied phage genes with specific life cycle functions and could not be considered as polysaccharide processing AMGs even though in host cells many do play a role in polysaccharide processing systems. A substantially deeper literature review is needed in this section, which would ultimately eliminate most of these from the potential AMG pools. Further, the simple HMM/BLASTp evalues are not sufficient to support the functional annotation of these genes. At a minimum, catalytic/conserved regions should be identified, secondary structures compared, and phylogenetic analysis (where possible) developed etc... My recommendation is to eliminate this section entirely from the manuscript. 

      Categorically:

      - Glycoside hydrolase (various families), glucosaminidases, and transglycosylase are all very common to phage and operate generally as a lysins, facilitating the release of virions from the host cell upon lysis, or injection of viral DNA upon infection https://doi.org/10.3389/fmicb.2016.00745 (and citations therein) https://doi.org/10.1016/j.cmi.2023.10.018 etc... In order to confirm these as distinct AMGs we would need a very detailed analysis indicating that these are not phage infection cycle/host recognition related, however I strongly suspect that under such interrogation, these would prove to be as such.

      -TonB related systems including ExbB are well studied among phages as part of the trans-location step in infection. These could not be considered as AMGs. https://doi.org/10.1128/JB.00428-19. Other TonB dependent receptors play a role in host recognition.

      -Several phage acetyltransferases play a role in suppressing host RNA polymerase in order to reserve host cell resources for virion production, including polysaccharide production. https://doi.org/10.3390/v12090976. Further it has been shown that the E. coli gene neuO (O-acetyltransferase) is a homologue of lambdoid phage tail fiber genes https://doi.org/10.1073/pnas.0407428102. I suspect the latter is also the case here and this is a tail fiber gene.

      Thanks for your valuable comments. According to your suggestions, we have reanalyzed these AMGs and made some modifications (the new version Fig. 5A, shown as below). These genes encoding proteins associated with polysaccharide transport and degradation may be only common in virulent phages, and have never been reported in chronic phages. Unlike virulent phages, these genes typically act as lysozymes, facilitating the release of virions from the host cell upon lysis, or injection of viral DNA upon infection, chronic phages do not lyse the host. It is reported that, filamentous phages could recognize and bind to the host pili, which causes the pili to shrink and brings the filamentous phages closer to and possibly through the outer membrane of host cells (Riechmann et al., 1997; Sun et al., 1987). The possible mechanism of other chronic phage release without breaking the host might be that it was enclosed in lipid membrane and released from the host cells by a nonlytic manner. It has recently been reported that the tailless Caudoviricetes phage particles are enclosed in lipid membrane and are released from the host cells by a nonlytic manner (Liu et al., 2022), and the prophage induction contributes to the production of membrane vesicles by Lacticaseibacillus casei BL23 during cell growth (da Silva Barreira et al., 2022). Therefore, the persistence of these genes in chronic phages may be due to their ability to assist the host in metabolizing polysaccharides. 

      Finally, according to your suggestions, we have weakened the role of AMGs and added “potential” in front of it.

      References related to this response:

      Riechmann L, Holliger P. (1997) The C-terminal domain of TolA is the coreceptor for filamentous phage infection of E. coli Cell 90:351-60.

      Sun TP, Webster RE. (1987) Nucleotide sequence of a gene cluster involved in entry of E colicins and single-stranded DNA of infecting filamentous bacteriophages into Escherichia coli J Bacteriol 169:2667-74. 

      Liu Y, Alexeeva S, Bachmann H, Guerra Martníez J.A, Yeremenko N, Abee T et al. (2022) Chronic release of tailless phage particles from Lactococcus lactis Appl Environ Microbiol 88: e0148321. da Silva Barreira, D., Lapaquette, P., Novion Ducassou, J., Couté, Y., Guzzo, J., and Rieu, A. Spontaneous prophage induction contributes to the production of membrane vesicles by the gram-positive bacterium Lacticaseibacillus casei BL23. mBio_._ 2022;13:e0237522.

      Comment 19: Line 354. To make this statement that these genes are missing from the host, we would need to know that these genomes are complete.

      Thanks for your comments. The completeness of the genomes of strains WC36 and zth2 were 100%, which were checked by the CheckM v1.2.2. The size of the genome of strains WC36 and zth2 were 3,660,783 bp and 3,198,720bp, respectively. The complete genome sequences of strains WC36 and zth2 presented in this study have been deposited in the GenBank database with accession numbers CP085689 and CP071032, respectively. In addition, we also collected the cells of strains WC36 and zth2, and then sent it to another company for whole genome sequencing (named WC36G and ZTH, GenBank accession numbers CP151801 and CP119760, respectively). The completeness of the genomes of strains WC36G and ZTH were also 100%. The size of the genome of strains WC36G and ZTH were 3,660,783bp and 3,198,714bp, respectively. Therefore, these genomes of strains WC36 and zth2 were complete and circular.    

      Comment 20: Figure 5. Please see https://peerj.com/articles/11447/ and https://doi.org/10.1093/nar/gkaa621 for a detailed discussion on vetting AMGs. Several of these should be eliminated according to the standards set in the field. More specifically, and by anecdotal comparison with other inoviridae genomes, for Phage-WC36-1 and Phage-zth2-1, I am not convinced that the transactional regulator and glycoside hydrolase are a part of the phage genome. The phage genome probably ends at the strand switch.

      Thanks for your comments. According to your suggestions, we have analyzed these two articles carefully and modified the genome of Phage-WC36-1 and Phage-zth2-1 by anecdotal comparison with other inoviridae genomes. As you said, the transactional regulator and glycoside hydrolase are not a part of the phage genome.

      The new version Fig. 5A was shown.

      References related to this response:

      Shaffer, M., Borton, M.A., McGivern, B.B., Zayed, A.A., La Rosa, S.L., Solden, L.M., Liu, P., Narrowe, A.B., Rodrgíuez-Ramos, J., Bolduc, B., et al. (2020) DRAM for distilling microbial metabolism to automate the curation of microbiome function Nucleic Acids Res 48:8883-8900 

      Pratama, A.A., Bolduc, B., Zayed, A.A., Zhong, Z.P., Guo, J., Vik, D.R., Gazitúa, M.C., Wainaina, J.M., Roux, S., and Sullivan, M.B. (2021) Expanding standards in viromics: in silico evaluation of dsDNA viral genome identification, classification, and auxiliary metabolic gene curation PeerJ 9:e11447

      Comment 21: Line 380. This section needs to start with detailed evidence that this phage can even infect this particular strain. Added note, upon further reading the serial dilution cultures are not sufficient to prove these phage infect this Pseudomonas. We need at a minimum a one-step growth curve and wet mount microscopy. It is much more likely that some carry over contaminant is invading the culture and influencing OD600. With the given evidence, I am not at all convinced that these phages have anything to do with Pseudomonas polysaccharide use and I recommend either drastically revising this section or eliminating it entirely.

      Line 386-389. Could this be because you are observing your added phage in the starch enriched media while no phage were introduced with the "other types of media" so none would be observed? This could have nothing to do with infection dynamics. Further, this would also be consistent with your starch solution being contaminated by phage.

      Line 399. Again consistent with the starch media being contaminated.

      Line 401-408. This is more likely to do with the augmentation of the media with an additional carbon source and not involving the phage. 

      Line 410. I am not convinced that these viruses infect the Pseudomonas strain. Extensive further evidence of infection is needed to make these assertions.  Figure 6A. We need confirmation that the isolate culture remains pure and there are no other contaminants introduced with the phage.

      Thanks for your comments. We have proved that the polysaccharides (laminarin/ starch) didn't contaminate any phages above. Actually, we selected many marine strains (Pseudomonadota, Planctomycetes, Verrucomicrobia, Fusobacteria, and Tenericutes isolates) to investigate whether Phages-WC36 could assist them in degradation and utilization of polysaccharides, and found that Phages-WC36 could only promote the growth of strain 273. The presence of filamentous phages and hexagonal phages was detected in the supernatant of strain 273 cultured in basal medium supplemented with 5 g/L starch and 20 μl/mL Phages-WC36. After 3 passages of serial cultivation in basal medium supplemented with 5 g/L starch, we found that filamentous phages and hexagonal phages were also present in basal medium supplemented with starch, but not in the basal medium, which may mean that Phages-WC36 could infect strain 273 and starch is an important inducer. In addition, the Phages-WC36 used in the growth assay of strain 273 were multiple purified and eventually suspended in SM buffer (0.01% gelatin, 50 mM Tris-HCl, 100 mM NaCl and 10 mM MgSO4). Thus, these phages are provided do not contain some extracellular enzymes and/or nutrients. In addition, we set up three control groups in the growth assay of strain 273: basal medium, basal medium supplemented with Phages-WC36 and basal medium supplemented with starch. If the Phages-WC36 contains some extracellular enzymes and/or nutrients, strain 273 could also grow well in the basal medium supplemented only with Phages-WC36. However, the poor growth results of strain 273 cultivated in the basal medium supplemented with Phages-WC36 further confirmed that there were not some extracellular enzymes and/or nutrients in these phages.

      Finally, the possible mechanism of the chronic phage release without breaking the host might be that it was enclosed in lipid membrane and released from the host cells by a nonlytic manner. Thus, these chronic phages may have a wider host range. However, we were unable to further disclose the infection mechanism in this paper. Therefore, according to your suggestions, we have deleted this section entirely.

      Comment 27: Line 460. Details about how these genomes were reconstructed is needed here.  

      Thanks for your comments. According to your suggestions, we have added the detailed information about the genome sequencing, annotation, and analysis as “Genome sequencing, annotation, and analysis of strains WC36 and zth2 For genomic sequencing, strains WC36 and zth2 were grown in the liquid rich medium supplemented with 5 g/L laminarin and starch and harvested after one week of incubation at 28 °C. Genomic DNA was isolated by using the PowerSoil DNA isolation kit (Mo Bio Laboratories Inc., Carlsbad, CA). Thereafter, the genome sequencing was carried out with both the Illumina NovaSeq PE150 (San Diego, USA) and Nanopore PromethION platform (Oxford, UK) at the Beijing Novogene Bioinformatics Technology Co., Ltd. A complete description of the library construction, sequencing, and assembly was performed as previously described (Zheng et al., 2021b). We used seven databases to predict gene functions, including Pfam (Protein Families Database, http://pfam.xfam.org/), GO (Gene Ontology, http://geneontology.org/) (Ashburner et al., 2000), KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/) (Kanehisa et al., 2004), COG (Clusters of Orthologous Groups, http://www.ncbi.nlm.nih.gov/COG/) (Galperin et al., 2015), NR (Non-Redundant Protein Database databases), TCDB (Transporter Classification Database), and Swiss-Prot (http://www.ebi.ac.uk/uniprot/) (Bairoch and Apweiler, 2000). A whole genome Blast search (E-value less than 1e-5, minimal alignment length percentage larger than 40%) was performed against above seven databases.” in the revised manuscript (Lines 333-351).

      Comment 28: Line 462. Accession list of other taxa in the supplement would help here.  

      Thanks for your comments. The accession numbers of these strains were displayed behind these strains in Figure 1A. According to your suggestions, we have added an accession list of these taxa (Supplementary Table 6) in the revised manuscript.

      Comment 29: Line 463. Is there any literature to support that these are phylogenetically informative genes for Inoviridae?  

      Thanks for your comments. There are some literatures (Zeng et al, 2021; Evseev et al, 2023) to support that these are phylogenetically informative genes for Inoviridae. We have added these literatures in the revised manuscript. 

      References related to this response:

      Zeng, J., Wang, Y., Zhang, J., Yang, S., and Zhang, W. (2021) Multiple novel filamentous phages detected in the cloacal swab samples of birds using viral metagenomics approach Virol J 18:240

      Evseev, P., Bocharova, J., Shagin, D., and Chebotar, I. (2023) Analysis of Pseudomonas aeruginosa isolates from patients with cystic fibrosis revealed novel groups of filamentous bacteriophages. Viruses 15: 2215

      Reviewer #2 (Public Review):

      Summary: This paper investigates virus-host interactions in deep-sea bacteriophage systems which employ a seemingly mutualistic approach to viral replication in which the virus aids host cell polysaccharide import and utilization via metabolic reprogramming. The hypothesis being tested is supported with solid and convincing evidence and the findings are potentially generalizable with implications for our understanding of polysaccharide-mediated virus-host interactions and carbon cycles in marine ecosystems more broadly.

      Thanks for your positive comments.

      Strengths: This paper synthesizes sequencing and phylogenic analyses of two Lentisphaerae bacteria and three phage genomes; electron microscopy imaging of bacterial/phage particles; differential gene expression analyses; differential growth curve analyses, and differential phage proliferation assays to extract insights into whether laminarin and starch can induce both host growth and phage proliferation. The data presented convincingly demonstrate that both host culture density and phage proliferation increase as a result having host, phage, and polysaccharide carbon source together in culture.

      Thanks for your positive comments.  

      Weaknesses (suggestions for improvement): 

      (1) The article would be strengthened by the following additional experiment: providing the phage proteins hypothesized to be aiding host cell growth (red genes from Figure 5...TonB system energizer ExbB, glycosidases, etc) individually or in combination on plasmids rather than within the context of the actual phage itself to see if such additional genes are necessary and sufficient to realize the boosts in host cell growth/saturation levels observed in the presence of the phages tested.

      Thanks for your valuable comments. It is a really good idea to express individually or in combination on plasmids to see the effects of those polysaccharide-degradation proteins in the host cell. However, at present, we failed to construct the genetic and expression system for the strictly anaerobic strain WC36, which hindering our further detailed investigation of the functions of those polysaccharide-degradation proteins. In our lab, we are trying our best to build the genetic and expression system for strain WC36. We will definitely test your idea in the future. 

      (2) The paper would also benefit from additional experiments focused on determining how the polysaccharide processing, transport, and metabolism genes are being used by the phages to either directly increase viral infection/replication or else to indirectly do so by supporting the growth of the host in a more mutualistic manner (i.e. by improving their ability to import, degrade, and metabolize polysaccharides).  

      Thanks for your valuable comments. Indeed, due to the chronic phage genome is not within the chromosome of the host, it is very hard to disclose the exact auxiliary process and mechanism of chronic phages. At present, we are trying to construct a genetic manipulation system for the strictly anaerobic host WC36, and we will gradually reveal this auxiliary mechanism in the future. In addition, combined with the reviewer 1’s suggestions, the focus of revised manuscript is to emphasize that polysaccharides induce deep-sea bacteria to release chronic phages, and most of the content of phage assisting host metabolism of polysaccharides has been deleted.

      (3) The introduction would benefit from a discussion of what is known regarding phage and/or viral entry pathways that utilize carbohydrate anchors during host entry. The discussion could also be improved by linking the work presented to the concept of "selfishness" in bacterial systems (see for instance Giljan, G., Brown, S., Lloyd, C.C. et al. Selfish bacteria are active throughout the water column of the ocean. ISME COMMUN. 3, 11 (2023) https://doi.org/10.1038/s43705-023-00219-7). The bacteria under study are gram negative and it was recently demonstrated (https://www.nature.com/articles/ismej201726) that "selfish" bacteria sequester metabolizable polysaccharides in their periplasm to advantage. It is plausible that the phages may be hijacking this "selfishness" mechanism to improve infectivity and ENTRY rather than helping their hosts to grow and profilerate so they can reap the benefits of simply having more hosts to infect. The current work does not clearly distinguish between these two distinct mechanistic possibilities. The paper would be strengthened by at least a more detailed discussion of this possibility as well as the author's rationale for interpreting their data as they do to favor the "mutualistic" interpretation. In the same light, the paper would benefit from a more careful choice of words which can also help to make such a distinction more clear/evident/intentional. As currently written the authors seem to be actively avoiding giving insights wrt this question.  

      Thanks for your valuable comments. According to your suggestions, we have added the related discussion as “Moreover, it was recently demonstrated that selfish bacteria, which were common throughout the water column of the ocean, could bind, partially hydrolyze, and transport polysaccharides into the periplasmic space without loss of hydrolysis products (Reintjes et al., 2017; Giljan et al., 2023). Based on our results, we hypothesized that these chronic phages might also enter the host through this “selfishness” mechanism while assisting the host in metabolizing polysaccharides, thus not lysing the host. On the other hand, these chronic phages might hijack this “selfishness” mechanism to improve their infectivity and entry, rather than helping their hosts to grow and proliferate, so they could reap the benefits of simply having more hosts to infect. In the future, we need to construct a genetic operating system of the strictly anaerobic host strain WC36 to detailedly reveal the relationship between chronic phage and host.” in the revised manuscript (Lines 305-316). 

      References related to this response:

      Reintjes, G., Arnosti, C., Fuchs, B.M., and Amann, R. (2017) An alternative polysaccharide uptake mechanism of marine bacteria ISME J 11:1640-1650

      Giljan, G., Brown, S., Lloyd, C.C., Ghobrial, S., Amann, R., and Arnosti, C. (2023) Selfish bacteria are active throughout the water column of the ocean ISME Commun 3:11

      (4) Finally, I would be interested to know if the author’s sequencing datasets might be used to inform the question raised above by using bacterial immunity systems such as CRISPR/Cas9. For example, if the phage systems studied are truly beneficial/mutualistic for the bacteria then it’s less likely that there would be evidence of targeted immunity against that particular phage that has the beneficial genes that support polysaccharide metabolism.

      Thanks for your comments. According to your suggestions, we have carefully analyzed the genome of strain WC36, and found that there were no CRISPR/Cas9-related genes. Considering our results that the number of chronic phages was increased with the prolongation of culture time, we speculated that host might have no targeted immunity against these chronic phages.

      Reviewer #2 (Recommendations For The Authors):

      There are some minor grammatical errors and unclear statements (lines 99-100, 107-109, 163, 222, 223, 249-250, 254) which should also be fixed before final publication. 

      Thanks for your valuable comments. We have fixed these minor grammatical errors and unclear statements in the revised manuscript.

      Lines 99-100: we have modified this description as “For instance, AMGs of marine bacteriophages have been predicted to be involved in photosynthesis (Mann et al., 2003), nitrogen cycling (Ahlgren et al., 2019; Gazitúa et al., 2021), sulfur cycling (Anantharaman et al., 2014; Roux et al., 2016), phosphorus cycling (Zeng and Chisholm, 2012), nucleotide metabolism (Sullivan et al., 2005; Dwivedi et al., 2013; Enav et al., 2014), and almost all central carbon metabolisms in host cells (Hurwitz et al., 2013).” in the revised manuscript (Lines 100-105).

      Lines 107-109: we have modified this description as “However, due to the vast majority of deep-sea microbes cannot be cultivated in the laboratory, most bacteriophages could not be isolated.” in the revised manuscript (Lines 110-111).

      Line 163: we have modified this description as “Based on the growth curve of strain WC36, we found that the growth rate of strictly anaerobic strain WC36 was relatively slow.” in the revised manuscript (Lines 149-151).

      Lines 222-223: we have modified this description as “Regardless of whether the laminarin was present, the bacterial cells kept their cell shape intact, indicating they were still healthy after 30 days” in the revised manuscript (Lines 195-197).

      Lines 249-250: we have modified this description as “However, the entry and exit of the hexagonal phages into the WC36 cells were not observed.” in the revised manuscript (Lines 190-191).

      Line 254: we have modified this description as “To explore whether the production of bacteriophages induced by polysaccharide is an individual case, we further checked the effect of polysaccharides on another cultured deep-sea Lentisphaerae strain zth2.” in the revised manuscript (Lines 213-215).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Galanti et al. present an innovative new method to determine the susceptibility of large collections of plant accessions towards infestations by herbivores and pathogens. This work resulted from an unplanned infestation of plants in a greenhouse that was later harvested for sequencing. When these plants were extracted for DNA, associated pest DNA was extracted and sequenced as well. In a standard analysis, all sequencing reads would be mapped to the plant reference genome and unmapped reads, most likely originating from 'exogenous' pest DNA, would be discarded. Here, the authors argue that these unmapped reads contain valuable information and can be used to quantify plant infestation loads.

      For the present manuscript, the authors re-analysed a published dataset of 207 sequenced accessions of Thlaspi arvense. In this data, 0.5% of all reads had been classified as exogenous reads, while 99.5% mapped to the T. arvense reference genome. In a first step, however, the authors repeated read mapping against other reference genomes of potential pest species and found that a substantial fraction of 'ambiguous' reads mapped to at least one such species. Removing these reads improved the results of downstream GWAs, and is in itself an interesting tool that should be adopted more widely.

      The exogenous reads were primarily mapped to the genomes of the aphid Myzus persicae and the powdery mildew Erysiphe cruciferarum, from which the authors concluded that these were the likely pests present in their greenhouse. The authors then used these mapped pest read counts as an approximate measure of infestation load and performed GWA studies to identify plant gene regions across the T. arvense accessions that were associated with higher or lower pest read counts. In principle, this is an exciting approach that extracts useful information from 'junk' reads that are usually discarded. The results seem to support the authors' arguments, with relatively high heritabilities of pest read counts among T. arvense accessions, and GWA peaks close to known defence genes. Nonetheless, I do feel that more validation would be needed to support these conclusions, and given the radical novelty of this approach, additional experiments should be performed.

      A weakness of this study is that no actual aphid or mildew infestations of plants were recorded by the authors. They only mention that they anecdotally observed differences in infestations among accessions. As systematic quantification is no longer possible in retrospect, a smaller experiment could be performed in which a few accessions are infested with different quantities of aphids and/or mildew, followed by sequencing and pest read mapping. Such an approach would have the added benefit of allowing causally linking pest read count and pest load, thereby going beyond correlational associations.

      On a technical note, it seems feasible that mildew-infested leaves would have been selected for extraction, but it is harder to explain how aphid DNA would have been extracted alongside plant DNA. Presumably, all leaves would have been cleaned of live aphids before they were placed in extraction tubes. What then is the origin of aphid DNA in these samples? Are these trace amounts from aphid saliva and faeces/honeydew that were left on the leaves? If this is the case, I would expect there to be substantially more mildew DNA than aphid DNA, yet the absolute read counts for aphids are actually higher. Presumably read counts should only be used as a relative metric within a pest organism, but this unexpected result nonetheless raises questions about what these read counts reflect. Again, having experimental data from different aphid densities would make these results more convincing.

      We agree with the reviewer that additional aphid counts at the time of (or prior to) sequencing would have been ideal, but unfortunately we do not have these data. However, compared to such counts one strength of our sequencing-based approach is that it (presumably) integrates over longer periods than a single observation (e.g. if aphid abundances fluctuated, or winged aphids visited leaves only temporarily), and that it can detect pathogens even when invisible to our eyes, e.g. before a mildew colony becomes visible. Moreover, the key point of our study is that we can detect variation in pest abundance even in the absence of count data, which are really time consuming to collect.

      Conducting a new experiment, with controlled aphid infestations and continuous monitoring of their abundances, to test for correlation between pest abundance and the number of detected reads would require resequencing at least 30-50% of the collection for the results to be reliable. It would be a major experimental study in itself.

      Regarding the origin of aphid reads and the differences in read-counts between e.g. aphids and mildew, we believe this should not be of concern. DNA contamination is very common in all kinds of samples, but these reads are simply discarded in other studies. For example, although we collected and handled samples using gloves, MG-RAST detected human reads (Hominidae, S2 Table), possibly from handling the plants during transplanting or phenotyping 1-2 weeks before sequencing. Therefore, although we did remove aphids from the leaves at collection, aphid saliva or temporary presence on leaves must have been enough to leave detectable DNA traces. Additionally, the fact that the M. persicae load strongly correlates with the Buchnera aphidicola load (R2\=0.86, S6 Table), is reassuring. This obligate aphid symbiont is expected to be found in high amounts when sequencing aphids (see e.g. The International Aphid Genomics Consortium (2010))

      The higher amount of aphid compared to mildew reads, can probably be explained by aphids having expanded more than mildew at the time of plant collection, but most importantly, as already mentioned by the reviewer, the read-counts were meant to compare plant accessions rather then pests to one another. We are interested in relative not absolute values. Comparisons between pest species are a challenge because they can be influenced by several factors such as the availability of sequences in the MG-RAST database and the DNA extraction kit used, which is plant-specific and might bias towards certain groups. All these potential biases are not a concern when comparing different plants as they are equally subject to these biases.

      Reviewer #2 (Public Review):

      Summary:

      Galanti et al investigate genetic variation in plant pest resistance using non-target reads from whole-genome sequencing of 207 field lines spontaneously colonized by aphids and mildew. They calculate significant differences in pest DNA load between populations and lines, with heritability and correlation with climate and glucosinolate content. By genome-wide association analyses they identify known defence genes and novel regions potentially associated with pest load variation. Additionally, they suggest that differential methylation at transposons and some genes are involved in responses to pathogen pressure. The authors present in this study the potential of leveraging non-target sequencing reads to estimate plant biotic interactions, in general for GWAS, and provide insights into the defence mechanisms of Thlaspi arvense.

      Strengths:

      The authors ask an interesting and important question. Overall, I found the manuscript very well-written, with a very concrete and clear question, a well-structured experimental design, and clear differences from previous work. Their important results could potentially have implications and utility for many systems in phenotype-genotype prediction. In particular, I think the use of unmapped reads for GWAS is intriguing.

      Thank you for appreciating the originality and potential of our work.

      Weaknesses:

      I found that several of the conclusions are incomplete, not well supposed by the data and/or some methods/results require additional details to be able to be judged. I believe these analyses and/or additional clarifications should be considered.

      Thank you very much for the supportive and constructive comments. They helped us to improve the manuscript.

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      The authors address an interesting and significant question, with a well-written manuscript that outlines a clear experimental design and distinguishes itself from previous work. However, some conclusions seem incomplete, lacking sufficient support from the data, or requiring additional methodological details for proper evaluation. Addressing these limitations through additional analyses or clarifications is recommended.

      Reviewer #2 (Recommendations For The Authors):

      Major comments:

      - So far it is not clear to me how read numbers were normalised and quantified. For instance, Figure 1C only reports raw read numbers. In L149: "Prior to these analyses, to avoid biases caused by different sequencing depths, we corrected the read counts for the total numbers of deduplicated reads in each library and used the residuals as unbiased estimates of aphid, mildew and microbe loads". Was library size considered? Is the load the ratio between exogenous vs no exogenous reads? It is described in L461, but according to this, read counts were normalised and duplicated reads were removed. Now, why read counts were used? As opposite to total coverage / or count of bases per base? I cannot follow how variation in sequencing quality was considered. I can imagine that samples with higher sequencing depth will tend to have higher exogenous reads (just higher resolution and power to detect something in a lower proportion).

      Correcting for sequencing depth/library size is indeed very important. As the reviewer noted, we had explained how we did this in the methods section (L464), and we now also point to it in the results (L151):

      “Finally, we log transformed all read counts to approximate normality, and corrected for the total number of deduplicated reads by extracting residuals from the following linear model, log(read_count + 1) ∼ log(deduplicated_reads), which allowed us to quantify non-Thlaspi loads, correcting for the sequencing depth of each sample.”

      We showed the uncorrected read-counts only in Fig 1 to illustrate the orders of magnitude but used the corrected read-counts (also referred to as “loads”) for all subsequent analyses.

      In our view, theoretically, the best metric to correct the number of reads of a specific contaminant organism, is the total number of DNA fragments captured. Importantly, this is not well reflected by the total number of raw reads because of PCR and optical duplicates occurring during library prep and sequencing. For this reason we estimated the total number of reads captured multiplying total raw reads (after trimming) by the deduplication rate obtained from FastQC (methods L409-411). This metric reflects the amount of DNA fragments sampled better than the raw reads. Also it better reflects MG-RAST metrics as this software also deduplicates reads (Author response image 1 below). We also removed duplicates in our strict mappings to the M. persicae and B. aphidicola genomes.

      Coverage is not a good option for correction, because it is defined for a specific reference genome and many of the read-counts output by MG-RAST do not have a corresponding full assembly. Moreover, coverage and base counts are influenced by read size, which depends on library prep and is not included in the read-counts produced by MG-RAST.

      Author response image 1.

      Linear correlations between the number of MG-RAST reads post-QC and either total (left) or deduplicated (right) reads from fastq files of four full samples (not only unmapped reads).

      - The general assumption is that plants with different origins will have genetic variants or epigenetic variations associated with pathogen resistance, which can be tracked in a GWAS. However, plants from different regions will also have all variants associated with their origin (isolation by state as presented in the manuscript). In line 169: "Having established that our method most likely captured variation in plant resistance, we were interested in the ecological drivers of this variation". It is not clear to me how variation in plant resistance is differentiated from geographical variation (population structure). in L203: "We corrected for population structure using an IBS matrix and only tested variants with Minor Allele Frequency (MAF) > 0.04 (see Methods).". However, if resistant variants are correlated with population structure as shown in Table 1, how are they differentiated? In my opinion, the analyses are strongly limited by the correlation between phenotype and population structure.

      The association of any given trait with population structure is surely a very important aspect in GWAS studies and when looking at correlations of traits with environmental variables. If a trait is strongly associated with population structure, then disentangling variants associated with population structure vs. the ones associated with the trait can indeed be challenging, a good example being flowering time in A. thaliana (e.g. Brachi et al. 2013).

      In our case, although the pest and microbiome loads are associated with population structure to some extent, this association is not very strong. This can be observed for example in Fig. 1C, where there is no clear separation of samples from different regions. This means that we can correct for population structure (in both GWAS and correlations with climatic variables) without removing the signals of association. It is possible that other associations were missed if specific variants were indeed strongly associated with structure, but these would be unreliable within our dataset, so it is prudent to exclude them.

      - Similarly, in L212: "we still found significant GWA peaks for Erysiphales but not for other types of exogenous reads (excluding isolated, unreliable variants) (Figure 3A and S3 Figure)." In a GWA analysis, multiple variants will constitute an association pick (as shown for instance in main Figure 3A) only when the pick is accentuated by lockage disequilibrium around the region under selection (or around the variant explaining phenotypic variation in this case). However, in this case, I suspect there is a strong component of population structure (which still needs to be corroborated as suggested in the previous comment). But if variants are filtered by population structure, the only variants considered are those polymorphic within populations. In this case, I do not think clear picks are expected since most of the signal, correlated with population has been removed. Under this scenario, I wonder how informative the analyses are.

      As mentioned above, the traits we analyse (aphid and mildew loads) are only partially associated with population structure. This is evident from Fig. 1C (see answer above) but also from the SNP-based heritability (Table 1, last column) which measures indeed the proportion of variance explained by genetic population structure. Although some variance is explained (i.e. the reviewer is correct that there is some association) there is still plenty of leftover variance to be used for GWAS and correlations with environmental variables. The fact that we still find GWAS peaks confirms this, as otherwise they would be lost by the population structure correction included in our mixed model.

      - How were heritability values calculated? Were related individuals filtered out? I suggest adding more detail in both the inference of heritability and the kinship matrix (IBS matrix). Currently missing in methods (for heritability I only found the mention of an R package in the caption of Table 1).

      We somehow missed this in the methods and thank the reviewer for noticing. We now added this paragraph to the chapter “Exogenous reads heritability and species identification”:<br /> “To test for variation between populations we used a general linear model with population as a predictor. To measure SNP-based heritability, i.e. the proportion of variance explained by kinship, we used the marker_h2() function from the R package heritability (Kruijer and Kooke 2019), which uses a genetic distance matrix as predictor to compute REML-estimates of the genetic and residual variance. We used the same IBS matrix as for GWAS and for the correlations with climatic variables.”

      We also added the reference to the R package heritability to the Table 1 caption.

      - Figure 2C. in line 188: "Although the baseline levels of benzyl glucosinolates were very low and probably sometimes below the detection level, plant lines where benzyl glucosinolate was detected had significantly lower aphid loads (over 70% less reads) in the glasshouse (Figure 3C)". It is not clear to me how to see these values in Figure 2C. From the boxplot, the difference in aphid loads between detected and not detected benzyl seems significantly lower. From the boxplot distribution is not clear how this difference is statistically significant. It rather seems like a sampling bias (a lot of non-detected vs low detected values). Is the difference still significant when random subsampling of groups is considered?

      Here the “70% less reads” refers to the uncorrected read-counts directly (difference in means between samples where benzyl-GS were detected vs. not). We agree with the reviewer that this is confusing when referred to figure 2C which depicts the corrected M. persicae load (residuals). We therefore removed that information.

      Regarding the significance of the difference, we re-calculated the p value with the Welch's t-test, which accounts for unequal variances, and with a bootstrap t-test. Both tests still found a significant difference. We now report the p value of the Welch’s t-test.

      - I think additional information regarding the read statistics needs to be improved. At the moment some sections are difficult to follow. I found this information mainly in Supplementary Table 1. I could not follow the difference in the manuscript and supplementary materials between read (read count), fragment, ambiguous fragments, target fragments, etc. I didn't find information regarding mean coverage per sample and relative plant vs parasite coverage. This lack of clarity led me to some confusion. For instance, in L207: "We suspected that this might be because some non-Thlaspi reads were very similar to these highly conserved regions and, by mapping there, generated false variants only in samples containing many non-Thlaspi reads". I find it difficult to follow how non-Thlaspi reads will interfere with genotyping. I think the fact that the large pick is lost after filtering reads is already quite insightful. However, in principle I would expect the relative coverage between non-Thlaspi:Thlaspi reads to be rather low in all cases. I would say below 1%. Thus, genotyping should be relatively accurate for the plant variants for the most part. In particular, considering genotyping was done with GATK, where low-frequency variants (relative coverage) should normally be called reference allele for the most part.

      We agree with the reviewer that some clarification over these points is necessary! We modified Supplementary Table 1 to include coverage information for all samples before and after removal of ambiguous reads and explained thoroughly how each value in the table was obtained. Regarding reads and fragments, we define each fragment as having two reads (R1 and R2). The classification into Target, Ambiguous and Unmapped reads was based on fragments, so we used that term in the table, but referring to reads has the same meaning in this context as for example an unmapped read is a read whose fragment was classified as unmapped.

      We did not include the pest coverage specifically, because this cannot be calculated for any of the read counts obtained with MG-RAST as this tool is mapping to online databases where genome size is not necessarily known. What is more meaningful instead are the read counts, which are in Supplementary tables 2 and 6. Importantly as mentioned in other answers, if different taxa are differently represented in the databases this does not affect the comparison of read counts across different samples, but only the comparison of different taxa which was not used for any further analyses.

      Regarding the ambiguous reads causing unreliable variants, these occur only in very few regions of the Thlaspi genome that are highly conserved in evolution or of very low complexity. In these regions reads generated from both plant or for instance aphid DNA, can map, but the ones from aphid might contain variants when mapping to the Thlaspi reference genome (L207 and L300). The reviewer is right that there is only a very small difference in average coverage when removing those ambiguous reads (~1X, S1 Table), but that is not true for those few regions where coverage changes massively when removing ambiguous reads as shown on the right side Y axes of S2 Figure. Therefore these unreliable variants are not low-frequency and therefore not removed by GATK.

      - L215. I am not very convinced with the enrichment analyses, justified with a reference (52). For instance, how many of the predicted picks are not close to resistance genes? How was the randomisation done? At the moment, the manuscript reads rather anecdotally by describing only those picks that effectively are "close" to resistance genes. For instance, if random windows (let's say 20kb windows) are sampled along the genome, how often there are resistant genes in those random windows, and how is the random sampling compared with observed picks (windows).

      Enrichment is by definition an increase in the proportion of true positives (observed frequency: proportion of significant SNPs located close to a priori candidate genes) compared to the background frequency (number of all SNPs located close to a priori candidate genes). So the background likelihood of SNPs to fall into a priori candidate SNPs (i.e. the occurrence of a priori candidate genes in randomly sampled windows, as suggested by the reviewer) is already taken into account as the background frequency. We now explained more extensively how enrichment is calculated in the relevant methods section (L545-549), but it is an extensively used method, established in a large body of literature, so it can be found in many papers (e.g. Atwell et al. 2010, Brachi et al. 2010, Kawakatsu et al. 2016, Kerdaffrec et al. 2017, Sasaki et al. 2015-2019-2022, Galanti et al. 2022, Contreras-Garrido et al. 2024).

      Although we had already calculated an upper bound for the FDR based on the a priori candidates, as in previous literature, we now further calculated the significance of the enrichment for the Bonferroni-corrected -log(p) threshold for Erysiphales. Calculating significance requires adopting a genome rotation scheme that preserves the LD structure of the data, as described in the previously mentioned literature (eg. Kawakatsu et al. 2016, Sasaki et al. 2022). Briefly, we calculated a null distribution of enrichments by randomly rotating the p values and a priori candidate status of the genetic variants within each chromosome, for 10 million permutations. We then assessed significance by comparing the observed enrichment to the null distribution. We found that the enrichment at the Bonferroni corrected -log(p) threshold is indeed significant for Erysiphales (p = 0.016). We added this to the relevant methods section and the code to the github page.

      In addition, many other genes very close (few kb max) to significant SNPs were not annotated with the “defense response” GO term but still had functions relatable to it. Some examples are CAR8, involved in ABA signalling, PBL7 in stomata closure and SRF3 in cell wall building and stress response  (Fig 3D). This means that our enrichment is actually most likely underestimated compared to if we had a more complete functional annotation.

      - L247. Additional information is needed regarding sampling. It is not clear to me why methylation analyses are restricted to 20 samples, contrary to whole genome analyses.

      The sampling is best described in the original paper (on natural DNA methylation variation; Galanti et al. 2022), although the most important parts are repeated in the first chapter of the methods.<br /> Regarding methylation analysis, they are not restricted to 20 samples. Only the DMR calling was restricted to the 20 vs. 20 samples with the most divergent values (of pest loads) to identify regions of variation. This analysis was used to subset the genome to potential regions associated with pest presence rather than thoroughly testing actual methylation variants associated with pest presence. The latter was done in the second step, EWAS, which was based on the whole dataset with the exclusions of samples with high non-conversion rate. This left 188 samples for EWAS. We added this number in the new manuscript (L251 and L571).

      To clarify, we made a few additions to the results (L250) and methods (last two subchapters) sections, where we explain the above.

      - No clear association with TEs: in L364: "Erysiphales load was associated with hypomethylated Copia TEs upstream of MAPKKK20, a gene involved in ABA-mediated signaling and stomatal closure. Since stomatal closure is a known defense mechanism to block pathogen access (21), it is tempting to conclude that hypomethylation of the MAPKKK20 promoter might induce its overexpression and consequent stomatal closure, thereby preventing mildew access to the leaf blade. Overall, we found associations between pathogen load and TE methylation that could act both in cis (eg. Copia TE methylation in MAPKKK20 promoter) and in trans, possibly through transposon reactivation (eg. LINE, Helitron, and Ty3/Gypsi TEs isolated from genes)." I find the whole discussion related to transposable elements, first, rather anecdotical, and second very speculative. To claim: "Overall, we found associations between pathogen load and TE methylation", I believe a more detailed analysis is needed. For instance, how often there is an association? In general, there are some rather anecdotical examples, several of which are presented as association with pathogen load on the basis of being "in proximity" to a particular region/pick. The same regions contain multiple other genes and annotations, but the authors limit the discussion to the particular gene or TE concordant with the hypothesis. This is for both the discussion and results sections.

      Here we are referring to associations in a purely statistical sense. The fact that “Overall, we found associations between pathogen load and TE methylation” is simply a conclusion drawn from Fig. 4b, without implying any causality. Some methylation variants are statistically associated with the traits (aphid or mildew loads), and whether they are true positives or causal is of course more difficult to assess.

      Regarding the methylation variants associated with mildew load in proximity of MAPKKK20, those are the only two significant ones, located close to each other and close to many other variants that, although not significant, have low P-values (Author response image 2 below), so it is the most obvious association warranting further exploration. The reviewer is correct that there are other genes flanking the large DMR that covers the TEs (Fig. 4D), but the DMR is downstream of these genes, so less likely to affect their transcription.

      Author response image 2.

      Regarding all other associations found with M. persicae load, we stated that these are not really reliable due to a skewed P-value distribution (L269, S5B Fig), but we think that for future reference it is still worth reporting the closeby genes and TEs.

      We slightly changed the wording of the passage the reviewer is citing above to make it clearer that we are only offering potential explanations for the associations we observe with TE methylation, but by no means we state that TE reactivation is surely what is happening.

      - One conclusion in the manuscript is that DMRs have been mostly the result of hypomethylation. This is shown for instance in supplementary Figure 4. However, no general statistic is shown of methylation distribution (not only restricted to DMRs). Was the ratio methylation over de-methylation proportional along the genome? Thus the finding in DMRs is out of the genome-wide distribution? Or on the contrary, the DMRs are just a random sampling of the global distribution. The same for different annotated regions. For instance, I would expect that in general coding regions would be less methylated (not restricted to DMRs).

      Complete and exhaustive analyses of the methylomes were already published in the original manuscript (Galanti et al 2022). However, the variation among these methylomes is complex and influenced by multiple factors including genetic background and environment of origin, and talking about these things would have been beyond the scope of our paper. In this paper, we just took advantage of the existing methylome information to identify the few genomic regions that are consistently differentially methylated between samples with extreme values of pest loads. As for the GWAS, the phenotypes are only partially associated with population structure, so the 20 samples with the lowest and the 20 with the highest pathogen loads are not e.g. all Swedish vs. all German but they are a mixture, which allowed us to correct for population structure running EWAS with a mixed model that includes a genetic distance matrix.

      In this study we called DMRs between two defined groups: samples with the lowest amounts of pathogen DNA (not-infected; the “control” group) vs. samples with the highest amounts of pathogens (infected or the “treatment” group), so we could define a directionality (“hyper vs. “hypo” methylation). However, this is not the case for population DMRs called between many different combinations of populations. This is why the hyper- and hypomethylated regions found here cannot be compared to the genome-wide averages, which are influenced by other factors than the pathogens. Even with relaxed thresholds we indeed found very few DMRs associated to pathogen presence here.

      Specifically about coding regions, the reviewer is correct that they are less methylated, especially because T. arvense has largely lost gene body methylation (Nunn et al. 2021, Galanti et al. 2022), but this is unrelated and was discussed in the original publication (Galanti et al. 2022).

      Minor comments:- Figure 1B: it would be good to add also percentage values.

      As the figure is already tightly packed, we rather keep it simple. As the chart gives a good impression of frequencies of different kingdoms, and the frequences of several relevant groups. Also, as explained in a previous answer, comparing different taxonomic groups could be imprecise (as opposed to comparing the same group between different samples), so exact percentages seem unnecessary. If needed, the exact percentages can still be calculated from S2 Table.

      - L159: It is not clear to me what "enemy variation" is referring to here.

      We are referring to variation in enemy densities (attack rates) in the field, that could potentially be carried over to the greenhouse to cause the patterns of infection we observed. We changed it to “variation in enemy densities” to make it more clear.

      - L259: "In accordance with previous studies (8,9), most DMRs were hypomethylated in the affected samples, indicating that genes needed for defense might be activated through demethylation". Not clear to me what "affected samples" is referring to. Samples with lower load?

      Affected samples have a higher load of pathogen reads. We changed it to “infested” to make it more clear.

      - L336. Figure should be Fig 3E.

      We fixed it, thanks for noticing.

      ADDITIONAL CHANGES

      We updated reference 43 to point to the published paper rather than the preprint.

      We corrected the phenotype names in S3 Fig, to make them consistent with the rest of the manuscript and increased font size on the axes to make it more readable.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This manuscript introduced a new behavioral apparatus to regulate the animal's behavioral state naturally. It is a thermal maze where different sectors of the maze can be set to different temperatures; once the rest area of the animal is cooled down, it will start searching for a warmer alternative region to settle down again. They recorded with silicon probes from the hippocampus in the maze and found that the incidence of SWRs was higher at the rest areas and place cells representing a rest area were preferentially active during rest-SWRs as well but not during non-REM sleep.

      We thank the reviewer for carefully reading our manuscript and providing useful and constructive comments.

      Strengths:

      The maze can have many future applications, e.g., see how the duration of waking immobility can influence learning, future memory recall, or sleep reactivation. It represents an out-of-the-box thinking to study and control less-studies aspects of the animals' behavior.

      Weaknesses:

      The impact is only within behavioral research and hippocampal electrophysiology.

      We agree with this assessment but would like to add that the intersection of electrophysiological recordings in behaving animals is a very large field. Behavioral thermoregulation is a hotly researched area also by investigators using molecular tools as well. The ThermoMaze can be used for juxtacellular/intracellular recordings in behaving animals. Restricting the animal’s movement during these recordings can improve the length of recording time and recorded single unit yield in these experiments. 

      Moreover, the fact that animals can sleep within the task can open up new possibilities to compare the role of sleep in learning without having to move the animal from a maze back into its home cage. The cooling procedure can be easily adapted to head-fixed virtual reality experiments as well.

      I have only a few questions and suggestions for future analysis if data is available.

      Comment-1: Could you observe a relationship between the duration of immobility and the preferred SWR activation of place cells coding for the current (SWR) location of the animal? In the cited O'Neill et al. paper, they found that the 'spatial selectivity' of SWR activity gradually diminished within a 2-5min period, and after about 5min, SWR activity was no longer influenced by the current location of the animal. Of course, I can imagine that overall, animals are more alert here, so even over more extended immobility periods, SWRs may recruit place cells coding for the current location of the animal.

      We thank the reviewer for raising this question, which is a fundamental issue that we attempted to address using the ThermoMaze. First, we indeed observed persistent place-specific firing of CA1 neurons for up to around 5 minutes, which was the maximal duration of each warm spot epoch, as shown by the decoding analysis (based on firing rate map templates constructed during SPW-Rs) in Figure 5C and D. However, we did not observe above-chance-level decoding of the current position of the animal during sharp-wave ripples using templates constructed during theta, which aligns with previous observation that CA1 neurons during “iSWRs” (15–30 s time windows surrounding theta oscillations) did not show significant differences in their peak firing rate inside versus outside the place field (O’Neil et al., 2006). We reasoned that this could be potentially explained by a different (although correlated, see Figure 5E) neuronal representation of space during theta and during awake SPW-R.

      Comment-2: Following the logic above, if possible, it would be interesting to compare immobility periods on the thermal maze and the home cage beyond SWRs, as it could give further insights into differences in rest states associated with different alertness levels. E.g., power spectra may show a stronger theta band or reduced delta band compared to the home cage.

      If we are correct the Reviewer would like to know whether the brain state of the animal was similar in the ThermoMaze (warm spot location) and in the home cage during immobility. A comparison of the time-evolved power spectra shows similar changes from walking to immobility in both situations without notable differences. This analysis was performed on a subset of animals (n = 17 sessions in 7 mice) that were equipped with an accelerometer (home cage behavior was not monitored by video). We detected rest epochs that lasted at least 2 seconds during wakefulness in both the home cage and ThermoMaze. Using these time points we calculated the event-triggered power spectra for the delta and theta band (±2 s around the transition time) and found no difference between the home cage and ThermoMaze (Suppl. Fig. 4D).

      Prompted by the Reviewer’s question, we further quantified the changes in LFP in the two environments. We did not find any significant change in the frequencies between 1-40 Hz during Awake periods, but we did find higher delta power (1-4 Hz) in some animals in the ThermoMaze (Suppl. Fig. 4A, B). 

      We have also quantified the delta and theta power spectra in the few cases, when the warm spot was maintained, and the animal fell asleep. The time-resolved spectra classified the brain state as NREM, similar to sleeping in the home cage. Both delta and theta power were higher in the ThermoMaze following Awake-NREM transitions (±30 seconds around the transition, Suppl. Fig. 4C). It might well be that immobility/sleep outside the mouse’s nest might reflect some minor (but important) differences but our experiments with only a single camera recording do not have the needed resolution to reveal minor differences in posture.

      We added these results to the revised Supplementary material (Suppl. Fig. 4).

      Comment-3: Was there any behavioral tracking performed on naïve animals that were placed the first time in the thermal maze? I would expect some degree of learning to take place as the animal realizes that it can find another warm zone and that it is worth settling down in that area for a while. Perhaps such a learning effect could be quantified.

      Unfortunately, we did not record videos during the first few sessions in the ThermoMaze. Typically, we transferred a naïve animal into the ThermoMaze for an hour on the first day to acclimatize them to the environment. This was performed without video analysis. In addition, because the current version of the maze is relatively small (20 x 20 cm), the animal usually walked around the edges of the maze before settling down at a heated warm spot. It appeared to us that there was only a very weak drive to learn the sequence and location of the warm spot, and therefore we did not quantified learning in the current experiment. We agree with the reviewer that in future studies, it will be interesting to explore whether the ThermoMaze could be adapted to a land-version of the Morris water maze by increasing the size of the maze and performing more controlled behavioral training and testing.

      Comment-4: There may be a mislabeling in Figure 6g because the figure does not agree with the result text - the figure compares the population vector similarly of waking SWR vs sleep SWRs to exploration vs waking SWR and exploration vs sleep SWRs.

      We thank the reviewer for raising the point, we have updated the labels accordingly.

      Reviewer #2 (Public Review):

      In this manuscript, Vöröslakos and colleagues describe a new behavioural testing apparatus called ThermoMaze, which should facilitate controlling when a mouse is exploring the environment vs. remaining immobile. The floor of the apparatus is tiled with 25 plates, which can be individually heated, whereas the rest of the environment is cooled. The mouse avoids cooled areas and stays immobile on a heated tile. The authors systematically changed the location of the heated tile to trigger the mouse's exploratory behaviours. The authors showed that if the same plate stays heated longer, the mouse falls into an NREM sleep state. The authors conclude their apparatus allows easy control of triggering behaviours such as running/exploration, immobility and NREM sleep. The authors also carried out single-unit recordings of CA1 hippocampal cells using various silicone probes. They show that the location of a mouse can be decoded with above-chance accuracy from cell activity during sharp wave ripples, which tend to occur when the mouse is immobile or asleep. The authors suggest that consistent with some previous results, SPW-Rs encode the mouse's current location and any other information they may encode (such as past and future locations, usually associated with them).

      We thank the reviewer for carefully reading our manuscript and providing useful and constructive comments.

      Strengths:

      Overall, the apparatus may open fruitful avenues for future research to uncover the physiology of transitions from different behavioural states such as locomotion, immobility, and sleep. The setup is compatible with neural recordings. No training is required.

      Weaknesses:

      I have a few concerns related to the authors' methodology and some limitations of the apparatus's current form. Although the authors suggest that switching between the plates forces animal behaviour into an exploratory mode, leading to a better sampling of the enclosure, their example position heat maps and trajectories suggest that the behaviour is still very stereotypical, restricted mostly to the trajectories along the walls or the diagonal ones (between two opposite corners). This may not be ideal for studying spatial responses known to be affected by the stereotypicity of the animal's trajectories. Moreover, given such stereotypicity of the trajectories mice take before and after reaching a specific plate, it may be that the stable activity of SWR-P ripples used for decoding different quadrants may be representing future and/or past trajectories rather than the current locations suggested by the authors. If this is the case, it may be confusing/misleading to call such activity ' place-selective firing', since they don't necessarily encode a given place per se (line 281).

      We agree with the reviewer that the current version of the ThermoMaze does not necessarily motivate the mice to sample the entire maze during warm spot transitions. However, we did show correlational evidence that neuronal firing during awake sharp-wave ripples is place-selective. Both firing rate ratios and population vectors of CA1 neurons showed a reliable correlation between those during movement and awake sharp-wave ripples (Figure 5 E and F), indicating that spatial coding during movement persists into awake SWR-P state. This finding rejects the hypothesis that neuronal firing during ripples throughout the Cooling sub-session encodes past/future trajectories, which could be explained by a lack of goal-directed behavior in order to perform the task. We hope to test whether such place-specific firing during ripples can be causally involved in maintaining an egocentric representation of space in a future study.

      Besides, we have attempted to motivate the animal to visit the center of the maze during the Cooling sub-session. Moving the location of warm spots from the corners can shape the animals’ behavior and promote more exploration of the environment as we show in Suppl. Fig. 5. We agree with the Reviewer that the current size of the ThermoMaze poses these limitations. However, an example future application could be to warm the floor of a radial-arm maze by heating Peltier elements at the ends of maze arms and center in an otherwise cold room, allowing the experimenter to induce ambulation in the 1-dimensional arms, followed by extended immobility and sleep at designated areas.

      Another main study limitation is the reported instability of the location cells in the Thermomaze. This may be related to the heating procedure, differences in stereotypical sampling of the enclosure, or the enclosure size (too small to properly reveal the place code). It would be helpful if the authors separate pyramidal cells into place and non-place cells to better understand how stable place cell activity is. This information may also help to disambiguate the SPW-R-related limitations outlined above and may help to solve the poor decoding problem reported by the authors (lines 218-221).

      The ThermoMaze is a relatively small enclosure (20 x 20 cm) compared to typical 2D arenas (60 x 60 cm) used in hippocampal spatial studies. Due to the small environment, one possibility is that CA1 neurons encode less spatial information and only a small number of place cells could be found. Therefore, we identified place cells in each sub-session. We found 40.90%, 45.32%, and 41.26% of pyramidal cells to be place cells in the Pre-cooling, Cooling, and Post-cooling sub-sessions, respectively. Furthermore, we found on average 17.36% of pyramidal neurons pass the place cell criteria in all three sub-sessions in a daily session. Therefore, the strong decorrelation of spatial firing maps across sub-sessions cannot be explained by poor recording quality or weak neuronal encoding of spatial information but is potentially due to changes in environmental conditions.

      Some additional points/queries:

      Comment-1: Since the authors managed to induce sleeping on the warm pads during the prolonged stays, can they check their hypothesis that the difference in the mean ripple peak frequency (Fig. 4D) between the home cage and Thermomaze was due to the sleep vs. non-sleep states?

      In response to the reviewer’s comment, we compared the ripple peak frequency that occurred during wakefulness and NREM epochs in the home cage and ThermoMaze (n = 7 sessions in 4 mice). We found that the peak frequency of the awake ripples was higher compared to both home cage and ThermoMaze NREM sleep (one-way ANOVA with Tukey’s posthoc test, ripple frequencies were: 171.63 ± 11.69, 172.21 ± 11.86, 168.19 ± 11.10 and 168.26 ± 11.08 Hz mean±SD for home cage awake, ThermoMaze awake, home cage NREM and ThermoMaze NREM conditions, p < 0.001 between awake and NREM states). We added this quantification to the revised manuscript.

      Author response image 1.

      NREM sleep either in home cage or in ThermoMaze affects ripple mean peak frequency similarly.

      Comment-2: How many cells per mouse were recorded? How many of them were place cells? How many place cells at the same time on average? What are the place field size, peak, and mean firing rate distributions in these various conditions? It would be helpful if they could report this.

      For each animal on a given day, the average number of cells recorded was 57.5, which depended on the electrodes and duration after implantation. We first applied peak firing rate and spatial information thresholds to identify place cells in each sub-session (see more details in the revised Methods section for place cell definition). We found 40.90%, 45.32%, and 41.26% of pyramidal cells to be place cells in the Pre-cooling, Cooling, and Post-cooling sub-sessions respectively. Furthermore, we found on average 17.36% of pyramidal neurons pass the place cell criteria in all three sub-sessions in a daily session.

      For place cells identified in each sub-session, their place fields size is on average 61.03, 79.86, and 57.51 cm2 (standard deviation = 60.13, 69.98, and 49.64 cm2; Pre-cooling, Cooling, and Post-cooling correspondingly). A place field was defined to be a contiguous region of at least 20 cm2 (20 spatial bins) in which the firing rate was above 60% of the peak firing rate of the cell in the maze (Roux and Buzsaki et al., 2017). A place field also needs to contain at least one bin above 80% of the peak firing rate in the maze. With such definition, the average place field peak firing rate is 5.84, 5.22, and 6.48 Hz (standard deviation = 5.11, 4.65, and 5.83 Hz) and the average mean firing rate within the place fields is 4.54, 4.05, and 5.07 Hz (standard deviation = 4.00, 3.60, and 4.60).

      We would like to point out that these values depend strongly on the definition of place fields, which vary widely across studies. We reason that the ThermoMaze paradigm induced place field remapping which has been reported to occur upon changes in the environment such as visual cues (Leutgeb et al., 2009). We hypothesize that temperature gradient is an important aspect among the environmental cues, thus remapping is expected. Overall, we did not aim for biological discoveries in the first presentation of the ThermoMaze. Instead, our limited goal was the detailed description of the method and its validation for behavioral and physiological experiments.

      References

      (1) Mizuseki K, Royer S, Diba K, Buzsáki G. Activity dynamics and behavioral correlates of CA3 and CA1 hippocampal pyramidal neurons. Hippocampus. 2012 Aug;22(8):1659-80. doi: 10.1002/hipo.22002. Epub 2012 Feb 27. PMID: 22367959; PMCID: PMC3718552.

      (2) Skaggs WE,McNaughton BL,Gothard KM,Markus EJ. 1993. An information-theoretic approach to deciphering the hippocampal code. In: SJ Hanson, JD Cowan, CL Giles, editors. Advances in Neural Information Processing Systems, Vol. 5. San Francisco, CA: Morgan Kaufmann. pp 1030–1037.

      (3) Roux L, Hu B, Eichler R, Stark E, Buzsáki G. Sharp wave ripples during learning stabilize the hippocampal spatial map. Nat Neurosci. 2017 Jun;20(6):845-853. doi: 10.1038/nn.4543. Epub 2017 Apr 10. PMID: 28394323; PMCID: PMC5446786.

      (4) Markus, E.J., Barnes, C.A., McNaughton, B.L., Gladden, V.L. & Skaggs, W.E. Spatial information content and reliability of hippocampal CA1 neurons: effects of visual input. Hippocampus 4, 410–421 (1994).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      We thank the reviewers for the detailed assessment of our work as well as their praise and constructive feedback which helped us to significantly improve our manuscript.

      Reviewer #1 (Public Review):

      The inferior colliculus (IC) is the central auditory system's major hub. It integrates ascending brainstem signals to provide acoustic information to the auditory thalamus. The superficial layers of the IC ("shell" IC regions as defined in the current manuscript) also receive a massive descending projection from the auditory cortex. This auditory cortico-collicular pathway has long fascinated the hearing field, as it may provide a route to funnel "high-level" cortical signals and impart behavioral salience upon an otherwise behaviorally agnostic midbrain circuit.

      Accordingly, IC neurons can respond differently to the same sound depending on whether animals engage in a behavioral task (Ryan and Miller 1977; Ryan et al., 1984; Slee & David, 2015; Saderi et al., 2021; De Franceschi & Barkat, 2021). Many studies also report a rich variety of non-auditory responses in the IC, far beyond the simple acoustic responses one expects to find in a "low-level" region (Sakurai, 1990; Metzger et al., 2006; Porter et al., 2007). A tacit assumption is that the behaviorally relevant activity of IC neurons is inherited from the auditory cortico-collicular pathway. However, this assumption has never been tested, owing to two main limitations of past studies:

      (1) Prior studies could not confirm if data were obtained from IC neurons that receive monosynaptic input from the auditory cortex.

      (2) Many studies have tested how auditory cortical inactivation impacts IC neuron activity; the consequence of cortical silencing is sometimes quite modest. However, all prior inactivation studies were conducted in anesthetized or passively listening animals. These conditions may not fully engage the auditory cortico-collicular pathway. Moreover, the extent of cortical inactivation in prior studies was sometimes ambiguous, which complicates interpreting modest or negative results.

      Here, the authors' goal is to directly test if auditory cortex is necessary for behaviorally relevant activity in IC neurons. They conclude that surprisingly, task relevant activity in cortico-recipient IC neuron persists in absence of auditory cortico-collicular transmission. To this end, a major strength of the paper is that the authors combine a sound-detection behavior with clever approaches that unambiguously overcome the limitations of past studies.

      First, the authors inject a transsynaptic virus into the auditory cortex, thereby expressing a genetically encoded calcium indicator in the auditory cortex's postsynaptic targets in the IC. This powerful approach enables 2-photon Ca2+ imaging from IC neurons that unambiguously receive monosynaptic input from auditory cortex. Thus, any effect of cortical silencing should be maximally observable in this neuronal population. Second, they abrogate auditory cortico-collicular transmission using lesions of auditory cortex. This "sledgehammer" approach is arguably the most direct test of whether cortico-recipient IC neurons will continue to encode task-relevant information in absence of descending feedback. Indeed, their method circumvents the known limitations of more modern optogenetic or chemogenetic silencing, e.g. variable efficacy.

      I also see three weaknesses which limit what we can learn from the authors' hard work, at least in the current form. I want to emphasize that these issues do not reflect any fatal flaw of the approach. Rather, I believe that their datasets likely contain the treasure-trove of knowledge required to completely support their claims.

      (1) The conclusion of this paper requires the following assumption to be true: That the difference in neural activity between Hit and Miss trials reflects "information beyond the physical attributes of sound." The data presentation complicates asserting this assumption. Specifically, they average fluorescence transients of all Hit and all Miss trials in their detection task. Yet, Figure 3B shows that mice's d' depends on sound level, and since this is a detection task the smaller d' at low SPLs presumably reflects lower Hit rates (and thus higher Miss rates). As currently written, it is not clear if fluorescence traces for Hits arise from trials where the sound cue was played at a higher sound level than on Miss trials. Thus, the difference in neural activity on Hit and Miss trials could indeed reflect mice's behavior (licking or not licking). But in principle could also be explained by higher sound-evoked spike rates on Hit compared to Miss trials, simply due to louder click sounds. Indeed, the amplitude and decay tau of their indicator GCaMP6f is non-linearly dependent on the number and rate of spikes (Chen et al., 2013), so this isn't an unreasonable concern.

      (2) The authors' central claim effectively rests upon two analyses in Figures 5 and 6. The spectral clustering algorithm of Figure 5 identifies 10 separate activity patterns in IC neurons of control and lesioned mice; most of these clusters show distinct activity on averaged Hit and Miss trials. They conclude that although the proportions of neurons from control and lesioned mice in certain clusters deviates from an expected 50/50 split, neurons from lesioned mice are still represented in all clusters. A significant issue here is that in addition to averaging all Hits and Miss trials together, the data from control and lesioned mice are lumped for the clustering. There is no direct comparison of neural activity between the two groups, so the reader must rely on interpreting a row of pie charts to assess the conclusion. It's unclear how similar task relevant activity is between control and lesioned mice; we don't even have a ballpark estimate of how auditory cortex does or does not contribute to task relevant activity. Although ideally the authors would have approached this by repeatedly imaging the same IC neurons before and after lesioning auditory cortex, this within-subjects design may be unfeasible if lesions interfere with task retention. Nevertheless, they have recordings from hundreds to thousands of neurons across two groups, so even a small effect should be observable in a between-groups comparison.

      (3) In Figure 6, the authors show that logistic regression models predict whether the trial is a Hit or Miss from their fluorescence data. Classification accuracy peaks rapidly following sound presentation, implying substantial information regarding mice's actions. The authors further show that classification accuracy is reduced, but still above chance in mice with auditory cortical lesions. The authors conclude from this analysis task relevant activity persists in absence of auditory cortex. In principle I do not disagree with their conclusion.

      The weakness here is in the details. First, the reduction in classification accuracy of lesioned mice suggests that auditory cortex does nevertheless transmit some task relevant information, however minor it may be. I feel that as written, their narrative does not adequately highlight this finding. Rather one could argue that their results suggest redundant sources of task-relevant activity converging in the IC. Secondly, the authors conclude that decoding accuracy is impaired more in partially compared to fully lesioned mice. They admit that this conclusion is at face value counterintuitive, and provide compelling mechanistic arguments in the Discussion. However, aside from shaded 95% CIs, we have no estimate of variance in decoding accuracy across sessions or subjects for either control or lesioned mice. Thus we don't know if the small sample sizes of partial (n = 3) and full lesion (n = 4) groups adequately sample from the underlying population. Their result of Figure 6B may reflect spurious sampling from tail ends of the distributions, rather than a true non-monotonic effect of lesion size on task relevant activity in IC.

      Our responses to the ‘recommendations for the authors’ below lay out in detail how we addressed each comment and concern. Besides filling in key information about how our original analysis aimed at minimizing any potential impact of differences in sound level distributions - namely that trials used for decoding were limited to a subset of sound levels - and which was accidentally omitted in the original manuscript, we have now carried out several additional analyses.

      We would like to highlight one of these because it supplements both the clustering and decoding analysis that we conducted to compare hit and miss trial activity, and directly addresses what the reviewer identified as our work’s main weakness (a possible confound between animal behavior and sound level distributions) and the request for an analysis that operates at the level of single units rather than the population level. Specifically, we assessed, separately for each recorded neuron, whether there was a statistically significant difference in the magnitude of neural activity between hit and miss trials. This approach allowed us to fully balance the numbers of hit and miss trials at each sound level that were entered into the analysis. The results revealed that a large proportion (close to 50%) of units were task modulated, i.e. had significantly different response magnitudes between hit and miss trials, and that this proportion was not significantly different between lesioned and non-lesioned mice. We hope that this, together with the rest of our responses, convincingly demonstrates that the shell of the IC encodes mouse sound detection behavior even when top-down input from the auditory cortex is absent.

      Reviewer #2 (Public Review):

      Summary:

      This study takes a new approach to studying the role of corticofugal projections from auditory cortex to inferior colliculus. The authors performed two-photon imaging of cortico-recipient IC neurons during a click detection task in mice with and without lesions of auditory cortex. In both groups of animals, they observed similar task performance and relatively small differences in the encoding of task-response variables in the IC population. They conclude that non-cortical inputs to the IC provide can substantial task-related modulation, at least when AC is absent. Strengths:

      This study provides valuable new insight into big and challenging questions around top-down modulation of activity in the IC. The approach here is novel and appears to have been executed thoughtfully. Thus, it should be of interest to the community.

      Weaknesses: There are, however, substantial concerns about the interpretation of the findings and limitations to the current analysis. In particular, Analysis of single unit activity is absent, making interpretation of population clusters and decoding less interpretable. These concerns should be addressed to make sure that the results can be interpreted clearly in an active field that already contains a number of confusing and possibly contradictory findings.

      Our responses to the ‘recommendations for the authors’ below lay out in detail how we addressed each comment and concern. Several additional analyses have now been carried out including ones that operate at the level of single units rather than the population level, as requested by the reviewer. We would like to briefly highlight one here because it supplements both the clustering and decoding analysis that we conducted to compare hit and miss trial activity and directly addresses what the other reviewers identified as our work’s main weakness (a possible confound between animal behavior and sound level distributions). Specifically, we assessed, separately for each recorded neuron, whether there was a statistically significant difference in the magnitude of neural activity between hit and miss trials. This approach allowed us to fully balance the numbers of hit and miss trials at each sound level that were entered into the analysis. The results revealed that a large proportion (close to 50%) of units were task modulated, i.e. had significantly different response magnitudes between hit and miss trials, and that this proportion was not significantly different between lesioned and non-lesioned mice. We hope that this, together with the rest of our responses, convincingly demonstrates that the shell of the IC encodes mouse sound detection behavior even when top-down input from the auditory cortex is absent.

      Reviewer #3 (Public Review):

      Summary:

      This study aims to demonstrate that cortical feedback is not necessary to signal behavioral outcome to shell neurons of the inferior colliculus during a sound detection task. The demonstration is achieved by the observation of the activity of cortico-recipient neurons in animals which have received lesions of the auditory cortex. The experiment shows that neither behavior performance nor neuronal responses are significantly impacted by cortical lesions except for the case of partial lesions which seem to have a disruptive effect on behavioral outcome signaling. Strengths:

      The experimental procedure is based on state of the art methods. There is an in depth discussion of the different effects of auditory cortical lesions on sound detection behavior. Weaknesses:

      The analysis is not documented enough to be correctly evaluated. Have the authors pooled together trials with different sound levels for the key hit vs miss decoding/clustering analysis? If so, the conclusions are not well supported, as there are more misses for low sound levels, which would completely bias the outcome of the analysis. It would possible that the classification of hit versus misses actually only reflects a decoding of sound level based on sensory responses in the colliculus, and it would not be surprising then that in the presence or absence of cortical feedback, some neurons responds more to higher sound levels (hits) and less to lower sound levels (misses). It is important that the authors clarify and in any case perform an analysis in which the classification of hits vs misses is done only for the same sound levels. The description of feedback signals could be more detailed although it is difficult to achieve good temporal resolution with the calcium imaging technique necessary for targeting cortico-recipient neurons.

      Our responses to the ‘recommendations for the authors’ below lay out in detail how we addressed each comment and concern. Besides filling in key information about how our original analysis aimed at minimizing any potential impact of differences in sound level distributions - namely that trials used for decoding were limited to a subset of sound levels - and which was accidentally omitted in the original manuscript, we have now carried out several additional analyses to directly address what the reviewer identified as our work’s main weakness (a possible confound between animal behavior and sound level distributions). This includes an analysis in which we were able to demonstrate for one imaging session with a sufficiently large number of trials that limiting the trials entered into the decoding analysis to those from a single sound level did not meaningfully impact decoding accuracy. We would like to highlight another new analysis here because it supplements both the clustering and decoding analyses that we conducted to compare hit and miss trial activity and addresses the other reviewers’ request for an analysis that operates at the level of single units rather than the population level. Specifically, we assessed, separately for each recorded neuron, whether there was a statistically significant difference in the magnitude of neural activity between hit and miss trials. This approach allowed us to fully balance the numbers of hit and miss trials at each sound level that were entered into the analysis. The results revealed that a large proportion (close to 50%) of units were task modulated, i.e. had significantly different response magnitudes between hit and miss trials, and that this proportion was not significantly different between lesioned and non-lesioned mice. We hope that this, together with the rest of our responses, convincingly demonstrates that the shell of the IC encodes mouse sound detection behavior even when top-down input from the auditory cortex is absent.

      Reviewer #1 (Recommendations For The Authors):

      Thank you for the opportunity to read your paper. I think the conclusion is exciting. Indeed, you indicate that perhaps contrary to many of our (untested) assumptions, task-relevant activity in the IC may persist in absence of auditory cortex.

      As mentioned in my public review: Despite my interest in the work, I also think that there are several opportunities to significantly strengthen your conclusions. I feel this point is important because your work will likely guide the efforts of future students and post-docs working on this topic. The data can serve as a beacon to move the field away from the (somewhat naïve) idea that the evolved forebrain imparts behavioral relevance upon an otherwise uncivilized midbrain. This knowledge will inspire a search for alternative explanations. Indeed, although you don't highlight it in your narrative, your results dovetail nicely with several studies showing task-relevant activity in more ventral midbrain areas that project to the IC (e.g., pedunculopontine nuclei; see work from Hikosaka in monkeys, and more recently in mice from Karel Svoboda's lab).

      Thanks for the kind words.

      These studies, in particular the work by Inagaki et al. (2022) outlining how the transformation of an auditory go signal into movement could be mediated via a circuit involving the PPN/MRN (which might rely on the NLL for auditory input) and the motor thalamus, are indeed highly relevant.

      We made the following changes to the manuscript text.

      Line 472:”...or that the auditory midbrain, thalamus and cortex are bypassed entirely if simple acousticomotor transformations, such as licking a spout in response to a sound, are handled by circuits linking the auditory brainstem and motor thalamus via pedunculopontine and midbrain reticular nuclei (Inagaki et al., 2022).”

      The beauty of the eLife experiment is that you are free to incorporate or ignore these suggestions. After all, it's your paper, not mine. Nevertheless, I hope you find my comments useful.<br /> First, a few suggestions to address my three comments in the public review.

      Suggestion for public comment #1: An easy way to address this issue is to average the neural activity separately for each trial outcome at each sound level. That way you can measure if fluorescence amplitude (or integral) varies as a function of mice's action rather than sound level. This approach to data organization would also open the door to the additional analyses for addressing comment #2, such as directly comparing auditory and putatively non-auditory activity in neurons recorded from control and lesioned mice.

      We have carried out additional analyses for distinguishing between the two alternative explanations of the data put forward by the reviewer: That the difference in neural activity between hit and miss trials reflects a) behavior or b) sound level (more precisely: differences in response magnitude arising from a higher proportion of high-sound-level trials in the hit trial group than in the miss trial group). If the data favored b), we would expect no difference in activity between hit and miss trials when plotted separately for each sound level. The new Figure 4 - figure supplement 1 indicates that this is not the case. Hit and miss trial activity are clearly distinct even when plotted separately for different sound levels, confirming that this difference in activity reflects the animals’ behavior rather than sensory information.

      Changes to manuscript.

      Line 214: “While averaging across all neurons cannot capture the diversity of responses, the averaged response profiles suggest that it is mostly trial outcome rather than the acoustic stimulus and neuronal sensitivity to sound level that shapes those responses (Figure 4 – figure supplement 1).”

      Additionally, we assessed for each neuron separately whether there was a significant difference between hit and miss trial activity and therefore whether the activity of the neuron could be considered “task-modulated”. To achieve this, we used equal numbers of hit and miss trials at each sound level to ensure balanced sound level distributions and thus rule out any potential confound between sound level distributions and trial outcome. This analysis revealed that the proportion of task-modulated neurons was very high (close to 50%) and not significantly different between lesioned and non-lesioned mice (Figure 6 - figure supplement 3).

      Changes to the manuscript.

      Line 217: “Indeed, close to half (1272 / 2649) of all neurons showed a statistically significant difference in response magnitude between hit and miss trials…”

      Line 307: “Although the proportion of individual neurons with distinct response magnitudes in hit and miss trials in lesioned mice did not differ from that in non-lesioned mice, it was significantly lower when separating out mice with partial lesions (Figure 6 – figure supplement 3).”

      Differences in the distributions of sound levels in the different trial types could also potentially confound the decoding into hit and miss trials. Our original analysis was actually designed to take this into account but, unfortunately, we failed to include sufficient details in the methods section.

      Changes to the manuscript.

      Line 710: “Rather than including all the trials in a given session, only trials of intermediate difficulty were used for the decoding analysis. More specifically, we only included trials across five sound levels, comprising the lowest sound level that exceeded a d’ of 1.5 plus the two sound levels below and above that level. That ensured that differences in sound level distributions would be small, while still giving us a sufficient number of trials to perform the decoding analysis.“

      In this context, it is worth bearing in mind that a) the decoding analysis was done on a frame-byframe basis, meaning that the decoding score achieved early in the trial has no impact on the decoding score at later time points in the trial, b) sound-driven activity predominantly occurs immediately after stimulus onset and is largely over about 1 s into the trial (see cluster 3, for instance, or average miss trial activity in Figure 4 – figure supplement 1), c) decoding performance of the behavioral outcome starts to plateau 500-1000 ms into the trial and remains high until it very gradually begins to decline after about 2 s into the trial. In other words, decoding performance remains high far longer than the stimulus would be expected to have an impact on the neurons’ activity. Therefore, we would expect any residual bias due to differences in the sound level distribution that our approach did not control for to be restricted to the very beginning of the trial and not to meaningfully impact the conclusions derived from the decoding analysis.

      Finally, we carried out an additional decoding analysis for one imaging session in which we had a sufficient number of trials to perform the analysis not only over the five (59, 62, 65, 68, 71 dB SPL) original sound levels, but also over a reduced range of three (62, 65, 68 dB SPL) sound levels, as well as a single (65 dB SPL) sound level (Figure 6 - figure supplement 1). The mean sound level differences between the hit trial distributions and miss trial distributions for these three conditions were 3.08, 1.01 and 0 dB, respectively. This analysis suggests that decoding performance is not meaningfully impacted by changing the range of sound levels (and sound level distributions), other than that including fewer sound levels means fewer trials and thus noisier decoding.

      Changes to manuscript.

      Line 287: ”...and was not meaningfully affected by differences in sound level distributions between hit and miss trials (Figure 6 – figure supplement 1).”

      Suggestion for public comment #2: Perhaps a solution would be to display example neuron activity in each cluster, recorded in control and lesioned mice. The reader could then visually compare example data from the two groups, and immediately grasp the conclusion that task relevant activity remains in absence of auditory cortex. Additionally, one possibility might be to calculate the difference in neural activity between Hit and Miss trials for each task-modulated neuron. Then, you could compare these values for neurons recorded in control and lesion mice. I feel like this information would greatly add to our understanding of cortico-collicular processing.

      I would also argue that it's perhaps more informative to show one (or a few) example recordings rather than averaging across all cells in a cluster. Example cells would give the reader a better handle on the quality of the imaging, and this approach is more standard in the field. Finally, it would be useful to show the y axis calibration for each example trace (e.g. Figure 5 supp 1). That is also pretty standard so we can immediately grasp the magnitude of the recorded signal.

      We agree that while the information we provided shows that neurons from lesioned and nonlesioned groups are roughly equally represented across the clusters, it does not allow the reader to appreciate how similar the activity profiles of neurons are from each of the two groups. However, picking examples can be highly subjective and thus potentially open to bias. We therefore opted instead to display, separately for lesioned and non-lesioned mice, the peristimulus time histograms of all neurons in each cluster, as well as the cluster averages of the response profiles (Figure 5 - figure supplement 3). This, we believe, convincingly illustrates the close correspondence between neural activity in lesioned and non-lesioned mice across different clusters. All our existing and new figures indicate the response magnitude either on the figures’ y-axis or via scale/color bars.

      Changes to manuscript.

      Line 254: “Furthermore, there was a close correspondence between the cluster averages of lesioned and non-lesioned mice (Figure 5 – figure supplement 3).”

      Furthermore, we’ve now included a video of the imaging data which, we believe, gives the reader a much better handle on the data quality than further example response profiles would.

      Changes to manuscript.

      Line 197: ”...using two-photon microscopy (Figure 4B, Video 1).”

      Suggestion for public comment #3: In absence of laborious and costly follow-up experiments to boost the sample size of partial and complete lesion groups, it may be more prudent to simply tone down the claims that lesion size differentially impacts decoding accuracy. The results of this analysis are not necessary for your main claims.

      Our new results on the proportions of ‘task-modulated’ neurons (Figure 6 - figure supplement 3) across different experimental groups show that there is no difference between non-lesioned and lesioned mice as a whole, but mice with partial lesions have a smaller proportion of taskmodulated neurons than the other two groups. While this corroborates the results of the decoding analysis, we certainly agree that the small sample size is a caveat that needs to be acknowledged.

      Changes to manuscript.

      Line 477: ”Some differences were observed for mice with only partial lesions of the auditory cortex.

      Those mice had a lower proportion of neurons with distinct response magnitudes in hit and miss trials than mice with (near-)complete lesions. Furthermore, trial outcomes could be read out with lower accuracy from these mice. While this finding is somewhat counterintuitive and is based on only three mice with partial lesions, it has been observed before that smaller lesions…”

      A few more suggestions unrelated to public review:

      Figure 1: This is somewhat of an oddball in this manuscript, and its inclusion is not necessary for the main point. Indeed, the major conclusion of Fig 1 is that acute silencing of auditory cortex impairs task performance, and thus optogenetic methods are not suitable to test your hypothesis. However, this conclusion is also easily supported from decades of prior work, and thus citations might suffice.

      We do not agree that these data can easily be substituted with citations of prior published work. While previous studies (Talwar et al., 2001, Li et al., 2017) have demonstrated the impact of acute pharmacological silencing on sound detection in rodents, pharmacological and optogenetic silencing are not equivalent. Furthermore, we are aware of only one published study (Kato et al., 2015) that investigated the impact of optogenetically perturbing auditory cortex on sound detection (others have investigated its impact on discrimination tasks). Kato et al. (2015) examined the effect of acute optogenetic silencing of auditory cortex on the ability of mice to detect the offsets of very long (5-9 seconds) sounds, which is not easily comparable to the click detection task employed by us. Furthermore, when presenting our work at a recent meeting and leaving out the optogenetics results due to time constraints, audience members immediately enquired whether we had tried an optogenetic manipulation instead of lesions. Therefore, we believe that these data represent a valuable piece of information that will be appreciated by many readers and have decided not to remove them from the manuscript.

      A worst case scenario is that Figure 1 will detract from the reader's assessment of experimental rigor. The data of 1C are pooled from multiple sessions in three mice. It is not clear if the signed-rank test compares performance across n = 3 mice or n = 13 sessions. If the latter, a stats nitpicker could argue that the significance might not hold up with a nested analysis considering that some datapoints are not independent of one another. Finally, the experiment does not include a control group, gad2-cre mice injected with a EYFP virus. So as presented, the data are equally compatible with the pessimistic conclusion that shining light into the brain impairs mice's licking. My suggestion is to simply remove Figure 1 from the paper. Starting off with Figure 3 would be stronger, as the rest of the study hinges upon the knowledge that control and lesion mice's behavior is similar.

      Instead of reporting the results session-wise and doing stats on the d’ values, we now report results per mouse and perform stats on the proportions of hits and false alarms separately for each mouse. The results are statistically significant for each mouse and suggest that the differences in d’ are primarily caused by higher false alarm rates during the optogenetic perturbation than in the control condition.

      Changes to manuscript.

      New Figure 1.

      We agree that including control mice not expressing ChR2 would be important for fully characterizing the optogenetic manipulation and that the lack of this control group should be acknowledged. However, in the context of this study, the outcome of performing this additional experiment would be inconsequential. We originally considered using an optogenetic approach to explore the contribution of cortical activity to IC responses, but found that this altered the animals’ sound detection behavior. Whether that change in behavior is due to activation of the opsin or simply due to light being shone on the brain has no bearing on the conclusion that this type of manipulation is unsuitable for determining whether auditory cortex is required for the choice-related activity that we recorded in the IC.

      Changes to manuscript.

      Line 106: ”Although a control group in which the auditory cortex was injected with an EYFP virus lacking ChR2 would be required to confirm that the altered behavior results from an opsindependent perturbation of cortical activity, this result shows that this manipulation is also unsuitable… ”

      Figure 2, comment #1: The micrograph of panel B shows the densest fluorescence in the central IC. You interpret this as evidence of retrograde labeling of central IC neurons that project to the shell IC. This is a nice finding, but perhaps a more relevant micrograph would be to show the actual injection site in the shell layers. The rest of Figure 2 documents the non-auditory cortical sources of forebrain feedback. Since non-auditory cortical neurons may or may not target distinct shell IC sub-circuits, it's important to know where the retrograde virus was injected. Stylistic comment: The flow of the panels is somewhat unorthodox. Panel A and B follow horizontally, then C and D follow vertically, followed by E-H in a separate column. Consider sequencing either horizontally or vertically to maximize the reader's experience.

      Figure 2, comment # 2: It would also be useful to show more rostral sections from these mice, perhaps as a figure supplement, if you have the data. I think there is a lot of value here given a recent paper (Olthof et al., 2019 Jneuro) arguing that the IC receives corticofugal input from areas more rostral to the auditory cortex. So it would be beneficial for the field to know if these other cortical sources do or do not represent likely candidates for behavioral modulation in absence of auditory cortex.

      Figure 2, comment #3: You have a striking cluster of retrogradely labeled PPC neurons, and I'm not sure PPC has been consistently reported as targeting the IC. It would be good to confirm that this is a "true" IC projection as opposed to viral leakage into the SC. Indeed, Figure 2, supplement 2 also shows some visual cortex neurons that are retrogradely labeled. This has bearing on the interpretations, because choice-related activity is rampant in PPC, and thus could be a potential source of the task relevant activity that persists in your recordings. This could be addressed as the point above, by showing the SC sections from these same mice.

      All IC injections were made under visual guidance with the surface of the IC and adjacent brain areas fully exposed after removal of the imaging window. Targeting the IC and steering clear of surrounding structures, including the SC, was therefore relatively straightforward.

      We typically observed strong retrograde labeling in the central nucleus after viral injections into the dorsal IC and, given the moderate injection volume (~50 nL at each of up to three sites), it was also typical to see spatially fairly confined labeling at the injection sites. For the mouse shown in Figure 2, we do not have further images of the IC. This was one of the earliest mice to be included in the study and we did not have access to an automatic slide scanner at the time. We had to acquire confocal images in a ‘manual’ and very time-consuming manner and therefore did not take further IC images for this mouse. We have now included, however, a set of images spanning the whole IC and the adjacent SC sections for the mouse for which we already show sections in Figure 2 - figure supplement 2. These were added as Figure 2 - figure supplement 3A to the manuscript. These images show that the injections were located in the caudal half of the IC and that there was no spillover into the SC - close inspection of those sections did not reveal any labeled cell bodies in the SC. Furthermore, we include as Figure 2 - figure supplement 3B a dozen additional rostral cortical sections of the same mouse illustrating corticocollicular neurons in regions spanning visual, parietal, somatosensory and motor cortex. Given the inclusion of the IC micrographs in the new supplementary figure, we removed panel B from Figure 2. This should also make it easier for the reader to follow the sequencing of the remaining panels.

      Changes to manuscript.

      New Figure 2 - figure supplement 3.

      Line 159: “After the experiments, we injected a retrogradely-transported viral tracer (rAAV2-retrotdTomato) into the right IC to determine whether any corticocollicular neurons remained after the auditory cortex lesions (Figure 2, Figure 2 – figure supplement 2, Figure 2 – figure supplement 3). The presence of retrogradely-labeled corticocollicular neurons in non-temporal cortical areas (Figure 2) was not the result of viral leakage from the dorsal IC injection sites into the superior colliculus (Figure 2 – figure supplement 3).”

      Line 495: “...projections to the IC, such as those originating from somatosensory cortical areas (Lohse et al., 2021; Lesicko et al., 2016) and parietal cortex may have contributed to the response profiles that we observed.

      Figure 5 (see also public review point #2): I am not convinced that this unsupervised method yields particularly meaningful clusters; a grain of salt should be provided to the reader. For example, Clusters 2, 5, 6, and 7 contain neurons that pretty clearly respond with either short latency excitation or inhibition following the click sound on Hits. I would argue that neurons with such diametrically opposite responses should not be "classified" together. You can see the same issue in some of Namboodiri/Stuber's clustering (their Figure 1). It might be useful to make it clear to the reader that these clusters can reflect idiosyncrasies of the algorithm, the behavior task structure, or both.

      We agree.

      Changes to manuscript.

      Line 666: “While clustering is a useful approach for organizing and visualizing the activity of large and heterogeneous populations of neurons, we need to be mindful that, given continuous distributions of response properties, the locations of cluster boundaries can be somewhat arbitrary and/or reflect idiosyncrasies of the chosen method and thus vary from one algorithm to another. We employed an approach very similar to that described in Namboodiri et al. (2019) because it is thought to produce stable results in high-dimensional neural data (Hirokawa et al. 2019).”

      Methods:

      How was a "false alarm" defined? Is it any lick happening during the entire catch trial, or only during the time period corresponding to the response window on stimulus trials?

      The response window was identical for catch and stimulus trials and a false alarm was defined as licking during the response window of a catch trial.

      Changes to manuscript.

      Line 598: “During catch trials, neither licking (‘false alarm’) during the 1.5-second response window …”

      L597 and so forth: What's the denominator in the conversion from the raw fluorescence traces into DF/F? Did you take the median or mode fluorescence across a chunk of time? Baseline subtract average fluorescence prior to click onset? Similarly, please provide some more clarification as to how neuropil subtraction was achieved. This information will help us understand how the classifier can decode trial outcome from data prior to sound onset.

      Signal processing did not involve the subtraction of a pre-stimulus period.

      Changes to manuscript.

      Line 629: ”Neuropil extraction was performed using default suite2p parameters (https://suite2p.readthedocs.io/en/latest/settings.html), neuropil correction was done using a coefficient of 0.7, and calcium ΔF/F signals were obtained by using the median over the entire fluorescence trace as F0. To remove slow fluctuations in the signal, a baseline of each neuron’s entire trace was calculated by Gaussian filtering in addition to minimum and maximum filtering using default suite2p parameters. This baseline was then subtracted from the signal.”

      Was the experimenter blinded to the treatment group during the behavior experiments? If not, were there issues that precluded blinding (limited staffing owing to lab capacity restrictions during the pandemic)? This is important to clarify for the sake of rigor and reproducibility.

      Changes to manuscript.

      Line 574: “The experimenters were not blinded to the treatment group, i.e. lesioned or non-lesioned, but they were blind to the lesion size both during the behavior experiments and most of the data processing.”

      Minor:

      L127-128: "In order to test...lesioned the auditory cortex bilaterally in 7 out of 16 animals". I would clarify this by changing the word animals to "mice" and 7 out of 16 by stating n = 9 and n = 7 are control and lesion groups, respectively.

      Agreed.

      Changes to manuscript.

      Line 129: “...compared the performance of mice with bilateral lesions of the auditory cortex (n = 7) with non-lesioned controls (n = 9)”

      L225-226: You rule out self-generated sounds as a likely source of behavioral modulation by citing Nate Sawtell's paper in the DCN. However, Stephen David's lab suggested that in marmosets, post sound activity in central IC may in fact reflect self-generated sounds during licking. I suggest addressing this with a nod to SVD's work (Singla et al., 2017; but see Shaheen et al., 2021).

      Agreed.

      Changes to manuscript.

      Line 243: “(Singla et al., 2017; but see Shaheen et al., 2021)”

      Line 238 - 239: You state that proportions only deviate greater than 10% for one of the four statistically significant clusters. Something must be unclear here because I don't understand: The delta between the groups in the significant clusters of Fig 5C is (from left to right) 20%, 20%, 38%, and 12%. Please clarify.

      Our wording was meant to convey that a deviation “from a 50/50 split” of 10% means that each side deviates from 50 by 10% resulting in a 40/60 (or 60/40) split. We agree that that has the potential to confuse readers and is not as clear as it could be and have therefore dropped the ambiguous wording.

      Changes to manuscript.

      Line 253: ”,..the difference between the groups was greater than 20% for only one of them.”

      L445: I looked at the cited Allen experiment; I'd be cautious with the interpretation here. A monosynaptic IC->striatum projection is news to me. I think Allen Institute used an AAV1-EGFP virus for these experiments, no? As you know, AAV1 is quite transsynaptic. The labeled fibers in striatum of that experiment may reflect disynaptic labeling of MGB neurons (which do project to striatum).

      Agreed. We deleted the reference to this Allen experiment.

      L650: Please define "network activity". Is this the fluorescence value for each ROI on each frame of each trial? Averaged fluorescence of each ROI per frame? Total frame fluorescence including neuropil? Depending on who you ask, each of these measures provides some meaningful readout of network activity, so clarification would be useful.

      Changes to manuscript.

      Line 707: “Logistic regression models were trained on the network activity of each session, i.e., the ΔF/F values of all ROIs in each session, to classify hit vs miss trials. This was done on a frame-by-frame basis, meaning that each time point (frame) of each session was trained separately.

      Figure 3 narrative or legend: Listing the F values for the anova would be useful. There is pretty clearly a main effect of training session for hits, but what about for the false alarms? That information is important to solidify the result, and would help more specialized readers interpret the d-prime plot in this figure.

      Agreed. There were significant main effects of training day for both hit rates and false alarm rates (as well as d’).

      Changes to manuscript.

      Line 165: “The ability of the mice to learn and perform the click detection task was evident in increasing hit rates and decreasing false alarm rates across training days (Figure 3A, p < 0.01, mixed-design ANOVAs).”

      In summary, thank you for undertaking this work. Your conclusions are provocative, and thus will likely influence the field's direction for years to come.

      Thank you for those kind words and valuable and constructive feedback, which has certainly improved the manuscript.

      Reviewer #2 (Recommendations For The Authors):

      MAJOR CONCERNS

      (1) (Fig. 5) What fraction of individual neurons actually encode task-related information in each animal group? How many neurons respond to sound? The clustering and decoding analyses are interesting, but they obscure these simple questions, which get more directly at the main questions of the study. Suggested approach: For a direct comparison of AC-lesioned and -non-lesioned animals, why not simply compare the mean difference between PSTH response for each neuron individually? To test for trial outcome effects, compare Hit and Miss trials (same stimulus, different behavior) and for sound response effects, compare Hit and False alarm trials (same behavior, different response). How do you align for time in the latter case when there's no stimulus? Align to the first lick event. The authors should include this analysis or explain why their approach of jumping right to analysis of clusters is justified.

      We have now calculated the fraction of neurons that encode trial outcome by comparing hit and miss trial activity. That fraction does not differ between non-lesioned animals and lesioned animals as a whole, but is significantly smaller in mice with partial lesions. The author’s suggestion of comparing hit and false alarm trial activity to assess sound responsiveness is problematic because hit trials involve reward delivery and consumption. Consequently, they are behaviorally very different from false alarm trials (not least because hit trials tend to contain much more licking). Therefore, we calculated the fraction of neurons that respond to the acoustic stimulus by comparing activity before and after stimulus onset in miss trials. We found no significant difference between the non-lesioned and lesioned mice or between subgroups.

      We have addressed these points with the following changes to the manuscript:

      Line 217: “Indeed, close to half (1272 / 2649) of all neurons showed a statistically significant difference in response magnitude between hit and miss trials, while only a small fraction (97 / 2649) exhibited a significant response to the sound.”

      Line 307: “Although the proportion of individual neurons with distinct response magnitudes in hit and miss trials in lesioned mice did not differ from that in non-lesioned mice, it was significantly lower when separating out mice with partial lesions (Figure 6 – figure supplement 3).”

      Line 648: “Analysis of task-modulated and sound-driven neurons. To identify individual neurons that produced significantly different response magnitudes in hit and miss trials, we calculated the mean activity for each stimulus trial by taking the mean activity over the 5 seconds following stimulus presentation and subtracting the mean activity over the 2 seconds preceding the stimulus during that same trial. A Mann-Whitney U test was then performed to assess whether a neuron showed a statistically significant difference (Benjamini-Hochberg adjusted p-value of 0.05) in response magnitude between hit and miss trials. The analysis was performed using equal numbers of hit and miss trials at each sound level to ensure balanced sound level distributions. If, for a given sound level, there were more hit than miss trials, we randomly selected a sample of hit trials (without substitution) to match the sample size for the miss trials and vice versa. Sounddriven neurons were identified by comparing the mean miss trial activity before and after stimulus presentation. Specifically, we performed a Mann-Whitney U test to assess whether there was a statistically significant difference (Benjamini-Hochberg adjusted p-value of 0.05) between the mean activity over the 2 seconds preceding the stimulus and the mean activity over the 1 second period following stimulus presentation.”

      Some more specific concerns about focusing only on cluster-level and population decoding analysis are included below.

      (2) (L 234) "larger field of view". Do task-related or lesion-dependent effects depend on the subregion of IC imaged? Some anatomists would argue that the IC shell is not a uniform structure, and concomitantly, task-related effects may differ between fields. Did coverage of IC subregions differ between experimental groups? Is there any difference in task related effects between subregions of IC? Or maybe all this work was carried out only in the dorsal area? The differences between lesioned and non-lesioned animals are relatively small, so this may not have a huge impact, but a more nuanced discussion that accounts for observed or potential (if not tested) differences between regions of the IC.

      The specific subregion coverage could also impact the decoding analysis (Fig 6), and if possible it might be worth considering an interaction between field of view and lesion size on decoding.

      Each day we chose a new imaging location to avoid recording the same neurons more than once and aimed to sample widely across the optically accessible surface of the IC. We typically stopped the experiment only when there were no more new areas to record from. In terms of the depth of the imaged neurons, we were limited by the fact that corticorecipient neurons become sparser with depth and that the signal available from the GCaMP6f labeling of the Ai95 mice becomes rapidly weaker with increasing distance from the surface. This meant that we recorded no deeper than 150 µm from the surface of the IC. Consequently, while there may have been some variability in the average rostrocaudal and mediolateral positioning of imaging locations from animal to animal due to differences between mice in how much of the IC surface was visible, cranial window positioning, and in neuronal labeling etc, our dataset is anatomically uniform in that all recorded neurons receive input from the auditory cortex and are located within 150 µm of the surface of the IC. Therefore, we think it highly unlikely that small sampling differences across animals could have a meaningful impact on the results.

      Given that there is no consensus as to where the border between the dorsal and external/lateral cortices of the IC is located and that it is typically difficult to find reliable anatomical reference points (the location of the borders between the IC and surrounding structures is not always obvious during imaging, i.e. a transition from a labeled area to a dark area near the edge of the cranial window could indicate a border with another structure, but also the IC surface sloping away from the window or simply an unlabeled area within the IC), we made no attempt to assign our recordings from corticorecipient neurons to specific subdivisions of the IC.

      Changes to manuscript.

      Line 195: “We then proceeded to record the activity of corticorecipient neurons within about 150 µm of the dorsal surface of the IC using two-photon microscopy (Figure 4B, Video 1).”

      Line 375: “We imaged across the optically accessible dorsal surface of the IC down to a depth of about 150 µm below the surface. Consequently, the neurons we recorded were located predominantly in the dorsal cortex. However, identifying the borders between different subdivisions of the IC is not straightforward and we cannot rule out the possibility that some were located in the lateral cortex.”

      (3) (L 482-483) "auditory cortex is not required for the task-related activity recording in IC neurons of mice performing a sound detection task". Most places in the text are clearer, but this statement is confusing. Yes, animals with lesions can have a "normal"-looking IC, but does that mean that AC does not strongly modulate IC during this behavior in normal animals? The authors have shown convincingly that subcortical areas can both shape behavior and modulate IC normally, but AC may still be required for IC modulation in non-lesioned animals. Given the complexity of this system, the authors should make sure they summarize their results consistently and clearly throughout the manuscript.

      The reviewer raises an important point. What we have shown is that corticorecipient dorsal IC neurons in mice without auditory cortex show neural activity during a sound detection task that is largely indistinguishable from the activity of mice with an intact auditory cortex. In lesioned mice, the auditory cortex is thus not required. Whether the IC activity of the non-lesioned group can be shaped by input from the auditory cortex in a meaningful way in other contexts, such as during learning, is a question that our data cannot answer.

      Changes to manuscript.

      Line 508: "While modulation of IC activity by this descending projection has been implicated in various functions, most notably in the plasticity of auditory processing, we have shown in mice performing a sound detection task that IC neurons show task-related activity in the absence of auditory cortical input."

      LESSER CONCERNS

      (L. 106-107) "Optogenetic suppression of cortical activity is thus also unsuitable..." It appears that behavior is not completely abolished by the suppression. One could also imagine using a lower dose of muscimol for partial inactivation of AC feedback. When some behavior persists, it does seem possible to measure task-related changes in the IC. This may not be necessary for the current study, but the authors should consider how these transient methods could be applied usefully in the Discussion. What about inactivation of cortical terminals in the IC? Is that feasible?

      Our argument is not that acute manipulations are unsuitable because they completely abolish the behavior, but because they significantly alter the behavior. Although it would not be trivial to precisely measure the extent of pharmacological cortical silencing in behaving mice that have been fitted with a midbrain window, it should be possible to titrate the size of a muscimol injection to achieve partial silencing of the auditory cortex that does not fully abolish the ability to detect sounds. However, such an outcome would likely render the data uninterpretable. If no effect on IC activity was observed, it would not be possible to conclude whether this was due to the fact that the auditory cortex was only partially silenced or that projections from the auditory cortex have no influence on the recorded IC activity. Similarly, if IC activity was altered, it would not be possible to say whether this was due to altered descending modulation resulting from the (partially) silenced auditory cortex or to the change in behavior, which would likely be reflected in the choice-related activity measured in the IC.

      Silencing of corticocollicular axons in the IC is potentially a more promising approach and we did devote a considerable amount of time and effort to establishing a method that would allow us to simultaneously image IC neurons while silencing corticocollicular axons, trying both eNpHR3.0 and Jaws with different viral labeling approaches and mouse lines. However, we ultimately abandoned those attempts because we were not convinced that we had achieved sufficient silencing or that we would be able to convincingly verify this. Furthermore, axonal silencing comes with its own pitfalls and the interpretation of its consequences is not straightforward. Given that our discussion already contains a section (line 421) on axonal silencing, we do not feel there would be any benefit in adding to that.

      (Figure 1). Can the authors break down the performance for FA and HR, as they do in Fig. 3? It would be helpful to know what aspect of behavior is impaired by the transient inactivation.

      Good point. Figure 1 has been updated to show the results separately for hit rates, false alarms and d’. The new figure indicates that the change in d’ is primarily a consequence of altered false alarm rates. Please also see our response to a related comment by reviewer #1.

      Changes to manuscript.

      New figure 1.

      (Figure 4 legend). Minor: Please clarify, what is time 0 in panel C? Time of click presentation?

      Yes, that is correct.

      Changes to manuscript.

      Line 209: ”Vertical line at time 0 s indicates time of click presentation.”

      (L. 228-229). There has been a report of lick and other motor related activity in the IC - e.g., see Shaheen, Slee et al. (J Neurosci 2021), the timing of which suggests that some of it may be acoustically driven.

      Thanks for pointing this out. Shaheen et al., 2021 should certainly have been cited by us in this context as well as in other parts of the manuscript.

      Changes to manuscript.

      Line 243: “(Singla et al., 2017; but see Shaheen et al., 2021)”

      Also, have the authors considered measuring a peri-lick response? The difference between hit and miss trials could be perceptual or it could reflect differences in motor activity. This may be hard to tease apart, but, for example, one can test whether activity is stronger on trials with many licks vs. few licks?

      (L. 261) "Behavior can be decoded..." similar or alternative to the previous question of evoked activity, can you decode lick events from the population activity?

      The difference between hit and miss trial activity almost certainly partially reflects motor activity associated with licking. This was stated in the Discussion, but to make that point more explicitly, we now include a plot of average false alarm trial activity, i.e. trials without sound (catch trials) in which animals licked (but did not receive a reward).

      Given a sufficient number of catch trials, it should be possible to decode false alarm and correct rejection trials. However, our experiment was not designed with that in mind and contains a much smaller number of catch trials than stimulus trials (approximately one tenth the number of stimulus trials), so we have not attempted this.

      Changes to manuscript.

      New Figure 4 - figure supplement 1.

      (L. 315) "Pre-stimulus activity..." Given reports of changes in activity related to pupil-indexed arousal in the auditory system, do the authors by any chance have information about pupil size in these datasets?

      Given that all recordings were performed in the dark, fluctuations in pupil diameter were relatively small. Therefore, we have not made any attempt to relate pupil diameter to any of the variables assessed in this manuscript.

      (L. 412) "abolishes sound detection". While not exactly the same task, the authors might comment on Gimenez et al (J Neurophys 2015) which argued that temporary or permanent lesioning of AC did not impair tone discrimination. More generally, there seems to be some disagreement about what effects AC lesions have on auditory behavior.

      Thank you for this suggestion. Gimenez et al. (2015) investigated the ability of freely moving rats to discriminate sounds (and, in addition, how they adapt to changes in the discrimination boundary). Broadly consistent with later reports by Ceballo et al. (2019) (mild impairment) and O’Sullivan et al. (2019) (no impairment), Gimenez et al. (2015) reported that discrimination performance is mildly impaired after lesioning auditory cortex. Where the results of Gimenez et al. (2015) stand out is in the comparatively mild impairments that were seen in their task when they used muscimol injections, which contrast with the (much) larger impairments reported by others (e.g. Talwar et al., 2001; Li et al., 2017; Jaramillo and Zador, 2014).

      Changes to manuscript.

      Line 433: ”However, transient pharmacological silencing of the auditory cortex in freely moving rats (Talwar et al., 2001), as well as head-fixed mice (Li et al., 2017), completely abolishes sound detection (but see Gimenez et al., 2015).”

      (L. 649) "... were generally separable" Is the claim here that the clusters are really distinct from each other? This is unexpected, and it might be helpful if the authors could show this result in a figure.

      The half-sentence that this comment refers to has been removed from the methods section. Please also see a related comment by reviewer #1 which prompted us to add the following to the methods section.

      Changes to manuscript.

      Line 666: “While clustering is a useful approach for organizing and visualizing the activity of large and heterogeneous populations of neurons we need to be mindful that, given continuous distributions of response properties, the locations of cluster boundaries can be somewhat arbitrary and/or reflect idiosyncrasies of the chosen method and thus vary from one algorithm to another. We employed an approach very similar to that described in Namboodiri et al. (2019) because it is thought to produce stable results in high-dimensional neural data (Hirokawa et al. 2019).”

      Reviewer #3 (Recommendations For The Authors):

      (1) The authors must absolutely clarify if the hit versus misses decoding and clustering analysis is done for a single sound level or for multiple sound levels (what is the fraction of trials for each sound leve?). If the authors did it for multiple sound levels they should redo all analyses sound-level by sound-level, or for a single sound level if there is one that dominates. No doubt that there is information about the trial outcome in IC, but it should not be over-estimated by a confound with stimulus information.

      This is an important point. The original clustering analysis was carried out across different sound levels. We have now carried out additional analysis for distinguishing between two alternative explanations of the data, which were also raised by reviewer #1. – that the difference in neural activity between hit and miss trials could reflect a) the animals’ behavior or b) relatively more hit trials at higher sound levels, which would be expected to produce stronger responses. If the data favored b), we would expect no difference in activity between hit and miss trials when plotted separately for different sound levels. The new figure 4 - figure supplement 1 indicates that that is not the case. Hit and miss trial activity are clearly distinct even when plotted separately for different sound levels, confirming that this difference in activity reflects the animals’ behavior rather than sensory information.

      We made the following changes to manuscript.

      Line 214: “While averaging across all neurons cannot capture the diversity of responses, the averaged response profiles suggest that it is mostly trial outcome rather than the acoustic stimulus and neuronal sensitivity to sound level that shapes those responses (Figure 4 – figure supplement 1).”

      Differences in the distributions of sound levels in the different trial types could also potentially confound the decoding into hit and miss trials. Our analysis actually aimed to take this into account but, unfortunately, we failed to include sufficient details in the methods section.

      Changes to manuscript.

      Line 710: “Rather than including all the trials in a given session, only trials of intermediate difficulty were used for the decoding analysis. More specifically, we only included trials across five sound levels, comprising the lowest sound level that exceeded a d’ of 1.5 plus the two sound levels below and above that level. That ensured that differences in sound level distributions would be small, while still giving us a sufficient number of trials to perform the decoding analysis.“

      In this context, it is worth bearing in mind that a) the decoding analysis was done on a frame-byframe basis, meaning that the decoding score achieved early in the trial has no impact on the decoding score at later time points in the trial, b) sound-driven activity predominantly occurs immediately after stimulus onset and is largely over about 1 s into the trial (see cluster 3, for instance, or average miss trial activity in figure 4 - figure supplement 1), c) decoding performance of the behavioral outcome starts to plateau 500-1000 ms into the trial and remains high until it very gradually begins to decline after about 2 s into the trial. In other words, decoding performance remains high far longer than the stimulus would be expected to have an impact on the neurons’ activity. Therefore, we would expect any residual bias due to differences in the sound level distribution that our approach did not control for to be restricted to the very beginning of the trial and not to meaningfully impact the conclusions derived from the decoding analysis.

      Furthermore, we carried out an additional decoding analysis for one imaging session in which we had a sufficient number of trials to perform the analysis not only over the five (59, 62, 65, 68, 71 dB SPL) original sound levels, but also over a reduced range of three (62, 65, 68 dB SPL) sound levels, as well as a single (65 dB SPL) sound level (Figure 6 - figure supplement 1). The mean sound level difference between the hit trial distributions and miss trial distributions for these three conditions were 3.08, 1.01 and 0 dB, respectively. This analysis suggests that decoding performance is not meaningfully impacted by changing the range of sound levels (and sound level distributions) other than that including fewer sound levels means fewer trials and thus noisier decoding.

      Changes to manuscript.

      Line 287: ”...and was not meaningfully affected by differences in sound level distributions between hit and miss trials (Figure 6 – figure supplement 1).”

      Finally, in order to supplement the decoding analysis, we determined for each individual neuron whether there was a significant difference between the average hit and average miss trial activity. Note that this was done using equal numbers of hit and miss trials at each sound level to ensure balanced sound level distributions and to rule out any potential confound of sound level. This revealed that the proportion of neurons containing “information about trial outcome” was generally very high, close to 50% on average, and not significantly different between lesioned and non-lesioned mice.

      Changes to manuscript.

      Line 307: “Although the proportion of individual neurons with distinct response magnitudes in hit and miss trials in lesioned mice did not differ from that in non-lesioned mice, it was significantly lower when separating out mice with partial lesions (Figure 6 – figure supplement 3).”

      Line 648: “Analysis of task-modulated and sound-driven neurons. To identify individual neurons that produced significantly different response magnitudes in hit and miss trials, we calculated the mean activity for each stimulus trial by taking the mean activity over the 5 seconds following stimulus presentation and subtracting the mean activity over the 2 seconds preceding the stimulus during that same trial. A Mann-Whitney U test was then performed to assess whether a neuron showed a statistically significant difference (Benjamini-Hochberg adjusted p-value of 0.05) in response magnitude between hit and miss trials. The analysis was performed using equal numbers of hit and miss trials at each sound level to ensure balanced sound level distributions. If, for a given sound level, there were more hit than miss trials we randomly selected a sample of hit trials (without substitution) to match the sample size for the miss trials and vice versa. ”

      (2) I have the feeling that the authors do not exploit fully the functional data recorded with two-imaging. They identify several cluster but do not describe their functional differences. For example, cluster 3 is obviously mainly sensory driven as it is not modulated by outcome. This could be mentioned. This could also be used to rule out that trial outcome is the results of insufficient sensory inputs. Could this cluster be used to predict trial outcome at the onset response? Could it be used to predict the presence of the sound, and with which accuracy. The authors discuss a bit the different cluster type, but in a very elusive manner. I recognize that one should be careful with the use of signal analysis methods in calcium imaging but a simple linear deconvolution of the calcium dynamic who help to illustrate the conclusions that the authors propose based on peak responses. It would also be very interesting to align the clusters responses (deconvolved) to the timing of licking and rewards event to check if some clusters do not fire when mice perform licks before the sound comes. It would help clarify if the behavioral signals described here require both the presence of the sound and the behavioral action or are just the reflection of the motor command. As noted by the authors, some clusters have late peak responses (2 and 5). However, 2 and 5 are not equivalent and a deconvolution would evidence that much better. 2 has late onset firing. 5 has early onset but prolonged firing.

      We agree with the reviewer’s statement that “cluster 3 is obviously mainly sensory driven”. In the Discussion we refer to cluster 3 as having a “largely behaviorally invariant response profile to the auditory stimulus” (line X), which is consistent with the statement of the reviewer. With regard to the reviewer’s suggestion to describe the “functional differences” between the clusters, we would like to refer to the subsequent three sentences of the same paragraph in which we speculate on the cognitive and behavioral variables that may underlie the response profiles of different clusters. Given the limitations imposed by the task structure, we do not think it is justified to expand on this.

      We have added an additional analysis in order to explicitly address the question of which neurons are sound responsive (please also see response to point 3 below and to point 1 of reviewer #2). That trial outcome could be predicted on the basis of only the sound-responsive neurons’ activity during the initial period of the trial (“predict trial outcome at the onset response”) is unlikely given their small number (only 97 of 2649 neurons show a statistically significant sound-evoked response) and given that only a minority (42/98) of those sound-driven neurons are also modulated by trial outcome within that initial trial period (i.e. 0-1s after stimulus onset; data not shown).

      Changes to manuscript.

      Line 219: “..., while only a small fraction (97 / 2649) exhibited a significant response to the sound.”

      Line 658: “Sound-driven neurons were identified by comparing the mean miss trial activity before and after stimulus presentation. Specifically, we performed a Mann-Whitney U test to assess whether there was a statistically significant difference (Benjamini-Hochberg adjusted p-value of 0.05) between the mean activity over the 2 seconds preceding the stimulus and the mean activity over the 1 second period following stimulus presentation. This analysis was performed using miss trials with click intensities from 53 dB SPL to 65 dB SPL (many sessions contained very few or no miss trials at higher sound levels).”

      While calcium traces represent an indirect measure of neural activity, deconvolution does not necessarily provide an accurate picture of the spiking underlying those traces and has the potential to introduce additional problems. For instance, deconvolution algorithms tend to perform poorly at inferring the spiking of inhibited neurons (Vanwalleghem et al., 2021). Given that suppression is such a prominent feature of IC activity and is evident both in our calcium data as well as in the electrophysiology data of others (Franceschi and Barkat, 2021), we decided against using deconvolved spikes in our analyses. See also the side-by-side comparison below of the hit and miss trial activity of one example neuron based on either the calcium trace (left) or deconvolved spikes (right) (extracted using the OASIS algorithm (Friedrich et al., 2017) incorporated into suite2p (Pachitariu et al., 2016).

      Author response image 1.

      (3) Along the same line, the very small proportion of really sensory driven neurons (cluster 3) is not discussed. Is it what on would expect in typical shell or core IC neurons?

      As requested by reviewer #2 and mentioned in response to the previous point, we have now quantified the number of neurons in the dataset that produced significant responses to sound (97 / 2649). For a given imaging area, the fraction of neurons that show a statistically significant change in neural activity following presentation of a click of between 53 dB SPL and 65 dB SPL rarely exceeded ten percent. While that number is low, it is not necessarily surprising given the moderate intensity and very short duration of the stimuli. For comparison: Using the same transgenics, labeling approach and imaging setup and presenting 200-ms long pure tones at 60 dB SPL with frequencies between 2 kHz and 64 kHz, we typically find that between a quarter and a third of neurons in a given imaging area exhibit a statistically significant response (data not shown).

      Changes to manuscript.

      Line 219: “..., while only a small fraction (97 / 2649) exhibited a significant response to the sound.”

      Line 658: “Sound-driven neurons were identified by comparing the mean miss trial activity before and after stimulus presentation. Specifically, we performed a Mann-Whitney U test to assess whether there was a statistically significant difference (Benjamini-Hochberg adjusted p-value of 0.05) between the mean activity over the 2 seconds preceding the stimulus and the mean activity over the 1 second period following stimulus presentation. This analysis was performed using miss trials with click intensities from 53 dB SPL to 65 dB SPL (many sessions contained very few or no miss trials at higher sound levels).”

      Line 220: “While the number of sound-responsive neurons is low, it is not necessarily surprising given the moderate intensity and very short duration of the stimuli. For comparison: Using the same transgenics, labeling approach and imaging setup and presenting 200-ms long pure tones at 60 dB SPL with frequencies between 2 kHz and 64 kHz, we typically find that between a quarter and a third of neurons in a given imaging area exhibit a statistically significant response (data not shown).”

      (4) In the discussion, the interpretation of different transient and permanent cortical inactivation experiment is very interesting and well balanced given the complexity of the issue. There is nevertheless a comment that is difficult to follow. The authors state:

      If cortical lesioning results in a greater weight being placed on the activity in spared subcortical circuits for perceptual judgements, we would expect the accuracy with which trial-by-trial outcomes could be read out from IC neurons to be greater in mice without auditory cortex. However, that was not the case.

      However, there is no indication that the activity they observe in shell IC is causal to the behavioral decision and likely it is not. There is also no indication that the behavioral signals seen by the authors reflect the weight put on the subcortical pathway for behavior. I find this argument handwavy and would remove it.

      While we are happy to amend this section, we would not wish to remove it because a) we believe that the point we are trying to make here is an important and reasonable one and b) because it is consistent with the reviewer’s comment. Hopefully, the following will make this clearer: In order for the mouse to make a perceptual judgment and act upon it - in the context of our task, hearing a sound and then licking a spout - auditory information needs to be read out and converted into a motor command. If the auditory cortex normally plays a key role in such perceptual judgments, cortical lesions would require the animal to base its decisions on the information available from the remaining auditory structures, potentially including the auditory midbrain. This might result in a greater correspondence between the mouse’s behavior and the neural activity in those structures. That we did not observe this outcome for the IC could mean that the auditory cortex did not contribute to the relevant perceptual judgments (sound detection) in the first place. Therefore, no reweighting of signals from the other structures is necessary. Alternatively, greater weight might be placed exclusively on structures other than the auditory midbrain, e.g. the thalamus. The latter would imply that the contribution of the IC remains the same. This includes the possibility that the IC shell does not play a causal role in the behavioral decision – in either control mice or mice with cortical lesions – as suggested by the reviewer.

      Changes to manuscript.

      Line 471: “This could imply that, following cortical lesions, greater weight is placed on structures other than the IC, with the thalamus being the most likely candidate, ..”

      (5) In Fig. 5 the two colors used in B and C are the same although they describe different categories.

      The dark green and ‘deep orange’ we used to distinguish between non-lesioned and lesioned in Figure 5C are slightly lighter than the colors used to distinguish between these two categories in other figures and therefore might be more easily confused with the blue and red in Figure 5B. This has been changed.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We thank the Reviewers and Editors for the constructive comments, which we believe have significantly improved the quality of our manuscript.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      (1) With respect to the predictions, the authors propose that the subjects, depending on their linguistic background and the length of the tone in a trial, can put forward one or two predictions. The first is a short-term prediction based on the statistics of the previous stimuli and identical for both groups (i.e. short tones are expected after long tones and vice versa). The second is a long-term prediction based on their linguistic background. According to the authors, after a short tone, Basque speakers will predict the beginning of a new phrasal chunk, and Spanish speakers will predict it after a long tone.

      In this way, when a short tone is omitted, Basque speakers would experience the violation of only one prediction (i.e. the short-term prediction), but Spanish speakers will experience the violation of two predictions (i.e. the short-term and long-term predictions), resulting in a higher amplitude MMN. The opposite would occur when a long tone is omitted. So, to recap, the authors propose that subjects will predict the alternation of tone durations (short-term predictions) and the beginning of new phrasal chunks (long-term predictions).

      The problem with this is that subjects are also likely to predict the completion of the current phrasal chunk. In speech, phrases are seldom left incomplete. In Spanish is very unlikely to hear a function-word that is not followed by a content-word (and the opposite happens in Basque). On the contrary, after the completion of a phrasal chunk, a speaker might stop talking and a silence might follow, instead of the beginning of a new phrasal chunk.

      Considering that the completion of a phrasal chunk is more likely than the beginning of a new one, the prior endowed to the participants by their linguistic background should make us expect a pattern of results actually opposite to the one reported here.

      We thank the Reviewer #1 for this pertinent comment and the opportunity to address this issue. A very similar concern was also raised by Reviewer #2. Below we try to clarify the motivations that led us to predict that the hypothesized long-term predictions should manifest at the onset (and not within or the end) of a perceptual chunk. 

      Reviewers #1 and #2 contest a critical assumption of our study i.e., the fact that longterm predictions should occur at the beginning of a rhythmic chunk as opposed to its completion. They also contest the prediction deriving from this view i.e., omitting the first sound in a perceptual chunk (short for Spanish, long for Basque) would lead to larger error responses than omitting a later element. They suggest an alternative view: the omission of tones at the end of a perceptual rhythmic chunk would evoke larger error responses than omissions at its onset, as subjects are more likely to predict the completion of the chunk than its beginning. This view predicts an interaction effect in the opposite direction of our findings. 

      While we acknowledge this as a plausible hypothesis, we believe that the current literature provides strong support for our view. Indeed, many studies in the rhythm and music perception literature have investigated the ERP responses to deviant sounds and omissions placed at different positions within rhythmic patterns (e.g., Ladinig et al., 2009; Bouwer et al., 2016; Brochard et al., 2003; Potter et al., 2009; Yabe et al., 2001). For instance, Lading et al., 2009 presented participants with metrical rhythmical sound sequences composed of eight tones. In some deviant sequences, the first or a later tone was omitted. They found that earlier omissions elicited earlier and higher-amplitude MMN responses than later omissions (irrespective of attention). Overall, this and other studies showed that the amplitude of ERP responses are larger when deviants occur at positions that are expected to be the “start” of a perceptual group - “on the beat” in musical terms - and decline toward the end of the chunk. According to some of these studies, the first element of a chunk is particularly important to track the boundaries of temporal sequences, which is why more predictive resources are invested at that position. We believe that this body of evidence provides robust bases for our hypotheses and the directionality of our predictions.

      An additional point that should be considered concerns the amplitude of the prediction error response elicited by the omission. From a predictive coding perspective, the omission of the onset of a chunk should elicit larger error responses because the system is expecting the whole chunk (i.e., two tones/more acoustic information). On the other hand, the omission of the second tone - in the transition between two tones within the chunk - should elicit a smaller error response because the system is expecting only the missing tone (i.e. less acoustic information). 

      Given the importance of these points, we have now included them in the updated version of the paper, in which we try to better clarify the rationale behind our hypothesis (see Introduction section, around the 10th paragraph).

      (2) The authors report an interaction effect that modulates the amplitude of the omission response, but caveats make the interpretation of this effect somewhat uncertain. The authors report a widespread omission response, which resembles the classical mismatch response (in MEG) with strong activations in sensors over temporal regions. Instead, the interaction found is circumscribed to four sensors that do not overlap with the peaks of activation of the omission response.

      We thank the Reviewer for this comment. As mentioned in the provisional response, the approach employed to identify the presence of an interaction effect was conservative: We utilized a non-parametric test on combined gradiometers data, without making a priori assumptions about the location of the effect, and employed small cluster thresholds (cfg.clusteralpha = 0.05) to increase the chances of detecting highly localized clusters with large effect sizes. The fact that the interaction effect arises in a relatively small cluster of sensors does not alter its statistical robustness. It should be also considered that in the present analyses we focused on planar gradiometer data that, compared to magnetometers and axial gradiometers, present more fine-grained spatial resolution and are more suited for picking up relatively small effects. 

      The partial overlap of the cluster with the activation peaks may simply reflect the fact that different sources contribute to the generation of the omission-MMN, which has been reported in several studies (e.g., Zhang et al., 2018; Ross & Hamm, 2020).  We value the Reviewer’s input and are grateful for the opportunity to address these considerations.

      Furthermore, the boxplot in Figure 2E suggests that part of the interaction effect might be due to the presence of two outliers (if removed, the effect is no longer significant). Overall, it is possible that the reported interaction is driven by a main effect of omission type which the authors report, and find consistently only in the Basque group (showing a higher amplitude omission response for long tones than for short tones). Because of these points, it is difficult to interpret this interaction as a modulation of the omission response.

      We thank the Reviewer for the comment and appreciate the opportunity to address these concerns. We have re-evaluated the boxplot in Figure 2E and want to clarify that the two participants mentioned by Reviewer #1, despite being somewhat distant from the rest of the group, are not outliers according to the standard Tukey’s rule. As shown in the figure below, no participant fell outside the upper (Q3+1.5xIQR) and lower whiskers (Q1-1.5xIQR) of the boxplot. 

      Moreover, we believe that the presence of a main effect of omission type does not impact the interpretation of the interaction, especially considering that these effects emerge over distinct clusters of channels (see Fig. 1 C; Supplementary Fig. 2 A). 

      Based on these considerations - and along with the evidence collected in the control study and the source reconstruction data reported in the new version of the manuscript - we find it unlikely that the interaction effect is driven by outliers or by a main effect of omission type. We appreciate the opportunity provided by the Reviewer to address these concerns, as we believe they strengthen the claim that the observed effect is driven by the hypothesized long-term linguistic priors rather than uncontrolled group differences.

      Author response image 1.

      It should also be noted that in the source analysis, the interaction only showed a trend in the left auditory cortex, but in its current version the manuscript does not report the statistics of such a trend.

      We  appreciate  the  Reviewer’s  suggestion  to  incorporate  more comprehensive source analyses. In the new version of the paper, we perform new analyses on the source data using a new Atlas with more fine-grained parcellations of the regions of interests (ROIs) (Brainnetome atlas; Fan et al., 2016) and focusing on peak activity to increase response’s sensitivity in space and time. We therefore invite the Reviewer to read the updated part on source reconstruction included in the Results and Methods sections of the paper.  

      Reviewer #1 (Recommendations For The Authors):

      While I have described my biggest concerns with respect to this work in the public review, here I list more specific points that I hope will help to improve the manuscript. Some of these are very minor, but I hope you will still find them constructive. 

      (1) I understand the difficulties implied in recruiting subjects from two different linguistic groups, but with 20 subjects per group and a between-groups design, the current study is somewhat underpowered. A post-hoc power analysis shows an achieved power of 46% for medium effect sizes (d = 0.5, and alpha = 0.05, one-sided test). A sensitivity analysis shows that the experiment only has 80% power for effect sizes of d = 0.8 and above. It would be important to acknowledge this limitation in the manuscript. 

      We thank the Reviewer for reporting these analyses. It must be noted that our effect of interest was based on Molnar et al.’s (2016) behavioral experiment, in which a sample size of 16 subjects per group was sufficient to detect the perceptual grouping effect. In Yoshida et al., (2010), the perceptual grouping effect emerged with two groups of 20 7–8-month-old Japanese and English-learning infants. Based on these previous findings, we believe that a sample size of 20 participants per group can be considered appropriate for the current MEG study. We clarified these aspects in the Participants section of the manuscript, in which we specified that previous behavioral studies detected the perceptual grouping with similar sample sizes. Moreover, to acknowledge the limitation highlighted by the Reviewer, we also include the power and sensitivity analysis in a note in the same section (see note 2 in the Participants section).

      (2) All the line plots in the manuscript could be made much more informative by adding 95% CI bars. For example, in Figure 4A, the omission response for the long tone departs from the one for the short tone very early. Adding CIs would help to assess the magnitude of that early difference. Error bars are present in Figure 3, but it is not specified what these bars represent. 

      Thanks for the comments. We added the explanation of the error bars in the new version of Figure 3. For the remaining figures, we prefer maintaining the current version of the ERF, as the box-plots accompanying them provide information about the distribution of the effect across participants.

      (3) In the source analysis, there is only mention of an interaction trend in the left auditory cortex, but no statistics are presented. If the authors prefer to mention such a trend, I think it would be important to provide its stats to allow the reader to assess its relevance. 

      We performed new analysis on the source data, all reported in the updated version of the manuscript.

      (4) In the discussion section, the authors refer to the source analysis and state that "the interaction is evident in the left". But if only a statistical trend was observed, this statement would be misleading. 

      We agree with this comment. We invite the Reviewer to check the new part on source reconstruction, in which contrasts going in the same direction of the sensor level data are performed.

      (5) In the discussion the authors argue that "This result highlights the presence of two distinct systems for the generation of auditory" that operate at different temporal scales, but the current work doesn't offer evidence for the existence of two different systems. The effects of long-term priors and short-term priors presented here are not dissociated and instead sum up. It remains possible that a single system is in place, collecting statistics of stimuli over a lifetime, including the statistics experienced during the experiment. 

      Thanks for pointing that out. We changed the sentence above as follows: “This result highlights the presence of an active predictive system that relies on natural sound statistics learned over a lifetime to process incoming auditory input”.

      (6) In the discussion, the authors acknowledge that the omission response has been interpreted both as pure prediction and as pure prediction error. Then they declare that "Overall, these findings are consistent with the idea that omission responses reflect, at least in part, prediction error signals.". However an argument for this statement is not provided. 

      Thanks for pointing out this lack of argument. In the new version of the manuscript, we explained our rationale as follows: “Since sensory predictive signals primarily arise in the same regions as the actual input, the activation of a broader network of regions in omission responses compared to tones suggests that omission responses reflect, at least in part, prediction error signals”.

      (7) In the discussion the authors present an alternative explanation in which both groups might devote more resources to the processing of long events, because these are relevant content words. Following this, they argue that "Independently on the interpretation, the lack of a main effect of omission type in the control condition suggests that the long omission effect is driven by experience with the native language." However as there was no manipulation of duration in the control experiment, a lack of the main effect of omission type there does not rule out the alternative explanation that the authors put forward. 

      This is correct; thanks for noticing it. We removed the sentence above to avoid ambiguities.

      Minor points: 

      (8) The scale of the y-axis in Figure 2C might be wrong, as it goes from 9 to 11 and then to 12. If the scale is linear, the top value should be 13, or the bottom value should be 10. 

      Figure 2C has been modified accordingly, thanks for noticing the error.

      (9) There is a very long paragraph starting on page 7 and ending on page 8. Toward the end of the paragraph, the analysis of the control condition is presented. That could start a new paragraph.

      Thanks for the suggestion. We modified the manuscript as suggested.

      Reviewer #2 (Public Review):

      (1) Despite the evidence provided on neural responses, the main conclusion of the study reflects a known behavioral effect on rhythmic sequence perceptual organization driven by linguistic background (Molnar et al. 2016, particularly). Also, the authors themselves provide a good review of the literature that evidences the influence of longterm priors in neural responses related to predictive activity. Thus, in my opinion, the strength of the statements the authors make on the novelty of the findings may be a bit far-fetched in some instances.

      Thanks for the suggestion. A similar point was also advanced by Reviewer 1. In general, we believe our work speaks about the predictive nature of such experiencedependent  effects, and show that these linguistic priors shape sensory processes at very early stages. This is discussed in the sixth and seventh paragraphs of the Discussion section. In the new version of the article, we modified some statements and tried to make them more coherent with the scope of the present work. For instance, we changed "This result highlights the presence of two distinct systems for the generation of auditory predictive models, one relying on the transition probabilities governing the recent past, and another relying on natural sound statistics learned over a lifetime“ with “This result highlights the presence of an active predictive system that relies on natural sound statistics learned over a lifetime to process incoming auditory input”.

      (2) Albeit the paradigm is well designed, I fail to see the grounding of the hypotheses laid by the authors as framed under the predictive coding perspective. The study assumes that responses to an omission at the beginning of a perceptual rhythmic pattern will be stronger than at the end. I feel this is unjustified. If anything, omission responses should be larger when the gap occurs at the end of the pattern, as that would be where stronger expectations are placed: if in my language a short sound occurs after a long one, and I perceptually group tone sequences of alternating tone duration accordingly, when I hear a short sound I will expect a long one following; but after a long one, I don't necessarily need to expect a short one, as something else might occur.

      A similar point was advanced by Reviewer #1. We tried to clarify the rationale behind our hypothesis. Please refer to the response provided to the first comment of Reviewer #1 above.

      (3) In this regard, it is my opinion that what is reflected in the data may be better accounted for (or at least, additionally) by a different neural response to an omission depending on the phase of an underlying attentional rhythm (in terms of Large and Jones rhythmic attention theory, for instance) and putative underlying entrained oscillatory neural activity (in terms of Lakatos' studies, for instance). Certainly, the fact that the aligned phase may differ depending on linguistic background is very interesting and would reflect the known behavioral effect.

      We thank the Reviewer for this comment. We explored in more detail the possibility that the aligned phase may differ depending on linguistic background, which is indeed a very interesting hypothesis. In the phase analyses reported below we focused on the instantaneous phase angle time locked to the onset of short and long tones presented in the experiment.

      In short, we extracted time intervals of two seconds centered on the onset of the tones for each participant (~200 trials per condition) and using a wavelet transform (implemented in Fieldtrip ft_freqanalysis) we targeted the 0.92 Hz frequency that corresponds to the rhythm of presentation of our pairs of tones. We extracted the phase angle for each time point and using the circular statistics toolbox implemented in Matlab we computed the Raleigh z scores across all the sensor space for each tone (long and short tone) and group (Spanish (Spa) dominants and Basque (Eus) dominants). This method evaluates the instantaneous phase clustering at a specific time point, thus evaluating the presence of a specific oscillatory pattern at the onset of the specific tone. 

      Author response image 2.

      Here we observe that the phase clustering was stronger in the right sensors for both groups. The critical point is to evaluate the phase angle (estimated in phase radians) for the two groups and the two tones and see if there are statistical differences. We focused first on the sensor with higher clustering (right temporal MEG1323) and observed very similar phase angles for the two groups both for long and short tones (see image below). We then focused on the four left fronto-temporal sensor pairs who showed the significant interaction: here we observed one sensor (MEG0412) with different effects for the two groups (interaction group by tone was significant, p=0.02): for short tones the “Watson (1961) approximation U2 test” showed a p-value of 0.11, while for long tones the p-value was 0.03 (after correction for multiple comparisons). 

      Overall, the present findings suggest the tendency to phase aligning differently in the two groups to long and short tones in the left fronto-temporal hemisphere. However, the effect could be detected only in one gradiometer sensor and it was not statistically robust. The effect in the right hemisphere was statistically more robust, but it was not sensitive to group language dominance. 

      Due to the inconclusive nature of these analyses regarding the role of language experience in shaping the phase alignment to rhythmic sound sequences, we prefer to keep these results in the public review rather than incorporating them in the article.  Nonetheless, we believe that this decision does not undermine the main finding that the group differences in the MMN amplitude are driven by long-term predictions – especially in light of the many studies indicating the MMN as a putative index of prediction error (e.g., Bendixen et al., 2012; Heilbron and Chait, 2018). Moreover, as suggested in the preliminary reply, despite evoked responses and oscillations are often considered distinct electrophysiological phenomena, current evidence suggests that these phenomena are interconnected (e.g., Studenova et al., 2023). In our view, the hypotheses that the MMN reflects differences in phase alignment and long-term prediction errors are not mutually exclusive.

      Author response image 3.

      (4) Source localization is performed on sensor-level significant data. The lack of  sourcelevel statistics weakens the conclusions that can be extracted. Furthermore, only the source reflecting the interaction pattern is taken into account in detail as supporting their hypotheses, overlooking other sources. Also, the right IFG source activity is not depicted, but looking at whole brain maps seems even stronger than the left. To sum up, source localization data, as informative as it could be, does not strongly support the author's claims in its current state. 

      A similar comment was also advanced by Reviewer #1 (comment 2). We appreciate the suggestion to incorporate more comprehensive source analyses. In the new version of the paper, we perform new analyses on the source data using a new Atlas with more fine-grained parcellations of the ROIs, and focusing on peak activity to increase response’s sensitivity in space and time. We therefore invite the Reviewer to read the updated part on source reconstruction included in the Results and Methods sections of the paper. 

      In the article, we report only the source reconstruction data from ROIs in the left hemisphere, because it is there that the interaction effect arises at the sensor level. However, we also explored the homologous regions in the right hemisphere, as requested by the Reviewer. A cluster-based permutation test focusing on the interaction between language group and omission type was performed on both the right STG and IFG data. No significant interaction emerged in any of these regions. Below a plot of the source activity time series over ROIs in the right STG and IFG. 

      Author response image 4.

      Reviewer #2 (Recommendations For The Authors):

      In this set of private recommendations for the authors, I will outline a couple of minor comments and try to encourage additional data analyses that, in my opinion, would strengthen the evidence provided by the study. 

      (1) As I noted in the public review, I believe an oscillatory analysis of the data would, on one hand, provide stronger support for the behavioral effect of rhythmic perceptual organization given the lack of behavioral direct evidence; and, on the other hand, provide evidence (to be discussed if so) for a role of entrained oscillation phase in explaining the different pattern of omission responses. One analysis the authors could try is to measure the phase angle of an oscillation, the frequency of which relates to the length of the binary pattern, at the onset of short and long tones, separately, and compare it across groups. Also, single trials of omission responses could be sorted according to that phase. 

      Thanks for the suggestion. Please see phase analyses reported above.

      (2) I wonder why source activity for the right IFG was not shown. I urge the authors to provide and discuss a more complete picture of the source activity found. Given the lack of source statistics (which could be performed), I find it a must to give an overall view. I find it so because I believe the distinction between perceptual grouping effects due to inherent acoustic differences across languages or semantic differences is so interesting. 

      Thanks again for the invitation to provide a more complete picture of the source activity data. As mentioned in the response above, we invite the Reviewer to read the new related part included in the Results and Methods sections of the paper. In our updated source reconstruction analysis, we find that some regions around the left STG show a pattern that resembles the one found at the sensor-level, providing further support for the “acoustic” (rather than syntactic/semantic) nature of the effect. 

      We did not report ROI analysis on the right hemisphere because the interaction effect at sensor level emerged on the left hemisphere. Yet, we included a summary of this analysis in the public response above. 

      (3) Related to this, I have to acknowledge I had to read the whole Molnar et al. (2016) study to find the only evidence so far that, acoustically, in terms of sound duration, Basque and Spanish differ. This was hypothesized before but only at Molnar, an acoustic analysis is performed. I think this is key, and the authors should give it a deeper account in their manuscript. I spend my review of this study thinking, well, but when we speak we actually bind together different words and the syllabic structure does not need to reflect the written one, so maybe the effect is due to a high-level statistical prior related to the content of the words... but Molnar showed me that actually, acoustically, there's a difference in accent and duration: "Taken together, Experiments 1a and 1b show that Basque and Spanish exhibit the predicted differences in terms of the position of prosodic prominence in their phonological phrases (Basque: trochaic, Spanish: iambic), even though the acoustic realization of this prominence involves not only intensity in Basque but duration, as well. Spanish, as predicted, only uses duration as a cue to mark phrasal prosody." 

      Thanks for the suggestion, the distinction in terms of sound duration in Spanish and Basque reported by Molnar is indeed very relevant for the current study. 

      We add a few sentences to highlight the acoustic analysis by Molnar and the consequent acoustic nature of the reported effect.

      In the introduction: “Specifically, the effect has been proposed to depend on the quasiperiodic alternation of short and long auditory events in the speech signal – reported in previous acoustic analyses (Molnar et al., 2016) – which reflect the linearization of function words (e.g., articles, prepositions) and content words (e.g., nouns, adjectives, verbs).”

      In the discussion, paragraph 3, we changed “We hypothesized that this effect is linked to a long-term “duration prior” originating from the syntactic function-content word order of language, and specifically, from its acoustic consequences on the prosodic structure” with “We hypothesized that this effect is linked to a long-term “duration prior” originating from the acoustic properties of the two languages, specifically from the alternation of short and long auditory events in their prosody”.

      In the discussion, end of paragraph eight: “The reconstruction of cortical sources associated with the omission of short and long tones in the two groups showed that an interaction effect mirroring the one at the sensor level was present in the left STG, but not in the left IFG (fig. 3, B, C, D). Pairwise comparisons within different ROIs of the left STG indicated that the interaction effect was stronger over primary (BA 41/42) rather than associative (BAs 22) portions of the auditory cortex. Overall, these results suggest that the “duration prior” is linked to the acoustic properties of a given language rather than its syntactic configurations”.

      Now, some minor comments: 

      (1) Where did the experiments take place? Were they in accordance with the Declaration of Helsinki? Did participants give informed consent? 

      All the requested information has been added to the updated version of the manuscript. Thanks for pointing out this.

      (2) The fixed interval should be called inter-stimulus interval. 

      Thanks for pointing this out. We changed the wording as suggested.

      (3) The authors state that "Omission responses allow to examine the presence of putative error signals decoupled from bottom-up sensory input, offering a critical test for predictive coding (Walsh et al 2020, Heilbron and Chait, 2018).". However the way omission responses are computed in their study is by subtracting the activity from the previous tone. This necessarily means that in the omission activity analyzed, there's bottom-up sensory input activity. As performing another experiment with a control condition in which a sequence of randomly presented tones with different durations to compare directly the omission activity in both sequences (experimental and control) is possibly too demanding, I at least urge the authors to incorporate the fact that their omission responses do reflect also tone activity. And consider, for future experiments, the inclusion of further control conditions. 

      Thanks for the opportunity to clarify this aspect. Actually, the way we computed the omission MMN is not by subtracting the activity of the previous tone from the omission, but by subtracting the activity of randomly selected tones across the whole experiment. That is, we randomly selected around 120 long and short tones (i.e., about the same number as the omissions); we computed the ERF for the long and short tones; we subtracted these ERF from the ERF of the corresponding short and long omissions. We clarified these aspects in both the Materials and Methods (ERF analysis paragraph) and Results section.

      Moreover, the subtraction strategy - which is the standard approach to calculate the MMN - allows to handle possible neural carryover effects arising from the perception of the tone preceding the omission.

      The sentence "Omission responses allow to examine the presence of putative error signals decoupled from bottom-up sensory input, offering a critical test for predictive coding (Walsh et al 2020, Heilbron and Chait, 2018)." simply refer to the fact that the error responses resulting from an omission are purely endogenous, as omissions are just absence of an expected input (i.e., silence). On the other hand, when a predicted sequence of tones is disrupted by an auditory deviants (e.g., a tone with a different pitch or duration than the expected one), the resulting error response is not purely endogenous, but it partially includes the response to the acoustic properties of the deviant.

      (4) When multiple clusters emerged from a comparison, only the most significant cluster was reported. Why? 

      We found more than one significant cluster only in the comparison between pure omissions vs tones (figure 2 A, B). The additional significant cluster from this comparison is associated with a P-value of 0.04, emerges slightly earlier in time, and goes in the same direction as the cluster reported in the paper i.e., larger ERF responses for omission vs tones. We added a note specifying the presence of this second cluster, along with a figure on the supplementary material (Supplementary Fig. 1 A, B).

      (5) Fig 2, if ERFs are baseline corrected -50 to 0ms, why do the plots show pre-stimulus amplitudes not centered at 0? 

      This is because we combined the latitudinal and longitudinal gradiometers on the ERF obtained after baseline correction, by computing the root mean square of the signals at each sensor position (see also  https://www.fieldtriptoolbox.org/example/combineplanar_pipelineorder/). This information is reported in the methods part of the article.

      (6) Fig 2, add units to color bars. 

      Sure.

      (7) Fig 2 F and G, put colorbar scale the same for all topographies. 

      Sure, thanks for pointing this out.

      (8) The interaction effect language (Spanish; Basque) X omission type (short; long) appears only in a small cluster of 4 sensors not located at the locations with larger amplitudes to omissions. Authors report it as left frontotemporal, but it seems to me frontocentral with a slight left lateralization.

      (1) the fact that the cluster reflecting the interaction effect does not overlap with the peaks of activity is not surprising in our view. Many sources contribute to the generation of the MMN. The goal of our work was to establish whether there is also evidence for a long-term system (among the many) contributing to this. That is why we perform a first analysis on the whole omission response network (likely including many sources and predictive/attentional systems), and then we zoom in and focus on our hypothesized interaction. We never claim that the main source underlying the omissionMMM is the long-term predictive system. 

      (2) The exact location of those sensors is at the periphery of the left-hemisphere omission response, which mainly reflects activity from the left temporal regions. The sensor location of this cluster could be influenced by multiple factors, including (i) the direction of the source dipoles determining an effect; (ii) the combination of multiple sources contributing to the activity measured at a specific sensor location, whose unmixing could be solved only with a beamforming source approach. Based on the whole evidence we collected also in the source analyzes we concluded that the major contributors to the sensor-level interaction are emerging from both frontal and temporal regions.

      Reviewer #3 (Public Review):

      (1) The main weaknesses are the strength of the effects and generalisability. The sample size is also relatively small by today's standards, with N=20 in each group. Furthermore, the crucial effects are all mostly in the .01>P<.05 range, such as the crucial interaction P=.03. It would be nice to see it replicated in the future, with more participants and other languages. It would also have been nice to see behavioural data that could be correlated with neural data to better understand the real-world consequences of the effect.

      We appreciate the positive feedback from Reviewer #3. We agree that it would be nice to see this study replicated in the future with larger sample sizes and a behavioral counterpart. Below are a few comments concerning the weakness highlighted: 

      (i) Concerning the sample size: a similar point was raised by Reviewer #1. We report our reply as presented above: “Despite a sample size of 20 participants per group can be considered relatively small for detecting an effect in a between-group design, it must be noted that our effect of interest was based on Molnar et al.’s (2016) experiment, where a sample size of 16 subjects per group was sufficient to detect the perceptual grouping effect. In Yoshida et al., 2010, the perceptual grouping effect arose with two groups of 20 7–8-month-old Japanese and English-learning infants. Based on these findings, we believe that a sample size of 20 participants per group can be considered appropriate for the current study”. We clarified these aspects in the new version of the manuscript.

      (ii) We believe that the lack of behavioral data does not undermine the main findings of this study, given the careful selection of the participants and the well-known robustness of the perceptual grouping effect (e.g., Iversen 2008; Yoshida et al., 2010; Molnar et al. 2014; Molnar et al. 2016). As highlighted by Reviewer #2, having Spanish and Basque dominant “speakers as a sample equates that in Molnar et al. (2016), and thus overcomes the lack of direct behavioral evidence for a difference in rhythmic grouping across linguistic groups. Molnar et al. (2016)'s evidence on the behavioral effect is compelling, and the evidence on neural signatures provided by the present study aligns with it”. (iii) Regarding the fact that the “crucial effects are all mostly in the .01>P<.05 range”: we want to stress that the approach we used to detect the interaction effect was conservative, using a cluster-based permutation approach with no a priori assumptions about the location of the effect. The robustness of our approach has also been highlighted by Reviewer 2: “Data analyses. Sound, state-of-the-art methodology in the event-related field analyses at the sensor level.” In sum, despite some crucial effects being in the .01>P<.05 range, we believe that the statistical soundness of our analysis, combined with the lack of effect in the control condition, provides compelling evidence for our H1.

      Reviewer #3 (Recommendations For The Authors):

      Figures - Recommend converting all diagrams and plots to vector images to ensure they remain clear when zoomed in the PDF format. 

      Sure, thanks. 

      Figure 1: To improve clarity, the representation of sound durations in panels C and D should be revisited. The use of quavers/eighth notes can be confusing for those familiar with musical notation, as they imply isochrony. If printed in black and white, colour distinctions may be lost, making it difficult to discern the different durations. A more universal representation, such as spectrograms, might be more effective. 

      Thanks for the suggestion. It’s true that the quavers/eighth notes might be confusing in that respect. However, we find this notation as a relatively standard approach to define paradigms in auditory neuroscience, see for instance the two papers below. In the new version of the manuscript, we specified in the captions under the figure that the notes refer to individual tones, in order to avoid ambiguities.

      - Wacongne, C., Labyt, E., Van Wassenhove, V., Bekinschtein, T., Naccache, L., & Dehaene, S. (2011). Evidence for a hierarchy of predictions and prediction errors in human cortex. Proceedings of the National Academy of Sciences, 108(51), 20754-20759.

      - Dehaene, S., Meyniel, F., Wacongne, C., Wang, L., & Pallier, C. (2015). The neural representation of sequences: from transition probabilities to algebraic patterns and linguistic trees. Neuron, 88(1), 2-19.

      Figure 2 : In panel C of Figure 2, please include the exact p-value for the interaction observed. Refrain from using asterisks or "n.s." and opt for exact p-values throughout for the sake of clarity. 

      Thank you for your suggestion. We have included the exact p-value for the interaction in panel C of Figure 2. However, for the remaining figures, we have chosen to maintain the use of asterisks and "n.s.". We would like our pictures to convey the key findings concisely, while the numerical details can be found in the article text. The caption below the image also provides guidance on the interpretation of the p-values: (statistical significance: **p < 0.01, *p < 0.05, and ns p > 0.05).  

      Figure 3 Note typo "Omission reponse"

      Fixed. Thanks for noticing the typo. 

      A note: we moved the figure reflecting the main effect of long tone omission and the lack of main effect of language background (Figure 4 in the previous manuscript) in the supplementary material (Supplementary Figure 2).

      References

      Bendixen, A., SanMiguel, I., & Schröger, E. (2012). Early electrophysiological indicators for predictive processing in audition: a review. International Journal of Psychophysiology, 83(2), 120-131.

      Heilbron, M., & Chait, M. (2018). Great expectations: is there evidence for predictive coding in auditory cortex?. Neuroscience, 389, 54-73.

      Iversen, J. R., Patel, A. D., & Ohgushi, K. (2008). Perception of rhythmic grouping depends on auditory experience. The Journal of the Acoustical Society of America, 124(4), 22632271.

      Molnar, M., Lallier, M., & Carreiras, M. (2014). The amount of language exposure determines nonlinguistic tone grouping biases in infants from a bilingual environment. Language Learning, 64(s2), 45-64.

      Molnar, M., Carreiras, M., & Gervain, J. (2016). Language dominance shapes non-linguistic rhythmic grouping in bilinguals. Cognition, 152, 150-159.

      Ross, J. M., & Hamm, J. P. (2020). Cortical microcircuit mechanisms of mismatch negativity and its underlying subcomponents. Frontiers in Neural Circuits, 14, 13.

      Simon, J., Balla, V., & Winkler, I. (2019). Temporal boundary of auditory event formation: An electrophysiological marker. International Journal of Psychophysiology, 140, 53-61.

      Studenova, A. A., Forster, C., Engemann, D. A., Hensch, T., Sander, C., Mauche, N., ... & Nikulin, V. V. (2023). Event-related modulation of alpha rhythm explains the auditory P300 evoked response in EEG. bioRxiv, 2023-02.

      Yoshida, K. A., Iversen, J. R., Patel, A. D., Mazuka, R., Nito, H., Gervain, J., & Werker, J. F. (2010). The development of perceptual grouping biases in infancy: A Japanese-English cross-linguistic study. Cognition, 115(2), 356-361.

      Zhang, Y., Yan, F., Wang, L., Wang, Y., Wang, C., Wang, Q., & Huang, L. (2018). Cortical areas associated with mismatch negativity: A connectivity study using propofol anesthesia. Frontiers in Human Neuroscience, 12, 392.

      Ladinig, O., Honing, H., Háden, G., & Winkler, I. (2009). Probing attentive and preattentive emergent meter in adult listeners without extensive music training. Music Perception, 26(4), 377-386. 

      Brochard, R., Abecasis, D., Potter, D., Ragot, R., & Drake, C. (2003). The “ticktock” of our internal clock: Direct brain evidence of subjective accents in isochronous sequences. Psychological Science, 14(4), 362-366.

      Potter, D. D., Fenwick, M., Abecasis, D., & Brochard, R. (2009). Perceiving rhythm where none exists: Event-related potential (ERP) correlates of subjective accenting. Cortex, 45(1), 103-109.

      Bouwer, F. L., Werner, C. M., Knetemann, M., & Honing, H. (2016). Disentangling beat perception from sequential learning and examining the influence of attention and musical abilities on ERP responses to rhythm. Neuropsychologia, 85, 80-90.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Joint Public Review:

      Summary

      This manuscript explores the transcriptomic identities of olfactory ensheathing cells (OECs), glial cells that support life-long axonal growth in olfactory neurons, as they relate to spinal cord injury repair. The authors show that transplantation of cultured, immunopurified rodent OECs at a spinal cord injury site can promote injury-bridging axonal regrowth. They then characterize these OECs using single-cell RNA sequencing, identifying five subtypes and proposing functional roles that include regeneration, wound healing, and cell-cell communication. They identify one progenitor OEC subpopulation and also report several other functionally relevant findings, notably, that OEC marker genes contain mixtures of other glial cell type markers (such as for Schwann cells and astrocytes), and that these cultured OECs produce and secrete Reelin, a regrowth-promoting protein that has been disputed as a gene product of OECs.

      This manuscript offers an extensive, cell-level characterization of OECs, supporting their potential therapeutic value for spinal cord injury and suggesting potential underlying repair mechanisms. The authors use various approaches to validate their findings, providing interesting images that show the overlap between sprouting axons and transplanted OECs, and showing that OEC marker genes identified using single-cell RNA sequencing are present in vivo, in both olfactory bulb tissue and spinal cord after OEC transplantation.

      Despite the breadth of information presented, however, further quantification of results and explanation of experimental approaches would be needed to support some of the authors' claims. Additionally, a more thorough discussion is needed to contextualize their findings relative to previous work.

      (1) a. Important quantification is lacking for the data presented. For example, multiple figures include immunohistochemistry or immunocytochemistry data (Figures 1, 5, 6), but they are presented without accompanying measures like fractions of cells labeled or comparisons against controls.

      We would like to clarify that the immunohistochemistry or immunocytochemistry data presented are meant to be qualitative rather than quantitative. The main purpose of the images is to show the presence or absence of markers of OEC subtypes rather than how much is present. That being said, in the revision we now add quantitative estimates of cell fractions for OECs along with other major cell types in Supplemental Table 1 and each OEC subtype marker in Supplemental Table 2. 

      b. As a result, for axons projecting via OEC bridges in Figure 1, it is unclear how common these bridges are in the presence or absence of OECs.

      We note that the number of spinal cord transected rats with bridges of axons crossing the lesion core are extremely rare following a severe spinal cord injury in adult mammals. Our first example of axon bridging following a complete spinal cord transection followed by OEC transplants was reported in Thornton et al., (2018) and compared to an incomplete transection in a fibroblast-transplanted control in his Figure 4. That figure also appeared the cover of Experimental Neurology when the paper was published. Figure 1 in the current paper was from an independent experiment which replicated the previously observed rare bridge formation. We noted this in the revised manuscript.

      Page 6: “We note, however, that such bridge formation is rare following a severe spinal cord injury in adult mammals.”

      c. For Figure 6., it is unclear whether cells having an alternative OEC morphology coincide with progenitor OEC subtype marker genes to a statistically significant degree. (see top paragraph on page 11)

      Franceschini & Barnett (1996) suggested that there were 2 distinct types of OECs that could be distinguished by their different morphology: one type resembling a Schwann cell and the other, an astrocyte. The purpose of Figure 6 is to determine if there is a link between our OEC subtypes based on scRNAseq with those previously described based on morphology alone (Franceschini and Barnett, 1996). There could be agreement between large, flat or small fusiform OECs morphological and their progenitor status, but it is not required that the two classification types would significantly overlap. Here we report the percentage of morphology-based cell subtypes that show expression of our OEC subtype markers to estimate the overlap between the two. Our results indicate the two types of OEC morphologies share a certain degree of overlap, a finding that indicates similarities as well as differences between the two classification methods.

      In our results section we show that ~3/4ths of the Ki67-expressing OEC progenitor cells sampled were astrocyte-like, i.e., flat in shape and weakly Ngfr<sup>p75</sup>-labeled. The remaining ~1/4th of the Ki67-labeled  OECs were fusiform in shape and expressed Ngfr<sup>p75</sup> strongly. We feel that this is important to include as it is the only previous report of OB-OEC subtypes. The statistics of these results were in our original manuscript on page 11 and we further revise the text as follows:

      Page 12: “To determine if the proliferative OECs differ in appearance from adult OECs, and whether there is concordance between our OEC subtypes based on gene expression markers and previously described morphology-based OEC subtyping (Franceschini & Barnett, 1996), we analyzed OECs identified with the anti-Ki67 nuclear marker and anti-Ngfr<sup>p75</sup>  (Figure 6g-h). Of the Ki67-positive OECs in our cultures, 24% ± 8% were strongly Ngfr<sup>p75</sup>-positive and spindle-shaped, whereas 76% ± 8% were flat and weakly Ngfr<sup>p75</sup>-labeled (n=4 cultures, p\= 0.023). Here we show that a large percentage (~3/4ths) of proliferative OECs are characterized by large, flat morphology and weak Ngfr<sup>p75</sup> expression resembling the previously described morphology-based astrocyte-like subtype. Our results indicate the two types of OEC classifications share a certain degree of overlap, indicating similarities but also differences between the two classification methods.”

      d. Similar quantification is missing in other types of data such as Western blot images (Fig. 9) and OEC marker gene data (for which p-values are not reported; Table S2). 

      Response on Western blots: The Western blot signals shown in Figure 9 are from experiments that were designed to be qualitative rather than quantitative, by addressing the question, “Can we detect Reelin signals or not? in the different samples.” Both Western blots show that Reln<sup>+/+</sup> mouse olfactory bulbs (d) or cortices (e) contain Reelin whereas Reln<sup>-/-</sup>  samples do not and therefore provide positive and negative controls, respectively. The rat olfactory nerve layer (ONL, laminae I-II of olfactory bulb, d lane 1; e lane 3) contains mainly OECs wrapped around the axons of the olfactory sensory neurons that transmit olfactory signals into the olfactory bulb. To address your request for quantification, Dr. Khankan measured the density of the three isoforms of Reelin, 400 kD, 300 kD and 180 kD in Fig. 9e and normalized them against the GADPH control (37 kD). The graph below shows the normalized band density in arbitrary units on the Y-axis relative to the first 3 conditions, i.e., Reln<sup>+/+</sup> and Reln<sup>-/-</sup> mouse cerebral cortices and rat  Reln<sup>+/+</sup> ONL. Because the conditioned medium was collected from tissue culture medium rather than cells or tissue, the GAPDH control was not present and therefore these data cannot be normalized in a similar analysis.  

      Author response image 1.

      Response for OEC marker gene data: We now add new full supplementary Table S1 (for major cell types) and Table S2 (for OEC subtypes) to report statistical p values and adjusted p values, as well as additional statistics information including percent cell expressing a subtype marker in a given subtype versus in other subtypes. 

      e. The addition of quantitative measures and, where appropriate, statistical comparisons with p-values or other significance measures, would be important for supporting the authors' claims and more rigorously conveying the results.

      As detailed in the above responses, we now add quantifications and statistics to support the claims and enhance the rigor of our analysis.

      (2) a. Some aspects of the experimental design that are relevant to the interpretation of the results are not explained. For example, OECs appear to be collected from only female rats, but the potential implications of this factor are not discussed.

      We added a short explanation in the Discussion and Methods section regarding why spinal cord injury studies are carried out on female rats.

      Page 24, Discussion: “Due to the extensive urinary tract dysfunction in spinal cord transected rats, most studies prefer females as their short urethra facilitates daily manual bladder expression. Our study, therefore, was carried out only on adult female rats, so sex differences and the generalizability of our findings to adult male rats would require further investigation.”

      Page 26, Methods: “Only females were used in order to match the sex of previous SCI studies conducted exclusively on female rats (Dixie, 2019; Khankan et al., 2016; Takeoka et al., 2011; Thornton et al., 2018). Following complete thoracic spinal cord transection, an adult rat is unable to urinate voluntarily and therefore urine must be manually “expressed” twice a day throughout the experiment. Females have a shorter urethra than males, and thus their bladders are easier to empty completely.”

      b. Additionally, it is unclear from the manuscript to what degree immunopurified cells are OECs as opposed to other cell types. The antibody used to retain OECs, nerve growth factor receptor p75 (Ngfr-p75), can also be expressed by non-OEC olfactory bulb cell types including astrocytes [1-3]. The possible inclusion of Ngfr-p75-positive but non-OEC cell types in the OEC culture is not sufficiently addressed.

      (a) Cragnolini, A.B. et al., Glia, (2009), doi: 10.1002/glia.20857.

      (b) Vickland H. et al., Brain Res., (1991), doi: 10.1016/0006-8993(91)91659-O.

      (c) Ung K. et al., Nat Commun., (2021), doi: 10.1038/s41467-021-25444-3.

      Our OECs are dissected primarily from the olfactory nerve layer that is concentrated medially and ventrally around the olfactory bulb together with a small part of the glomerular layer (layer II). OECs are the only glia present in olfactory nerve layer. Thus, although it is possible that other cell types also express Ngfr-p75 as pointed out by the reviewer and in the references provided, our OEC dissection method severely limits the number of astrocytes that might be included in our cultures. We further provide additional evidence (see updated Figure 2d and the detailed responses to the next question) that our immunopanned OECs using our dissection method consistently express all classic OEC markers but do not consistently express the majority of classic markers for other glial cell types such as astrocytes or oligodendrocytes.

      Such non-OEC cell types are also not distinguished in the analysis of single-cell RNA sequencing data (only microglia, fibroblasts, and OECs are identified; Figure 2). Thus, it is currently unclear whether results related to the OEC subtype may have been impacted by these experimental factors.

      We need to clarify that when determining potential cell types in Figure 2, we compared our cell cluster marker genes against a broad array of cell types including astrocytes, oligodendrocytes and Schwann cells, but the gene overlap was only significant for microglia, fibroblasts, and OECs, which we labeled in new Figure 2d. We added more details in methods and results to clarify how we determined the cell types in Figure 2 (text added below). We did consider all the potential cell types that could have been present in our OEC cultures, including astrocytes. However, astrocyte or oligodendrocyte markers were not significantly enriched in the clusters, but markers for microglia, fibroblasts, and OECs were prominent in the cell clusters.

      In the revised Figure 2d, we now illustrate that the OEC clusters not only express typical OEC markers, but also express a few but not all marker genes from other glial cells. We show the comparative data on markers for astrocytes, oligodendrocytes, and Schwann cells in Figure 2d in parallel with the marker genes for OECs, microglia, and fibroblasts. For each of the other glial cell types, there are some genes which overlap with OECs, and that is the reason why we identified OECs as hybrid glia.

      Page 6, Results: “Based on previously reported cell type marker genes for fibroblasts and major glial cell types including OECs, astrocytes, oligodendrocytes, and microglia, we found elevated expression of OEC marker genes in clusters 2, 3 and 7, microglia marker genes in clusters 4, 6, and 7, and fibroblast marker genes in clusters 0, 1, and 5 (Figure 2d).”

      Page 33, Methods: “Additional marker genes for fibroblasts and multiple glial cell types including astrocytes, oligodendrocytes, and microglia were also used to compare with those of the cell clusters.”

      (3) The introduction, while well written, does not discuss studies showing no significant effect of OEC implantation after spinal cord injury. The discussion also fails to sufficiently acknowledge this variability in the efficacy of OEC implantation. This omission amplifies bias in the text, suggesting that OECs have significant effects that are not fully reflected in the literature. The introduction would need to be expanded to properly address the nuance suggested by the literature regarding the benefits of OECs after spinal cord injury. Additionally, in the discussion, relating the current study to previous work would help clarify how varying observations may relate to experimental or biological factors.

      We appreciate the insightful comment and have now included information about the variability in OEC transplantation in previous studies in both the introduction and discussion sections. We discuss technical differences that lead to variability in the Introduction and how our findings could help interpret the variability in the Discussion.

      Page 4-5: Text added to the Introduction: “The outcomes of OEC transplantation studies after spinal cord injury vary substantially in the literature due to many technical differences between their experimental designs. The source of OECs has a great impact on the outcome, with OB-OECs showing more promise than peripheral lamina propria-derived OECs, and purified, freshly-prepared OECs being required for optimal OEC survival. Other important variables include the severity of the injury (hemisection to complete spinal cord transection), the age of the spinal cord injured host (early postnatal versus adult), and OEC transplant strategies (delayed or acute transplantation, cell transplants with or without a matrix; Franssen et al., 2007). Franssen et al. (2007) evaluated studies that used only OECs as a transplant, and reported that 41 out of 56 studies showed positive effects, such as OEC stimulation of regeneration, positive interactions with the glial scar and remyelination of axons. More recent systematic reviews and meta-analyses on the effects of OEC transplantation following different spinal cord injury models reported that OECs significantly improved locomotor function (Watzlawick et al.2016; Nakjavan-Shahraki et al., 2018), but did not improve neuropathic pain (Nakjavan-Shahraki et al., 2018.)”

      Pages 24-25: Discussion on OEC source variability  “Extensive differences between OEC preparations contribute to the large variation in results from OEC treatments following spinal cord injury. This scRNA-seq study focused entirely on OB-OECs, and the next step would be to carry out similar studies on the peripheral, lamina-propria-derived OECs to discern the differences between these OEC populations. Such comparative studies using scRNA-seq will help define the underlying mechanisms and help resolve the variability in results from OEC-based therapy. Detailed studies of the composition of different OEC transplant types will contribute to identifying the most reparative cell transplantation treatments.”

      Reviewer #1 (Recommendations For The Authors):

      This is an extremely well-written and impactful series of experiments from a renowned leader in the field. The experimental questions are timely, with similar therapeutic approaches being prepared for clinical trial. The results address a gap that has persisted in the field for several decades and one that has been considered by many scientists long before technology existed to find answers. This highlights the importance of these experiments and the results reported here. With these things in mind, there are only a few minor factors that I have, that should be addressed to strengthen the paper.

      We truly appreciate the positive evaluations from the reviewer!

      Primary concerns

      (1) Quantification of results: The authors report on the data with broad brush strokes, missing the opportunity to quantify results and strengthen the interpretations. For instance, when describing gene expression, what proportion of cells analyzed were expressing these genes? How did this compare with detectable levels of protein? Can the author draw correlations between data sets collected that could offer even more insight into the identities of the cells studied? There is also a missed opportunity to evaluate how transplantation into injured neural tissue might alter gene expression of the phenotypes identified prior to transplantation.

      We appreciate these insightful comments and have added quantitative information and other relevant discussions in the revision. We now add Suppl Tables 1 (for major cell types including OECs, fibroblast, and microglia) and 2 (for OEC subtypes) to indicate the proportion of cells expressing each marker gene in each given cell cluster/subtype in the column. “Percentage of cells expressing the gene in the subtype/cell type” versus the proportion of cells expression the given marker genes in other cell types in the column “Percentage of cells expressing the gene in the other subtypes/cell types.” In the new supplementary tables, we report statistical p values and adjusted p values after multiple testing correction to indicate statistical significance.

      Regarding the comparison with protein levels, we carried out immunohistochemistry experiments to confirm the proteins corresponding to OEC subtype markers. Our findings show that proteins for the gene markers can be detected, and thereby supports our sc-seq findings. However, the immunofluorescence only provides a qualitative measure of protein levels in situ, so we cannot perform a correlation analysis. This is something we plan to  pursue in a follow-up study with measurable protein levels. We also discuss future directions to examine the genes and proteins in in vivo transplantation studies in the Discussion.

      (2) Discussion and interpretation: Greater depth to interpretation and discussion of data and its impact on future work is needed. For example, on pages 20-21, the authors reflect briefly on why Reelin might be of interest (it could lead to Dab-1 expression), but why is that important? There are several instances like this where it would be useful for the authors to provide a little more insight into the potential impact of these data and interpretations.

      We appreciate these valuable suggestions. We have revised our Results and Discussion sections to offer deeper insight and interpretation of the importance of the data, especially that for Reelin.

      Page 17: Results: “In the canonical Reelin-signaling pathway, Reelin binds to the very-low-density lipoprotein receptor (Vldlr) and apolipoprotein E receptor 2 (ApoER2) and induces Src-mediated tyrosine phosphorylation of the intracellular adaptor protein Disabled-1 (Dab1). Both Reelin and Dab1 are highly expressed in embryos and contribute to correct neuronal positioning.”

      Page 22-23, Discussion: “Reelin is a developmentally expressed protein detected in specific neurons, in addition to OECs and Schwann cells. The canonical Reelin-signaling pathway involves neuronal-secreted Reelin binding to Vldlr and ApoER2 receptors expressed on Dab1-labeled neurons. Following Reelin binding, Dab1 is phosphorylated by Src family kinases which initiates multiple downstream pathways. Very little is known, however, about Reelin secreted by glia. Panteri et al. (2006) reported that Schwann cells express low levels of Reelin in adults, and that it is upregulated following a peripheral nerve crush, as is reported above for many neurotrophic factors. Reelin loss in Schwann cells reduced the diameter of small myelinated axons but did not affect unmyelinated axons (Panteri et al., 2005). In the olfactory system, OECs ensheath the Dab1-labeled, unmyelinated axons of olfactory sensory neurons which are continuously generated and die throughout life. OEC transplantation following spinal cord injury would provide an exogenous source of Reelin that could phosphorylate Dab1-containing neurons or their axons. Dab1 is expressed at high levels in the axons of some projection neurons, such as the corticospinal pathway (Abadesco et al., 2014). Future experiments are needed to explore the function that glial-secreted Reelin may have on axonal regeneration.”

      Minor concerns

      (3) The authors reflect on the spontaneous glial bridge that develops in the repairing spinal cord of Zebrafish, but perhaps even more relevant is that this same phenomenon occurs in mammals as well if the spinal cord is injured during early development (opossum; Lane et al, EJN 2007). This should be considered and the statement that there is little regeneration in the mammalian spinal cord should be clarified.

      We appreciate this insightful comment. We now add discussions of the axonal regeneration and bridging observed following severe spinal cord injury in young developing mouse and opossum spinal cords.

      Page 23: “Adult mammals show little evidence of spontaneous axonal regeneration after a severe spinal cord injury in contrast to transected neonatal rats (Bregman, 1987; Bregman et al., 1993) and young postnatal opossums (Lane et al., 2007). In immature mammals, axons continue to project across or bridge the spinal cord transection site during development. Lower organisms such as fish, show even more evidence of regeneration following severe SCI. Mokalled et al. (2016) reported that glial secretion of Ctgfa/Ccn2 was both necessary and sufficient to stimulate a glial bridge for axon regeneration across the zebrafish transection site. Cells in the injury site that express Ctgf include ependymal cells, endothelial cells, and reactive astrocytes (Conrad et al., 2005; Mokalled et al., 2016; Schwab et al., 2001). Here we show that, although rare, Ctgf-positive OECs can contribute to glial bridge formation in adult rats. The most consistent finding among our severe SCI studies combined with OEC transplantation is the extent of remodeling of the injury site and axons growing into the inhibitory lesion site, together with OECs and astrocytes. The formation of a glial bridge across the injury was critical to the spontaneous axon generation seen in zebrafish (Mokalled et al., 2016) and likely contributed to the axon regeneration detected in our OEC transplanted, transected rats (Dixie, 2019; Khankan et al., 2016; Takeoka et al., 2011; Thornton et al., 2018).

      Reviewer #2 (Recommendations For The Authors):

      (1) The manuscript title and abstract must include the species and sex studied.

      The title and abstract have been modified as suggested.

      Page 1: “Olfactory ensheathing cells from adult female rats are hybrid glia that promote neural repair”

      (2) OECs submitted for sequencing were like those about to be transplanted; however, the phenotype of the cells would likely change immediately and shift over time post-implantation. Please briefly address or discuss this point in the Discussion (or Results).

      We have added this important discussion point.

      Pages 23-24: Discussion: “We recognize that this study is a single snapshot of OEC gene expression derived from adult female rats before they are transplanted above and below the spinal cord transection site. We would expect the gene expression of transplanted OECs to change in each new environment, i.e. as they migrate into the injury site, integrate into the glial scar, and wrap around axons. Based on our past studies, OECs survived in an outbred Sprague-Dawley rat model for ~ 4 weeks (Khankan et al., 2016) and in an inbred Fischer 344 model for 5 months (Dixie, 2019). As spinal cord injury transplant procedures are further enhanced and OEC survival improves, these hybrid glial cells should be examined at multiple time points to better evaluate their proregenerative characteristics.”

      (3) Page 12: Use of "monocytes" - the word "monocyte" implies a circulating, undifferentiated innate immune cell. This should not be used interchangeably with macrophage or microglia.

      We agree and now refer to microglia or macrophages depending on the context. We did leave the term monocyte in Table 3 if these cells were found in a top 20 gene reported in the references.

      (4) Page 12: "We now show that these unique monocytes reported between the bundles of olfactory axons surrounded by OECs (Smithson & Kawaja, 2010), are in fact, a distinct subtype of OECs."

      Is it possible to conclude that these cells are a "distinct subtype of OECs?" Perhaps these cells are a hybrid between microglia/macrophages and OECs? This is speculative, so should be worded more carefully - especially in the Results section. Please clarify, dampen conclusions, and/or better justify the wording here.

      We agree and have modified the entire paragraph to dampen and more carefully explain our conclusions. We also added an additional observation that strengthens the relationship between OECs and microglial/macrophages.  

      Page 12, Results: Additional observation: “In fact, all top 20 genes in cluster 3 are expressed in microglia, macrophages, and/or monocytes (Suppl. Table 3).”

      Page 13, Results: The statement referenced in your review was deleted and we wrote the following: “Smithson and Kawaja (2010) identified unique microglial/macrophages that immunolabeled with Iba-1 (Aif1) and Annexin A3 (Anxa3) in the olfactory nerve and outer nerve layer of the olfactory bulb. These authors proposed that Iba1-Anxa3 double-labeled cells were a distinct population of microglia/macrophages that protected the olfactory system against viral invasion into the cranial cavity. Based on our scRNA-seq data we offer an alternative interpretation that at least some of these Iba-1-Anxa3 cells may be a hybrid OEC-microglial cell type. Supporting this interpretation, there are a number of reports that suggest OECs frequently function as phagocytes (e.g., Khankan et al., 2016; Nazareth et al., 2020; Su et al. 2013).”

      (5) Page 13: "Pseudotime trajectory analysis, a widely used approach to predict cell plasticity and lineages based on scRNA-seq data, suggests that there are potential transitions between specific OEC subclusters." This is interesting but is somewhat unclear. Please add one more sentence to aid the reader's understanding regarding how this analysis is performed.

      Thank you for your valuable feedback. We have revised the text for clarity as follows:

      Page 14, Results: “We performed pseudotime trajectory analysis using the Slingshot algorithm to infer lineage trajectories, cell plasticity and lineages by ordering cells in pseudotime based on their transcriptional progression reflected in scRNA-seq data. Transcriptional progression refers to the changes in gene expression profiles of cells as they undergo differentiation or transition through different states. The trajectory analysis results suggest that there are potential transitions between specific OEC subclusters.”

      (6) The authors could discuss potential reasons for variability in OEC treatment results after spinal cord injury between studies and labs. How might sequencing results here inform the debate about whether OECs are helpful or not?

      In response to the Public Review, we added discussions about the variability in OEC treatments between studies in both the Introduction and Discussion, and these comments are copied on pages 6-7 of this document. In the Discussion we included a statement about how the current findings may inform the debate on OECs.

      (7) Discussion: please add a discussion of limitations and future directions that addresses the following points:

      a) Please add one sentence on the lack of studying sex differences - only females were studied here.

      b) There is no correlation or modulation of any target genes, so all results here are correlative.

      c) Please add a brief paragraph with future directions for the field, including acknowledgment that the role of OECs in repair after SCI is not fully resolved and that future studies might consider targeting some of the specific pathways described herein.

      d) Which pathways and OEC subpopulations likely best support repair, and how might these be reinforced or better maintained in the SCI environment? If not known, what are the next steps for identifying the most reparative OEC subtype?

      Thank you for the valuable suggestions. We have added these to the discussion as detailed below.

      Pages 23-25, Discussion:

      “Limitations of these OEC scRNA-Seq studies”

      “We recognize that this study is a single snapshot of OEC gene expression derived from adult female rats before they are transplanted above and below the spinal cord transection site. We would expect the gene expression of transplanted OECs to change in each new environment, i.e. as they migrate into the injury site, integrate into the glial scar, and wrap around axons. Based on our past studies, OECs survived in an outbred Sprague-Dawley rat model for ~ 4 weeks (Khankan et al., 2016) and in an inbred Fischer 344 model for 5 months (Dixie, 2019). As spinal cord injury transplant procedures are further enhanced and OEC survival improves, these hybrid glial cells should be examined at multiple time points to better evaluate their proregenerative characteristics.”

      “Due to the extensive urinary tract dysfunction in spinal cord transected rats, most studies are conducted on females as their short urethra facilitates daily manual bladder expression. Our study was carried out only on adult female rats, so sex differences and the generalizability of our findings to adult male rats would require further investigation. We also did not modulate any of the genes or proteins in the identified OEC subtypes to test their causal and functional roles, thus our findings remain correlative in the current study. Future gene/protein modulation studies are necessary to understand the functional roles of the individual OEC subtypes in the context of their reparative functions to determine which pathways and subtypes are more critical and can be enhanced for neural repair. Our current findings build the foundation for these future studies to help resolve the role of OECs in spinal cord injury repair.” 

      “Extensive differences between OEC preparations contribute to the large variation in results from OEC treatments following spinal cord injury. This scRNA-seq study focused entirely on OB-OECs, and the next step would be to carry out similar studies on the peripheral, lamina-propria-derived OECs to discern the differences between the two OEC populations. Such comparative studies using scRNA-seq will help define the underlying mechanisms and resolve the variability in results from OEC-based therapy. Detailed studies of the composition of different OEC transplant types will contribute to identifying the most reparative cell transplantation treatments.”

      (8) Figure 6: What is the major point of this figure and its related immunocytochemistry? Please clarify.

      Franceschini & Barnett (1996) suggested that there were 2 distinct types of OECs that could be distinguished by their different morphology: One type resembling a Schwann cell and the other, an astrocyte. The purpose of Figure 6 is to determine if there is a link between our scRNA-seq-based OEC subtypes with those previously described based on morphology alone (Franceschini and Barnett, 1996). In our results section we show that ~3/4ths of the OECs sampled that were Ki67+ progenitor cells and were astrocyte-like, i.e., flat in shape and weakly Ngfr<sup>p75</sup>-labeled. The remainder were Schwann cell-like, fusiform in shape and strongly Ngfr<sup>p75</sup>-labeled. Our results indicate the two types of OEC classifications share certain degrees of overlap, indicating similarities but also differences between the different classification methods.

      (9) Figure 9, caption: "OEC whole cell lysates (WCL; lanes: 4, 6, and 8), and OEC conditioned medium (CM; lanes: 5 and 7)."  This statement is unclear - please clarify the result here.

      We added clarification to the legend for Figure 9d. 

      Page 50: (d) “Western blot confirms the expression of Reelin in rat olfactory nerve layer I and layer II (ONL; lane 1 of western blot). Reln<sup>+/+</sup> and Reln<sup>-/-</sup> mouse olfactory bulbs were used as positive and negative controls, respectively (lanes: 2 and 3). Reelin that was synthesized by cultured OECs was found in whole cell lysates (WCL; lanes: 4, 6, and 8), whereas Reelin that was secreted by cultured OECs into tissue culture medium was measured in the OEC “conditioned medium” (CM; lanes: 5 and 7). GAPDH was the loading control for tissue homogenates (lanes 1-4, 6, 8).”

      (10) Methods: A Cat. No. for all antibodies and key supplies should be included.

      Response: All of the antibody information in the revised version is in Suppl. Table 4. Information for other key supplies is included in the extensive methods section.

      (11) Methods: How was primary antibody specificity validated for less-used antibodies? Background staining can be a major issue after SCI; e.g., with the CTGF antibody used in Figure 5.

      The spinal cord section shown in Figure 5 was compared to sections from the same SCI cohort that had been injected with control cells, i.e. skin fibroblasts. We have used the first two antibodies (anti-Glial fibrillary acidic protein and anti-Green fluorescent protein) for many years so only the CTGF was a “less-used antibody.” Our strategy for working with “less-used” or “newly-purchased” antibodies was as follows.

      First, we studied the literature to find the best antibodies for neuronal tissue. Many of the images in Figure 7 were generated with antibodies purchased just for this study. Our goal was to characterize them on normal adult lamina propria and olfactory bulb tissues rather than in the injured spinal cord where background can be an issue. In the olfactory bulb we examined the olfactory nerve layer where OECs are concentrated and then examined the olfactory epithelium, lamina propria, and the deep layers of the olfactory bulb to find regions without immunolabel. As described above, we tested anti-CTGF antibodies in SCI sections implanted with skin fibroblasts controls when conducting experiments for CTGF in sections with OECs. New antibodies were tested at multiple concentrations and we tried different immunocytochemical techniques. Anti-CTFG is expressed by several different cell types, but expression is low in most of the areas above and below the injury site. Despite our success with many “newly-purchased” antibodies there were at least 4 of them that we were never able obtain specific labeling. 

      (12) Will the data (especially the sequencing data) be shared publicly?

      The data has been uploaded to and shared via the public data repository GEO. Data availability is stated on the title page of this manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study provides important evidence supporting the ability of a new type of neuroimaging, OPM-MEG system, to measure beta-band oscillation in sensorimotor tasks on 2-14 years old children and to demonstrate the corresponding development changes, since neuroimaging methods with high spatiotemporal resolution that could be used on small children are quite limited. The evidence supporting the conclusion is solid but lacks clarifications about the much-discussed advantages of OPM-MEG system (e.g., motion tolerance), control analyses (e.g., trial number), and rationale for using sensorimotor tasks. This work will be of interest to the neuroimaging and developmental science communities.

      We thank the editors and reviewers for their time and comments on our manuscript. We have responded in detail to the comments, on a point-by-point basis, below. Included in our responses (and our revised manuscript) are additional analyses to control for trial count, clarification of the advantages of OPM-MEG, and justification of our use of sensory (as distinct from motor) stimulation. In what follows, our responses are in bold typeface; additions to our manuscript are in bold italic typeface. 

      Reviewer #1 (Public Review):

      Summary:

      Compared with conventional SQUID-MEG, OPM-MEG offers theoretical advantages of sensor configurability (that is, sizing to suit the head size) and motion tolerance (the sensors are intrinsically in the head reference frame). This study purports to be the first to experimentally demonstrate these advantages in a developmental study from age 2 to age 34. In short, while the theoretical advantages of OPM-MEG are attractive - both in terms of young child sensitivity and in terms of motion tolerance - neither was in fact demonstrated in this manuscript. We are left with a replication of SQUID-MEG observations, which certainly establishes OPM-MEG as "substantially equivalent" to conventional technology but misses the opportunity to empirically demonstrate the much-discussed theoretical advantages/opportunities.

      Thank you for reviewing our manuscript. We agree that our results demonstrate substantial equivalence with conventional MEG. However, as mentioned by Reviewer 3, most past studies have “focused on older children and adolescents (e.g., 9-15 years old)” whereas our youngest group is 25 years. We believe that by obtaining data of sufficient quality in these age groups, without the need for any restriction of head movement, we have demonstrated the advantage of OPM-MEG. We now have made this clear in our discussion:

      “…our primary aim was to test the feasibility of OPM-MEG for neurodevelopmental studies. Our results demonstrate we were able to scan children down to age 2 years, measuring high-fidelity electrophysiological signals and characterising the neurodevelopmental trajectory of beta oscillations. The fact that we were able to complete this study demonstrates the advantages of OPM-MEG over conventional-MEG, the latter being challenging to deploy across such a large age range…”

      Strengths:

      A replication of SQUID-MEG observations, which certainly establishes OPM-MEG as "substantially equivalent" to conventional technology but misses the opportunity to empirically demonstrate the much-discussed theoretical advantages/opportunities.

      As noted above the demonstration of equivalence was one of our primary aims. We have elaborated further on the advantages below.

      Weaknesses:

      The authors describe 64 tri-axial detectors, which they refer to as 192 channels. This is in keeping with some of the SQUID-MEG description, but possibly somewhat disingenuous. For the scientific literature, perhaps "64 tri-axial detectors" is a more parsimonious description.

      The number of channels in a MEG system refers to the number of independent measurements of magnetic field. This, in turn, tells us the number of degrees of freedom in the data that can be exploited by algorithms like signal space separation or beamforming. E.g. the MEGIN (cryogenic) MEG system has 306 channels, 102 magnetometers and 204 planar gradiometers. Sensors are constructed as “triple sensor elements” with one magnetometer and 2 gradiometers (in orthogonal orientations) centred on a single location. In our system, each sensor has three orthogonal metrics of magnetic field which are (by definition) independent. We have 64 such sensors, and therefore 192 independent channels – indeed when implementing algorithms like SSS we have shown we can exploit this number of degrees of freedom.1 192 channels is therefore an accurate description of the system.

      A small fraction (<20%) of trials were eliminated for analysis because of "excess interference" - this warrants further elaboration.

      We agree that this is an important point. We now state in our methods section:

      “…Automatic trial rejection was implemented with trials containing abnormally high variance (exceeding 3 standard deviations from the mean) removed. All experimental trials were also inspected visually by an experienced MEG scientist, to exclude trials with large spikes/drifts that were missed by the automatic approach. In the adult group, there was a significant overlap between automatically and manually detected bad trials (0.7+-1.6 trials were only detected manually). In the children 10.0 +-9.4 trials were only detected manually)…”

      We also note that the other reviewers and editor questioned whether the higher rejection rate in children had any bearing on results. This is an extremely important question. In revising the manuscript this has also been taken into account with all data reanalysed with equal trial counts in children and adults. Results are presented in Supplementary Information Section 5.

      Figure 3 shows a reduced beta ERD in the youngest children. Although the authors claim that OPMMEG would be similarly sensitive for all ages and that SQUID-MEG would be relatively insensitive to young children, one trivial counterargument that needs to be addressed is that OPM has NOT in fact increased the sensitivity to young child ERD. This can possibly be addressed by analogous experiments using a SQUID-based system. An alternative would be to demonstrate similar sensitivity across ages using OPM to a brain measure such as evoked response amplitude. In short, how does Figure 3 demonstrate the (theoretical) sensitivity advantage of OPM MEG in small heads ?

      We completely understand the referees’ point – indeed the question of whether a neuromagnetic effect really changes with age, or apparently changes due to a drop in sensitivity (caused by reduced head size or - in conventional MEG and fMRI - increased subject movement) is a question that can be raised in all neurodevelopmental studies.

      Our authors have many years’ experience conducting studies using conventional MEG (including in neurodevelopment) and agreed that the idea of scanning subjects down to age two in conventional MEG would not be practical; their heads are too small and they typically fail to tolerate an environment where they are forced to remain still for long periods. Even if we tried a comparative study using conventional MEG, the likely data exclusion rate would be so high that the study would be confounded. This is why most conventional MEG studies only scan older children and adolescents. For this reason, we cannot undertake the comparative study the reviewer suggests. There are however two reasons why we believe sensitivity is not driving the neurodevelopmental effects that we observe:

      Proximity of sensors to the head: 

      For an ideal wearable MEG system, the distance between the sensors and the scalp surface (sensor proximity) would be the same regardless of age (and size), ensuring maximum sensitivity in all subjects. To test how our system performed in this regard, we undertook analyses to compute scalp-to-sensor distances. This was done in two ways:

      (1) Real distances in our adaptable system: We took the co-registered OPM sensor locations and computed the Euclidean distance from the centre of the sensitive volume (i.e. the centre of the vapour cell) to the closest point on the scalp surface. This was measured independently for all sensors, and an average across sensors calculated. We repeated this for all participants (recall participants wore helmets of varying size and this adaptability should help minimise any relationship between sensor proximity and age).

      (2) Simulated distances for a non-adaptable system: Here, the aim was to see how proximity might have changed with age, had only a single helmet size been used. We first identified the single example subject with the largest head (scanned wearing the largest helmet) and extracted the scalpto-sensor distances as above. For all other subjects, we used a rigid body transform to co-register their brain to that of the example subject (placing their head (virtually) inside the largest helmet). Proximity was then calculated as above and an average across sensors calculated. This was repeated for all participants.

      In both analyses, sensor proximity was plotted against age and significant relationships probed using Pearson correlation. 

      In addition, we also wanted to probe the relation between sensor proximity and head circumference. Head circumference was estimated by binarising the whole head MRI (to delineate volume of the head), and the axial slice with the largest circumference around was selected. We then plotted sensor proximity versus head circumference, for both the real (adaptive) and simulated (nonadaptive) case (expecting a negative relationship – i.e. larger heads mean closer sensor proximity). The slope of the relationship was measured and we used a permutation test to determine whether the use of adaptable helmets significantly lowered the identified slope (i.e. do adaptable helmets significantly improve sensor proximity in those with smaller head circumference).

      Results are shown in Figure R1. We found no measurable relationship between sensor proximity and age (r = -0.195; p = 0.171) in the case of the real helmets (panel A). When simulating a non-adaptable helmet, we did see a significant effect of age on scalp-to-sensor distance (r = -0.46; p = 0.001; panel B). This demonstrates the advantage of the adaptability of OPM-MEG; without the ability to flexibly locate sensors, we would have a significant confound of sensor proximity. 

      Plotting sensor proximity against head circumference we found a significant negative relationship in both cases (r = -0.37; p = 0.007 and  r = -0.78; p = 0.000001); however, the difference between slopes was significant according to a permutation test (p < 0.025) suggesting that adaptable has indeed improved sensor proximity in those with smaller head circumference. This again shows the benefits of adaptability to head size.

      Author response image 1.

      Scalp-to-sensor distance as a function of age (A/B) and head circumference (C/D). A and C show the case for the real helmets; B and D show the simulated non-adaptable case.

      In sum, the ideal wearable system would see sensors located on the scalp surface, to get as close as possible to the brain in all subjects. Our system of multiple helmet sizes is not perfect in this regard (there is still a significant relationship between proximity and head circumference). However, our solution has offered a significant improvement over a (simulated) non-adaptable system. Future systems should aim to improve even further on this, either by using additively manufactured bespoke helmets for every subject (this is a gold standard, but also costly for large studies), or potentially adaptable flexible helmets.

      Burst amplitudes:

      The reviewer suggested to “demonstrate similar sensitivity across ages using OPM to a brain measure”. We decided not to use the evoked response amplitude (as suggested), since this would be expected to change with age. Instead, we used the amplitude of the bursts.

      Our manuscript shows a significant correlation between beta modulation and burst probability – implying that the stimulus-related drop in beta amplitude occurs because bursts are less likely to occur. Further, we showed significant age-related changes in both beta amplitude and burst probability leading to a conclusion that the age dependence of beta modulation was caused by changes in the likelihood of bursts (i.e. bursts are less likely to ’switch off’ during sensory stimulation in children). We have now extended these analyses to test whether burst amplitude also changes significantly with age – we reasoned that if burst amplitude remained the same in children and adults, this would not only suggest that beta modulation is driven by burst probability (distinct from burst amplitude), but also show directly that the beta effects we see are not attributable to a lack of sensitivity in younger people. 

      We took the (unnormalized) beamformer projected electrophysiological time series from sensorimotor cortex and filtered it 5-48 Hz (the motivation for the large band was because bursts are known to be pan-spectral and have lower frequency content in children; this band captures most of the range of burst frequencies highlighted in our spectra). We then extracted the timings of the bursts, and for each burst took the maximum projected signal amplitude. These values were averaged across all bursts in an individual subject, and plotted for all subjects against age.

      Author response image 2.

      Beta burst amplitude as a function of age; A) shows index finger simulation trials; B shows little finger stimulation trials. In both case there was no significant modulation of burst amplitude with age.

      Results (see Figure R2) showed that the amplitude of the beta burst showed no significant age-related modulation (R2 = 0.01, p = 0.48 for index finger and R2 = 0.01, p = 0.57 for the little finger). This is distinct from both burst probability and task induced beta modulation. This adds weight to the argument that the diminished beta modulation in children is not caused by a lack of sensitivity to the MEG signal and supports our conclusion that burst probability is the primary driver of the agerelated changes in beta oscillations.

      Both of the above analyses have been added to our supplementary information and mentioned in the main manuscript. The first shows no confound of sensor proximity to the scalp with age in our study. The second shows that the bursts underlying the beta signal are not significantly lower amplitude in children – which we reasoned they would be if sensitivity was diminished at younger ages. We believe that the two together suggest that we have mitigated a sensitivity confound in our study.

      The data do not make a compelling case for the motion tolerance of OPM-MEG. Although an apparent advantage of a wearable system, an empirical demonstration is still lacking. How was motion tracked in these participants?

      We agree that this was a limitation of our experiment. 

      We have the equipment to track motion of the head during an experiment, using IR retroreflective markers placed on the helmet and a set of IR cameras located inside the MSR. However, the process takes a long time to set up, it lacks robustness, and would have required an additional computer (the one we typically use was already running the somatosensory stimulus and video). When the study was designed, we were concerned that the increased set up time for motion tracking would cause children to get bored, and result in increased participant drop out. For this reason we decided not to capture motion of the head during this study.

      With hindsight this was a limitation which – as the reviewer states – makes us unable to prove that motion robustness was a significant advantage for this study. That said, during scanning there was both a parent and an experimenter in the room for all of the children scanned, and anecdotally we can say that children tended to move their head during scans – usually to talk to the parent. Whilst this cannot be quantified (and is therefore unsatisfactory) we thought it worth mentioning in our discussion, which reads:

      “…One limitation of the current study is that practical limitations prevented us from quantitatively tracking the extent to which children (and adults) moved their head during a scan. Anecdotally however, experimenters present in the room during scans reported several instances where children moved, for example to speak to their parents who were also in the room. Such levels of movement could not be tolerated in conventional MEG or MRI and so this again demonstrates the advantages afforded by OPM-MEG…”

      As a note, empirical demonstrations of the motion tolerance of OPM-MEG have been published previously: Early demonstrations included Boto et al. 2 who captured beta oscillations in adults playing a ball game and Holmes et al. who measured visual responses as participants moved their head to change viewing angle3. In more recent demonstrations, Seymour et al. measured the auditory evoked field in standing mobile participants4; Rea et al. measured beta modulation as subjects carried out a naturalistic handwriting task5 and Holmes et al measured beta modulation as a subject walked around a room.6

      Furthermore, while the introduction discusses at some length the phenomenon of PMBR, there is no demonstration of the recording of PMBR (or post-sensory beta rebound). This is a shame because there is literature suggesting an age-sensitivity to this, that the optimal sensitivity of OPM-MEG might confirm/refute. There is little evidence in Figure 3 for adult beta rebound. Is there an explanation for the lack of sensitivity to this phenomenon in children/adolescents? Could a more robust paradigm (button-press) have shed light on this?

      We understand the question. There are two limitations to the current study in respect to measuring the PMBR:

      Firstly, sensory tasks generally do not induce as strong a PMBR as motor tasks and with this in mind a stronger rebound response could have been elicited using a button press. However, it was our intention to scan children down to age 2 and we were sceptical that the youngest children would carry out a button press as instructed. For this reason we opted for entirely passive stimulation, requiring no active engagement from our participants. The advantages of this was a stimulus that all subjects could engage with. However, this was at the cost of a diminished rebound.

      The second limitation relates to trial length. Multiple studies have shown that the PMBR can last over ~10 s 7,8. Indeed, Pfurtscheller et al. argued in 1999 that it was necessary to leave 10 s between movements to allow the PMBR to return to a true baseline9, though this has rarely been adhered to in the literature. Here, we wanted to keep recordings short for the comfort of the younger participants, so we adopted a short trial duration. However, a consequence of this short trial length is that it becomes impossible to access the PMBR directly; one can only measure beta modulation with the task. This limitation has now been addressed explicitly in our discussion:

      “…this was the first study of its kind using OPM-MEG, and consequently aspects of the study design could have been improved. Firstly, the task was designed for children; it was kept short while maximising the number of trials (to maximise signal to noise ratio). However, the classical view of beta modulation includes a PMBR which takes ~10 s to reach baseline following task cessation7–9. Our short trial duration therefore doesn’t allow the rebound to return to baseline between trials, and so conflates PMBR with rest. Consequently, we cannot differentiate the neural generators of the task induced beta power decrease and the PMBR; whilst this helped ensure a short, child friendly task, future studies should aim to use longer rest windows to independently assess which of the two processes is driving age related changes…”

      Data on functional connectivity are valuable but do not rely on OPM recording. They further do not add strength to the argument that OPM MEG is more sensitive to brain activity in smaller heads - in fact, the OPM recordings seem plagued by the same insensitivity observed using conventional systems.

      Given the demonstration above that bursts are not significantly diminished in amplitude in children relative to adults; and further given the demonstrations in the literature (e.g. Seedat et al.10) that functional connectivity is driven by bursts, we would argue that the effects of connectivity changing with age are not related to sensitivity but rather genuinely reflect a lack of coordination of brain activity.

      The discussion of burst vs oscillations, while highly relevant in the field, is somewhat independent of the OPM recording approach and does not add weight to the OPM claims.

      We agree that the burst vs. oscillations discussion does not add weight to the OPM claims per se. However, we had two aims of our paper, the second being to “investigate how task-induced beta modulation in the sensorimotor cortices is related to the occurrence of pan-spectral bursts, and how the characteristics of those bursts change with age.” As the reviewer states, this is highly relevant to the field, and therefore we believe adds impact, not only to the paper, but also by extension to the technology.

      In short, while the theoretical advantages of OPM-MEG are attractive - both in terms of young child sensitivity and in terms of motion tolerance, neither was in fact demonstrated in this manuscript. We are left with a replication of SQUID-MEG observations, which certainly establishes OPM-MEG as "substantially equivalent" to conventional technology but misses the opportunity to empirically demonstrate the much-discussed theoretical advantages/opportunities.

      We thank the referee for the time and important contributions to this paper. We believe the fact that we were able to record good data in children as young as two years old was, in itself, an experimental realisation of the ‘theoretical advantages’ of OPM-MEG. Our additional analyses, inspired by the reviewers comments, help to clarify the advantages of OPM-MEG over conventional technology. The reviewers’ insights have without doubt improved the paper.

      Reviewer #2 (Public Review):

      Summary:

      The authors introduce a new 192-channel OPM system that can be configured using different helmets to fit individuals from 2 to 34 years old. To demonstrate the veracity of the system, they conduct a sensorimotor task aimed at mapping developmental changes in beta oscillations across this age range. Many past studies have mapped the trajectory of beta (and gamma) oscillations in the sensorimotor cortices, but these studies have focused on older children and adolescents (e.g., 9-15 years old) and used motor tasks. Thus, given the study goals, the choice of a somatosensory task was surprising and not justified. The authors recorded a final sample of 27 children (2-13 years old) and 24 adults (21-34 years) and performed a time-frequency analysis to identify oscillatory activity. This revealed strong beta oscillations (decreases from baseline) following the somatosensory stimulation, which the authors imaged to discern generators in the sensorimotor cortices. They then computed the power difference between 0.3-0.8 period and 1.0-1.5 s post-stimulation period and showed that the beta response became stronger with age (more negative relative to the stimulation period). Using these same time windows, they computed the beta burst probability and showed that this probability increased as a function of age. They also showed that the spectral composition of the bursts varied with age. Finally, they conducted a whole-brain connectivity analysis. The goals of the connectivity analysis were not as clear as prior studies of sensorimotor development have not conducted such analyses and typically such whole-brain connectivity analyses are performed on resting-state data, whereas here the authors performed the analysis on task-based data. In sum, the authors demonstrate that they can image beta oscillations in young children using OPM and discern developmental effects.

      Thank you for this summary and for taking the time to review our manuscript.

      Strengths:

      Major strengths of the study include the novel OPM system and the unique participant population going down to 2-year-olds. The analyses are also innovative in many respects.

      Thank you – we also agree that the major strength is in the unique cohort.

      Weaknesses:

      Several weaknesses currently limit the impact of the study. 

      First, the choice of a somatosensory stimulation task over a motor task was not justified. The authors discuss the developmental motor literature throughout the introduction, but then present data from a somatosensory task, which is confusing. Of note, there is considerable literature on the development of somatosensory responses so the study could be framed with that.

      We completely understand the referee’s point, and we agree that the motivation for the somatosensory task was not made clear in our original manuscript.

      Our choice of task was motivated completely by our targeted cohort; whilst a motor task would have been our preference, it was generally felt that making two-year-olds comply with instructions to press a button would have been a significant challenge. In addition, there would likely have been differences in reaction times. By opting for a passive sensory stimulation we ensured compliance, and the same stimulus for all subjects. We have added text on this to our introduction as follows:

      “…Here, we combine OPM-MEG with a burst analysis based on a Hidden Markov Model (HMM) 10–12 to investigate beta dynamics. We scanned a cohort of children and adults across a wide age range (upwards from 2 years old). Because of this, we implemented a passive somatosensory task which can be completed by anyone, regardless of age…”

      We also state in our discussion:

      “…here we chose to use passive (sensory) stimulation. This helped ensure compliance with the task in subjects of all ages and prevented confounds of e.g. reaction time, force, speed and duration of movement which would be more likely in a motor task.7,8 However, there are many other systems to choose and whether the findings here regarding beta bursts and the changes with age also extend to other brain networks remains an open question.…”

      Regarding the neurodevelopmental literature – we are aware of the literature on somatosensory evoked responses – particularly median nerve stimulation – but we can find little on the neurodevelopmental trajectory of somatosensory induced beta oscillations (the topic of our paper). We have edited our introduction as follows:

      “…All these studies probed beta responses to movement execution; in the case of tactile stimulation (i.e. sensory stimulation without movement) both task induced beta power loss, and the post stimulus rebound have been consistently observed in adults9,13–18. Further, beta amplitude in sensory cortex has been related to attentional processes19 and is broadly thought to carry top down top down influence on primary areas20. However, there is less literature on how beta modulation changes with age during purely sensory tasks.…”

      We would be keen for the reviewer to point to any specific papers in the literature that we may have missed.

      Second, the primary somatosensory response actually occurs well before the time window of interest in all of the key analyses. There is an established literature showing mechanical stimulation activates the somatosensory cortex within the first 100 ms following stimulation, with the M50 being the most robust response. The authors focus on a beta decrease (desynchronization) from 0.3-0.8 s which is obviously much later, despite the primary somatosensory response being clear in some of their spectrograms (e.g., Figure 3 in older children and adults). This response appears to exhibit a robust developmental effect in these spectrograms so it is unclear why the authors did not examine it. This raises a second point; to my knowledge, the beta decrease following stimulation has not been widely studied and its function is unknown. The maps in Figure 3 suggest that the response is anterior to the somatosensory cortex and perhaps even anterior to the motor cortex. Since the goal of the study is to demonstrate the developmental trajectory of well-known neural responses using an OPM system, should the authors not focus on the best-understood responses (i.e., the primary somatosensory response that occurs from 0.0-0.3 s)?

      We understand the reviewer’s point. The original aim of our manuscript was to investigate the neurodevelopmental trajectory of beta oscillations, not the evoked response. In fact, the evoked response in this paradigm is complicated by the fact that there are three stimuli in a very short (<500 ms) time window. For this reason, we prefer the focus of our paper to remain on oscillations.

      Nevertheless, we agree that not including the evoked responses was a missed opportunity.  We have now added evoked responses to our analysis pipeline and manuscript. As surmised by the reviewer, the M50 shows neurodevelopmental changes (an increase with age). Our methods section has been updated accordingly and Figure 3 has been modified. The figure and caption are copied below for the convenience of the reviewer.

      Author response image 3.

      Beta band modulation with age: (A) Brain plots show slices through the left motor cortex, with a pseudo-T-statistical map of beta modulation (blue/green) overlaid on the standard brain. Peak MNI coordinates are indicated for each subgroup. Time frequency spectrograms show modulation of the amplitude of neural oscillations (fractional change in spectral amplitude relative to the baseline measured in the 2.5-3 s window). Vertical lines indicate the time of the first braille stimulus. In all cases results were extracted from the location of peak beta desynchronisation (in the left sensorimotor cortex). Note the clear beta amplitude reduction during stimulation. The inset line plots show the 4-40 Hz trial averaged phase-locked evoked response, with the expected prominent deflections around 20 and 50 ms. (B) Maximum difference in beta-band amplitude (0.3-0.8 s window vs 1-1.5 s window) plotted as a function of age (i.e., each data point shows a different participant; triangles represent children, circles represent adults). Note significant correlation (𝑅2 \= 0.29, 𝑝 = 0.00004 *). (C) Amplitude of the P50 component of the evoked response plotted against age. There was no significant correlation (𝑅2 \= 0.04, 𝑝 = 0.14 ). All data here relate to the index finger stimulation; similar results are available for the little finger stimulation in Supplementary Information Section 1.

      Regarding the developmental effects, the authors appear to compute a modulation index that contrasts the peak beta window (.3 to .8) to a later 1.0-1.5 s window where a rebound is present in older adults. This is problematic for several reasons. First, it prevents the origin of the developmental effect from being discerned, as a difference in the beta decrease following stimulation is confounded with the beta rebound that occurs later. A developmental effect in either of these responses could be driving the effect. From Figure 3, it visually appears that the much later rebound response is driving the developmental effect and not the beta decrease that is the primary focus of the study. Second, these time windows are a concern because a different time window was used to derive the peak voxel used in these analyses. From the methods, it appears the image was derived using the .3-.8 window versus a baseline of 2.5-3.0 s. How do the authors know that the peak would be the same in this other time window (0.3-0.8 vs. 1.0-1.5)? Given the confound mentioned above, I would recommend that the authors contrast each of their windows (0.3-0.8 and 1.0-1.5) with the 2.5-3.0 window to compute independent modulation indices. This would enable them to identify which of the two windows (beta decrease from 0.3-0.8 s or the increase from 1.0-1.5 s) exhibited a developmental effect. Also, for clarity, the authors should write out the equation that they used to compute the modulation index. The direction of the difference (positive vs. negative) is not always clear.

      We completely understand the referee’s point; referee 1 made a similar point. In fact, there are two limitations of our paradigm regarding the measurement of PMBR versus the task-induced beta decrease:

      Firstly, sensory tasks generally do not induce as strong a PMBR as motor tasks and with this in mind a stronger rebound response could have been elicited using a button press. However, as described above it was our intention to scan children down to age 2 and we were sceptical that the youngest children would carry out a button press as instructed.

      The second limitation relates to trial length. Multiple studies have shown that the PMBR can last over ~10 s7,8. Indeed, Pfurtscheller et al. argued in 1999 that it was necessary to leave 10 s between movements to allow the PMBR to return to a true baseline9 Here, we wanted to keep recordings relatively short for the younger participants, and so we adopted a short trial duration. However, a consequence of this short trial length is that it becomes impossible to access the PMBR directly because the PMBR of the nth trial is still ongoing when the (n+1)th trial begins. Because of this, there is no genuine rest period, and so the stimulus induced beta decrease and subsequent rebound cannot be disentangled. This limitation has now been made clear in our discussion as follows:

      “…this was the first study of its kind using OPM-MEG, and consequently aspects of the study design could have been improved. Firstly, the task was designed for children; it was kept short while maximising the number of trials (to maximise signal to noise ratio). However, the classical view of beta modulation includes a PMBR which takes ~10 s to reach baseline following task cessation7–9. Our short trial duration therefore doesn’t allow the rebound to return to baseline between trials, and so conflates PMBR with rest. Consequently, we cannot differentiate the neural generators of the task induced beta power decrease and the PMBR; whilst this helped ensure a short, child friendly task, future studies should aim to use longer rest windows to independently assess which of the two processes is driving age related changes…”

      To clarify our method of calculating the modulation index, we have added the following statement to the methods:

      “The beta modulation index was calculated using the equation , where , and are the average Hilbert-envelope-derived amplitudes in the stimulus (0.3-0.8s), post-stimulus (1-1.5s) and baseline (2.5-3s) windows, respectively.”

      Another complication of using a somatosensory task is that the literature on bursting is much more limited and it is unclear what the expectations would be. Overall, the burst probability appears to be relatively flat across the trial, except that there is a sharp decrease during the beta decrease (.3-.8 s). This matches the conventional trial-averaging analysis, which is good to see. However, how the bursting observed here relates to the motor literature and the PMBR versus beta ERD is unclear.

      Again, we agree completely; a motor task would have better framed the study in the context of existing burst literature – but as mentioned above, making 2-year-olds comply with the instructions for a motor task would have been difficult. Interestingly in a recent paper, Rayson et al. used EEG to investigate burst activity in infants (9 and 12 months) and adults during observed movement execution, with results showing stimulus induced decrease in beta burst rate at all ages, with the largest effects in adults21. This paper was not yet published when we submitted our article but does help us to frame our burst results since there is strong agreement between their study and ours. We now mention this study in both our introduction and discussion. 

      Another weakness is that all participants completed 42 trials, but 19% of the trials were excluded in children and 9% were excluded in adults. The number of trials is proportional to the signal-to-noise ratio. Thus, the developmental differences observed in response amplitude could reflect differences in the number of trials that went into the final analyses.

      This is an important observation and we thank the reviewer for raising the issue. We have now re-analysed all of our data, removing trials in the adults such that the overall number of trials was the same as for the children. All effects with age remained significant. We chose to keep the Figures in the main manuscript with all good trials (as previously) and present the additional analyses (with matched trial numbers) in supplementary information. However, if the reviewer feels strongly, we could do it the other way around (there is very little difference between the results).

      Reviewer #3 (Public Review):

      This study demonstrated the application of OPM-MEG in neurodevelopment studies of somatosensory beta oscillations and connections with children as young as 2 years old. It provides a new functional neuroimaging method that has a high spatial-temporal resolution as well wearable which makes it a new useful tool for studies in young children. They have constructed a 192-channel wearable OPM-MEG system that includes field compensation coils which allow free head movement scanning with a relatively high ratio of usable trials. Beta band oscillations during somatosensory tasks are well localized and the modulation with age is found in the amplitude, connectivity, and panspectral burst probability. It is demonstrated that the wearable OPM-MEG could be used in children as a quite practical and easy-to-deploy neuroimaging method with performance as good as conventional MEG. With both good spatial (several millimeters) and temporal (milliseconds) resolution, it provides a novel and powerful technology for neurodevelopment research and clinical applications not limited to somatosensory areas.

      We thank the reviewer for their summary, and their time in reviewing our manuscript.

      The conclusions of this paper are mostly well supported by data acquired under the proper method. However, some aspects of data analysis need to be improved and extended.

      (1) The colour bars selected for the pseudo-T-static pictures of beta modulation in Figures 2 and 3, which are blue/black and red/black, are not easily distinguished from the anatomical images which are grey-scale. A colour bar without black/white would make these figures better. The peak point locations are also suggested to be marked in Figure 2 and averaged locations in Figure 3 with an error bar.

      Thank you for this comment which we certainly agree with. The colour scheme used has now been changed to avoid black. We have also added peak locations. 

      (2) The data points in plots are not constant across figures. In Figures 3 and 5, they are classified into triangles and circles for children and adults, but all are circles in Figures 4 and 6.

      Thank you! We apologise for the confusion. Data points are now consistent across plots.

      (3) Although MEG is much less susceptible to conductivity inhomogeneity of the head than EEG, the forward modulating may still be impacted by the small head profile. Add more information about source localization accuracy and stability across ages or head size.

      This is an excellent point. We have added to our discussion relating to the accuracy of the forward model. 

      “…We failed to see a significant difference in the spatial location of the cortical representations of the index and little finger; there are three potential reasons for this. First, the system was not designed to look for such a difference – sensors were sparsely distributed to achieve whole head coverage (rather than packed over sensory cortex to achieve the best spatial resolution in one area22). Second, our “pseudo-MRI” approach to head modelling (see Methods) is less accurate than acquisition of participantspecific MRIs, and so may mask subtle spatial differences. Third, we used a relatively straightforward technique for modelling magnetic fields generated by the brain (a single shell forward model). Although MEG is much less susceptible to conductivity inhomogeneity of the head than EEG, the forward model may still be impacted by the small head profile. This may diminish spatial resolution and future studies might look to implement more complex models based on e.g. finite element modelling23. Finally, previous work 24 suggested that, for a motor paradigm in adults, only the beta rebound, and not the power reduction during stimulation, mapped motortopically. This may also be the case for purely sensory stimulation. Nevertheless, it remains the case that by placing sensors closer to the scalp, OPM-MEG should offer improved spatial resolution in children and adults; this should be the topic of future work…”

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      Major items to further test include the differing number of trials, the windowing issue, and the focus on motor findings in the intro and discussion. First, I would recommend the authors adjust the number of trials in adults to equate them between groups; this will make their developmental effects easier to interpret.  

      Thank you for raising this important point. This has now been done and appears in our supplementary information as discussed above.

      Second, to discern which responses are exhibiting developmental effects, the authors need to contrast the 0.3-0.8 window with the later window (2.5-3.0), not the window that appears to have the PMBR-like response. This artificially accentuates the response. I also think they should image the 1.0-1.5 vs 2.5-3.0s window to determine whether the response in this time window is in the same location as the decrease and then contrast this for beta differences. 

      We completely understand this point, which relates to separating the reduction in beta amplitude during stimulation and the rebound post stimulation. However, as explained above, doing so unambiguously would require the use of much longer trials. Here we were only able to measure stimulus induced beta modulation (distinct from the separate contributions of the task induced beta power reduction and rebound). It may be that future studies, with >10 s trial length, could probe the role of the PMBR, but such studies require long paradigms which are challenging to implement with children.

      Third, changing the framing of the study to highlight the somatosensory developmental literature would also be an improvement.

      We have added to our introduction a stated in the responses above.

      Finally, the connectivity analysis on data from a somatosensory task did not make sense given the focus of the study and should be removed in my opinion. It is very difficult to interpret given past studies used resting state data and one would expect the networks to dynamically change during different parts of the current task (i.e., stimulation versus baseline).

      We appreciate the point regarding connectivity. However, it was our intention to examine the developmental trajectory of beta oscillations, and a major role of beta oscillations is in mediating connectivity. It is true that most studies are conducted in the resting state (or more recently – particularly in children – during movie watching). The fact that we had a sensory task running is a confound; nevertheless, the connectivity we derived in adults bears a marked similarity to that from previous papers (e.g. 25) and we do see significant changes with age. We therefore believe this to be an important addition to the paper and we would prefer to keep it.

      References

      (1) Holmes, N., Bowtell, R., Brookes, M. J. & Taulu, S. An Iterative Implementation of the Signal Space Separation Method for Magnetoencephalography Systems with Low Channel Counts.

      Sensors 23, 6537 (2023).

      (2) Boto, E. et al. Moving magnetoencephalography towards real-world applications with a wearable system. Nature (2018) doi:10.1038/nature26147.

      (3) Holmes, M. et al. A bi-planar coil system for nulling background magnetic fields in scalp mounted magnetoencephalography. NeuroImage 181, 760–774 (2018).

      (4) Seymour, R. A. et al. Using OPMs to measure neural activity in standing, mobile participants. NeuroImage 244, 118604 (2021).

      (5) Rea, M. et al. A 90-channel triaxial magnetoencephalography system using optically pumped magnetometers. annals of the new york academy of sciences 1517, https://doi.org/10.1111/nyas.14890 (2022).

      (6) Holmes, N. et al. Enabling ambulatory movement in wearable magnetoencephalography with matrix coil active magnetic shielding. NeuroImage 274, 120157 (2023).

      (7) Pakenham, D. O. et al. Post-stimulus beta responses are modulated by task duration. NeuroImage 206, 116288 (2020).

      (8) Fry, A. et al. Modulation of post-movement beta rebound by contraction force and rate of force development. Human Brain Mapping 37, 2493–2511 (2016).

      (9) Pfurtscheller, G. & Lopes da Silva, F. H. Event-related EEG/MEG synchronization and desynchronization: Basic principles. Clin Neurophysio 110, 1842–1857 (1999).

      (10) Seedat, Z. A. et al. The role of transient spectral ‘bursts’ in functional connectivity: A magnetoencephalography study. NeuroImage 209, 116537 (2020).

      (11) Baker, A. P. et al. Fast transient networks in spontaneous human brain activity. eLife 2014, 1867 (2014).

      (12) Vidaurre, D. et al. Spectrally resolved fast transient brain states in electrophysiological data. NeuroImage 126, 81–95 (2016).

      (13) Gaetz, W. & Cheyne, D. Localization of sensorimotor cortical rhythms induced by tactile stimulation using spatially filtered MEG. NeuroImage 30, 899–908 (2006).

      (14) Cheyne, D. et al. Neuromagnetic imaging of cortical oscillations accompanying tactile stimulation. Cognitive Brain Research 17, 599–611 (2003).

      (15) van Ede, F., Jensen, O. & Maris, E. Tactile expectation modulates pre-stimulus β-band oscillations in human sensorimotor cortex. NeuroImage 51, 867–876 (2010).

      (16) Salenius, S., Schnitzler, A., Salmelin, R., Jousmäki, V. & Hari, R. Modulation of Human Cortical Rolandic Rhythms during Natural Sensorimotor Tasks. NeuroImage 5, 221–228 (1997).

      (17) Cheyne, D. O. MEG studies of sensorimotor rhythms: A review. Experimental Neurology 245, 27–39 (2013).

      (18) Kilavik, B. E., Zaepffel, M., Brovelli, A., MacKay, W. A. & Riehle, A. The ups and downs of beta oscillations in sensorimotor cortex. Experimental Neurology 245, 15–26 (2013).

      (19) Bauer, M., Oostenveld, R., Peeters, M. & Fries, P. Tactile Spatial Attention Enhances Gamma-Band Activity in Somatosensory Cortex and Reduces Low-Frequency Activity in Parieto-Occipital Areas. J. Neurosci. 26, 490–501 (2006).

      (20) Barone, J. & Rossiter, H. E. Understanding the Role of Sensorimotor Beta Oscillations. Frontiers in Systems Neuroscience 15, (2021).

      (21) Rayson, H. et al. Bursting with Potential: How Sensorimotor Beta Bursts Develop from Infancy to Adulthood. J Neurosci 43, 8487–8503 (2023).

      (22) Hill, R. M. et al. Optimising the Sensitivity of Optically-Pumped Magnetometer Magnetoencephalography to Gamma Band Electrophysiological Activity. Imaging Neuroscience (2024) doi:10.1162/imag_a_00112.

      (23) Stenroos, M., Hunold, A. & Haueisen, J. Comparison of three-shell and simplified volume conductor models in magnetoencephalography. NeuroImage 94, 337–348 (2014).

      (24) Barratt, E. L., Francis, S. T., Morris, P. G. & Brookes, M. J. Mapping the topological organisation of beta oscillations in motor cortex using MEG. NeuroImage 181, 831–844 (2018).

      (25) Rier, L. et al. Test-Retest Reliability of the Human Connectome: An OPM-MEG study. Imaging Neuroscience (2023) doi:10.1162/imag_a_00020.

    1. Author response:

      The following is the authors’ response to the original reviews

      Joint Public Review:

      In this work, the authors develop a new computational tool, DeepTX, for studying transcriptional bursting through the analysis of single-cell RNA sequencing (scRNA-seq) data using deep learning techniques. This tool aims to describe and predict the transcriptional bursting mechanism, including key model parameters and the steady-state distribution associated with the predicted parameters. By leveraging scRNA-seq data, DeepTX provides high-resolution transcriptional information at the single-cell level, despite the presence of noise that can cause gene expression variation. The authors apply DeepTX to DNA damage experiments, revealing distinct cellular responses based on transcriptional burst kinetics. Specifically, IdU treatment in mouse stem cells increases burst size, promoting differentiation, while 5FU affects burst frequency in human cancer cells, leading to apoptosis or, depending on the dose, to survival and potential drug resistance. These findings underscore the fundamental role of transcriptional burst regulation in cellular responses to DNA damage, including cell differentiation, apoptosis, and survival. Although the insights provided by this tool are mostly well supported by the authors' methods, certain aspects would benefit from further clarification.

      The strengths of this paper lie in its methodological advancements and potential broad applicability. By employing the DeepTXSolver neural network, the authors efficiently approximate stationary distributions of mRNA counts through a mixture of negative binomial distributions, establishing a simple yet accurate mapping between the kinetic parameters of the mechanistic model and the resulting steady-state distributions. This innovative use of neural networks allows for efficient inference of kinetic parameters with DeepTXInferrer, reducing computational costs significantly for complex, multi-gene models. The approach advances parameter estimation for high-dimensional datasets, leveraging the power of deep learning to overcome the computational expense typically associated with stochastic mechanistic models. Beyond its current application to DNA damage responses, the tool can be adapted to explore transcriptional changes due to various biological factors, making it valuable to the systems biology, bioinformatics, and mechanistic modelling communities. Additionally, this work contributes to the integration of mechanistic modelling and -omics data, a vital area in achieving deeper insights into biological systems at the cellular and molecular levels.  

      We thank the reviewers for their positive opinion on our manuscript. As reflected in our detailed responses to the reviewers’ comments, we will make significant changes to address their concerns comprehensively.

      This work also presents some weaknesses, particularly concerning specific technical aspects. The tool was validated using synthetic data, and while it can predict parameters and steady-state distributions that explain gene expression behaviour across many genes, it requires substantial data for training. The authors account for measurement noise in the parameter inference process, which is commendable, yet they do not specify the exact number of samples required to achieve reliable predictions. Moreover, the tool has limitations arising from assumptions made in its design, such as assuming that gene expression counts for the same cell type follow a consistent distribution. This assumption may not hold in cases where RNA measurement timing introduces variability in expression profiles.

      Thank reviewers for detailed and constructive feedback on our work. We will address the key concerns raised from the following points:

      (1) Clarification on the required sample size: We tested the robustness of our inference method on simulated datasets by varying the number of single-cell samples. Our results indicated that the predictions of burst kinetics parameters become accurate when the number of cells reaches 500 (Supplementary Figure S3d, e). This sample size is smaller than the data typically obtained with current single-cell RNA sequencing (scRNA-seq) technologies, such as 10x Genomics and Smart-seq3 (Zheng GX et al., 2017; Hagemann-Jensen M et al., 2020). Therefore, we believed that our algorithm is well-suited for inferring burst kinetics from existing scRNA-seq datasets, where the sample size is sufficient for reliable predictions. We will clarify this point in the main text to make it easier for readers to use the tool.

      (2) Assumption-related limitations: One of the fundamental assumptions in our study is that the expression counts of each gene are independently and identically distributed (i.i.d.) among cells, which is a commonly adopted assumption in many related works (Larsson AJM et al., 2019; Ochiai H et al., 2020; Luo S et al., 2023). However, we acknowledged the limitations of this assumption. The expression counts of the same gene in each cell may follow distinct distributions even from the same cell type, and dependencies between genes could exist in realistic biological processes. We recognized this and will deeply discuss these limitations from assumptions and prospect as an important direction for future research.  

      The authors present a deep learning pipeline to predict the steady-state distribution, model parameters, and statistical measures solely from scRNA-seq data. Results across three datasets appear robust, indicating that the tool successfully identifies genes associated with expression variability and generates consistent distributions based on its parameters. However, it remains unclear whether these results are sufficient to fully characterise the transcriptional bursting parameter space. The parameters identified by the tool pertain only to the steady-state distribution of the observed data, without ensuring that this distribution specifically originates from transcriptional bursting dynamics.

      We appreciate reviewers’ comments and the opportunity to clarify our study’s contributions and limitations. Although we agree that assessing whether the results from these three realistic datasets can represent the characterize transcriptional burst parameter space is challenging, as it depends on data property and conditions in biology, we firmly believe that DeepTX has the capacity to characterize the full parameter space. This believes stems from the extensive parameters and samples we input during model training and inference across a sufficiently large parameter range (Method 1.3). Furthermore, the training of the model is both flexible and scalable, allowing for the expansion of the transcriptional burst parameter space as needed. We will clarify this in the text to enable readers to use DeepTX more flexibly.

      On the other hand, we agree that parameter identification is based on the steady-state distribution of the observed data (static data), which loses information about the fine dynamic process of the burst kinetics. In principle, tracking the gene expression of living cells can provide the most complete information about real-time transcriptional dynamics across various timescales (Rodriguez J et al., 2019).

      However, it is typically limited to only a small number of genes and cells, which could not investigate general principles of transcriptional burst kinetics on a genome-wide scale. Therefore, leveraging the both steady-state distribution of scRNA-seq data and mathematical dynamic modelling to infer genome-wide transcriptional bursting dynamics represents a critical and emerging frontier in this field. For example, the statistical inference framework based on the Markovian telegraph model, as demonstrated in (Larsson AJM et al., 2019), offers a valuable paradigm for understanding underlying transcriptional bursting mechanisms. Building on this, our study considered a more generalized non-Mordovian model that better captures transcriptional kinetics by employing deep learning method under conditions such as DNA damage. This provided a powerful framework for comparative analyses of how DNA damage induces alterations in transcriptional bursting kinetics across the genome. We will highlight the limitations of current inference using steady-state distributions in the text and look ahead to future research directions for inference using time series data across the genome.

      A primary concern with the TXmodel is its reliance on four independent parameters to describe gene state-switching dynamics. Although this general model can capture specific cases, such as the refractory and telegraph models, accurately estimating the parameters of the refractory model using only steadystate distributions and typical cell counts proves challenging in the absence of time-dependent data.

      We thank reviewers for highlighting this critical concern regarding the TXmodel's reliance on four independent parameters to describe gene state-switching dynamics. We acknowledge that estimating the parameters of the TXmodel using only steady-state distributions and typical single-cell RNA sequencing (scRNA-seq) data poses significant challenges, particularly in the absence of timeresolved measurements.

      As described in the response of last point, while time-resolved data can provide richer information than static scRNA-seq data, it is currently limited to a small number of genes and cells, whereas static scRNA-seq data typically capture genome-wide expression. Our framework leverages deep learning methods to link mechanistic models with static scRNA-seq data, enabling the inference of genome-wide dynamic behaviors of genes. This provides a potential pathway for comparative analyses of transcriptional bursting kinetics across the entire genome.

      Nonetheless, the refractory model and telegraphic model are important models for studying transcription bursts. We will discuss and compare them in terms of the accuracy of inferred parameters.

      Certainly, we agree that inferring the molecular mechanisms underlying transcriptional burst kinetics using time-resolved data remains a critical future direction. We will include a brief discussion on the role and importance of time-resolved data in addressing these challenges in the discussion section of the revised manuscript.

      The claim that the GO analysis pertains specifically to DNA damage response signal transduction and cell cycle G2/M phase transition is not fully accurate. In reality, the GO analysis yielded stronger p-values for pathways related to the mitotic cell cycle checkpoint signalling. As presented, the GO analysis serves more as a preliminary starting point for further bioinformatics investigation that could substantiate these conclusions. Additionally, while GSEA analysis was performed following the GO analysis, the involvement of the cardiac muscle cell differentiation pathway remains unclear, as it was not among the GO terms identified in the initial GO analysis.

      We thank the reviewer for this valuable feedback and for pointing out the need for clarification regarding the GO and GSEA analyses. We agree that the connection between the cardiac muscle cell differentiation pathway identified in the GSEA analysis and the GO terms from the initial analysis requires further clarification. This discrepancy arises because GSEA examines broader sets of pathways and may capture biological processes not highlighted by GO analysis due to differences in the statistical methods and pathway definitions used. We will revise the manuscript to address this point, explicitly discussing the distinct yet complementary nature of GO and GSEA analyses and providing a clearer interpretation of the results.

      As the advancement is primarily methodological, it lacks a comprehensive comparison with traditional methods that serve similar functions. Consequently, the overall evaluation of the method, including aspects such as inference accuracy, computational efficiency, and memory cost, remains unclear. The paper would benefit from being contextualised alongside other computational tools aimed at integrating mechanistic modelling with single-cell RNA sequencing data. Additional context regarding the advantages of deep learning methods, the challenges of analysing large, high-dimensional datasets, and the complexities of parameter estimation for intricate models would strengthen the work.

      We greatly appreciate your insightful feedback, which highlights important considerations for evaluating and contextualizing our methodological advancements. Below, we emphasize our advantages from both the modeling perspective and the inference perspective compared with previous model. As our work is rooted in a model-based approach to describe the transcriptional bursting process underlying gene expression, the classic telegraph model (Markovian) and non-Markovian models which are commonly employed are suitable for this purpose:

      Classic telegraph model: The classic telegraph model allows for the derivation of approximate analytical solutions through numerical integration, enabling efficient parameter point estimation via maximum likelihood methods, e.g., as explored in (Larsson AJM et al., 2019). Although exact analytical solutions for the telegraph model are not available, certain moments of its distribution can be explicitly derived. This allows for an alternative approach to parameter inference using moment-based estimation methods, e.g., as explored in (Ochiai H et al., 2020). However, it is important to note that higher-order sample moments can be unstable, potentially leading to significant estimation bias. 

      Non-Markovian Models: For non-Markovian models, analytical or approximate analytical solutions remain elusive. Previous work has employed pseudo-likelihood approaches, leveraging statistical properties of the model’s solutions to estimate parameters ,e.g., as explored in (Luo S et al., 2023).

      However, the method may suffer from low inference efficiency. 

      In our current work, we leverage deep learning to estimate parameters of TXmodel, which is nonMarkovian model. First, we represent the model's solution as a mixture of negative binomial distributions, which is obtained by the deep learning method. Second, through integration with the deep learning architecture, the model parameters can be optimized using automatic differentiation, significantly improving inference efficiency. Furthermore, by employing a Bayesian framework, our method provides posterior distributions for the estimated dynamic parameters, offering a comprehensive characterization of uncertainty. Compared to traditional methods such as moment-based estimation or pseudo-likelihood approaches, we believe our approach not only achieves higher inference efficiency but also delivers posterior distributions for kinetics parameters, enhancing the interpretability and robustness of the results. We will present and emphasize the computational efficiency and memory cost of our methods the revised version.

      Recommendations for the authors:

      There are various noise sources in biological progress. How transcriptional bursting fits within those as well as the reasons to focus only on this source needs to be clearly discussed in the introduction of the manuscript. Related to this last point, transcriptional bursting might not be the only mechanism to take advantage of the stochastic nature of biomolecular processes to make decisions. Once again, what are the implications of assuming this as the underlying mechanism?

      Thank the reviewer for this valuable comment. We fully agree that biological systems are subject to multiple stochastic sources, which arise from both intrinsic and extrinsic noise (Eling N et al., 2019). Intrinsic noise is primarily driven by the stochastic biochemical effects that directly influence mRNA and protein expression in a gene-specific manner, such as DNA, epigenetic, transcription, and translation levels. Extrinsic noise arises from fluctuations in cell-specific manners, such as changes in cell size, cell cycle, or cell signaling. Given that DNA damage most directly perturbs transcription and translation processes, focusing on intrinsic noise sources is appropriate for mechanistically modeling gene-specific expression variability, particularly since this variability can be captured at the genome-wide scale by scRNA-seq data.

      Among various intrinsic noise sources, transcriptional bursting offers a mechanistically wellcharacterized and quantifiable representation of gene expression variability (Tunnacliffe E & Chubb JR, 2020). It reflects the dynamic switching between active and inactive gene states and has been observed consistently across prokaryotic and eukaryotic cells (Eling N et al., 2019). Moreover, transcriptional bursting kinetics, defined by burst size and frequency, can be inferred from scRNA-seq data at the singlegene level using steady-state assumptions, making it an analytically tractable and biologically meaningful feature for large-scale inference (Rodriguez J & Larson DR, 2020).

      We acknowledge that transcriptional bursting is not the only mechanism through which cells can utilize stochasticity for fate decisions. Other processes, such as translational noise and chromatin accessibility, may also contribute. However, given the data modality (static scRNA-seq) and the established theoretical framework for bursting, we assume transcriptional bursting as a representative and interpretable proxy of stochastic regulation. This assumption enables us to extract meaningful insights while remaining open to future model extensions, incorporating additional regulatory layers as more data types become available.

      In this version of the manuscript, we have revised the introduction section to better clarify the rationale of this assumption and to more explicitly emphasize the important role of transcriptional bursting within stochastic noise.

      More careful discussion of how the proposed method differentiates from previous work that employs scRNA-seq to elucidate the diverse sources of noise (pp.3).

      Thank the reviewer for this suggestion. Our proposed method differs significantly from previous work that utilizes scRNA-seq data to study diverse noise sources from several aspects (Ochiai H et al., 2020; Eling N et al., 2019; Morgan MD & Marioni JC, 2018). Specifically, DeepTX infers genomewide burst kinetics by directly matching the full steady-state distribution of a mechanistic stochastic model to the observed scRNA-seq data, rather than relying solely on low-order statistics such as mean and variance. Moreover, by adopting a non-Markovian process that allows multi-step promoter switching, DeepTX extends beyond the classic telegraph model to better capture the complex molecular events underlying transcriptional activation and repression. Crucially, we used a deep-learning–based solver to obtain these intractable steady-state distributions rapidly and accurately. This combination of richer data usage, more realistic mechanistic assumptions, and scalable neural-network–accelerated computation lays the groundwork for incorporating additional noise sources into a unified inference framework in future work. 

      In this version of the manuscript, we have revised the discussion section to highlight the difference with previous works.

      The paper could benefit from being contextualised alongside other computational tools that aim to integrate mechanistic modelling with single-cell RNA sequencing data. This is an active area of research, and works such as Sukys and Grima (bioRxiv, 2024), Garrido-Rodriguez et al. (PLOS Computational Biology, 2021), Maizels (2024), and others could provide valuable context.

      Thank the reviewer for suggesting these relevant works. Garrido-Rodriguez et al. (PLOS Comput. Biol., 2021) integrated single-cell and bulk transcriptomic data into mechanistic pathway models to infer signaling dynamics, an approach complementary to our mapping of burst kinetic parameters onto pathway enrichment for linking transcriptional bursting to functional outcomes. Sukys and Grima et al. (bioRxiv, 2024; Now in Nucleic Acids Res., 2025) demonstrated that cell-cycle stage and cellular age significantly modulate burst frequency and size, highlighting the potential to enhance DeepTX by incorporating cell-cycle–dependent variability into genome-wide burst inference. Maizels et al. (Philos. Trans. R. Soc. Lond. B. Biol. Sci., 2024) reviewed methods for capturing single-cell temporal dynamics across multi-omic modalities, underscoring how higher time-resolved data could refine and validate steady-state burst inference frameworks to better resolve causal gene-expression mechanisms.

      We have cited these studies on the contextual relevance to DeepTX in the discussion sections.

      As the advancement is primarily methodological, it lacks a comprehensive comparison with traditional methods that serve similar functions. Consequently, the overall evaluation of the method, including aspects such as inference accuracy, computational efficiency, and memory cost, remains unclear. We suggest incorporating these experiments to provide readers with a more complete understanding of the proposed method's performance.

      Thank the reviewer for constructive suggestion regarding a comprehensive comparison with other previous methods. To address this problem, in this version, we compared DeepTX with our previous work, txABC, that utilized approximate Bayesian computation to infer parameters from the generalized telegraph model (Luo S et al., 2023). As a result, DeepTX achieved improvements in inference accuracy and computational efficiency (Supplementary Figure S4.). For memory cost during single-gene inference, DeepTX requires an average memory usage of approximately 70 MB, whose memory consumption accounts for only a small fraction of the total available memory on standard computing devices (typically exceeding 10 GB), while exhibiting superior inference efficiency compared to txABC. We have mentioned in the third result section.

      Discuss the validity of the assumption of the static snapshot provided by the scRNA-seq data as in steadystate (i.e., stationary distribution), and the implications of this assumption being untrue (for the proposed method).

      We thank the reviewer for the comment regarding the stationary assumption. We assume that each scRNA-seq snapshot approximates the steady-state (stationary) distribution of transcript counts because (i) typical single-cell experiments sample large, asynchronously dividing populations that collectively traverse many transcriptional burst cycles, and (ii) in the absence of a synchronized perturbation, mRNA production and degradation reach a dynamic balance on timescales much shorter than overall cell-type changes. Under these conditions, the empirical count distribution closely mirrors the model’s stationary solution, justifying steady-state inference of burst size and frequency from a single time point. This assumption is commonly adopted in probabilistic models of transcriptional bursting (Larsson AJM et al., 2019; Raj A & van Oudenaarden A, 2008).

      However, this steady-state assumption has some limitations. First, in some scenarios, the cell system may exhibit highly transient transcriptional programs that do not satisfy stationarity, leading to biased or misleading parameter estimates. For example, immediately following a synchronized developmental stimulus—such as serum shock–induced activation of immediate-early genes. Second, because DeepTX infers the mean burst frequency and size across the population, it cannot recover the underlying time-resolved dynamics or distinguish heterogeneous kinetic subpopulations. 

      We have added a statement in the discussion to acknowledge these limitations and suggest future extensions—such as incorporating time-series measurements or latent pseudo time covariates—to address non-stationarity and recover temporal burst dynamics.

      On page 3, "traditional telegraph model" is mentioned without any context. This model, and particularly the implications for the current work, might not be obvious to the reader. Take one or two sentences to give the reader context.

      Thank the reviewer for this helpful comment. We acknowledge that the mention of the "traditional telegraph model" on page 3 may not be immediately clear to all readers. The traditional telegraph model is a mathematical framework commonly used to describe gene expression burst dynamics, in which genes stochastically switch between active (ON) and inactive (OFF) states, with exponentially distributed waiting times for state transitions. To provide the necessary context, we added a brief introduction to the traditional telegraph model and its relevance to our work in the revised manuscript.

      A primary concern with the model used in Figure 2a (TXmodel) is its reliance on four independent parameters to describe gene state switching dynamics. While this general model can encompass specific cases such as the refractory model (Science 332, 472 (2011)) and the telegraph model, accurately estimating the parameters of the refractory model using only steady-state distributions and typical cell numbers (10³-10⁴) is challenging without time-dependent data. To address this, we suggest that the authors provide parameter inference results for each individual parameter, rather than only for burst size and burst frequency, based on synthetic data. This would help clarify the model's effectiveness and improve understanding of its estimation precision.

      Thank the reviewer for highlighting this important concern. We agree that the lack of timeresolved measurements may affect the accuracy of inferences about dynamic parameters, especially the unidentifiability of parameters inferred from steady-state distributions, i.e., multiple parameters leading to the same steady-state distribution. The unidentifiability of individual parameters is a common and critical problem in systems biology studies. To address this issue, for example, Trzaskoma et al. developed StochasticGene, a computationally efficient software suite that uses Bayesian inference to analyze arbitrary gene regulatory models and quantify parameter uncertainty across diverse data types (Trzaskoma P et al., 2024). Alexander et al. adopt a Bayesian approach to parameter estimation by incorporating prior knowledge through a prior distribution and classify a parameter as practically nonidentifiable if it cannot be uniquely determined beyond the confidence already provided by the prior (Browning AP et al., 2020). Hence, in DeepTX, we employed a Bayesian approach based on loss potential to infer the posterior distributions of the parameters (Figure 3E). 

      Although DeepTX also encounters the issue of unidentifiability for individual parameters (Supplementary Figure S11), the multimodal nature of the posterior distribution suggests that multiple distinct parameter sets can produce similarly good fits to the observed data, highlighting the inherent non-identifiability of the model. Nevertheless, in the multimodal posterior distribution, at least one of the posterior peaks aligns closely with the ground truth, thereby demonstrating the validity of the inferred result. Moreover, inference results on synthetic data confirm that the BS and BF can be accurately estimated (Supplementary Figure S3b and S3c). We also performed robustness analyses on synthetic datasets. As shown in Supplementary Figure S3d and S3e, our model reliably recovers the ground-truth burst kinetics of models when the number of cells reaches ~1000, which is within the range of typical single-cell RNA-seq experiments. 

      We have explicitly pointed out the potential issue of unidentifiability due to the lack of temporal resolution information in the discussion section. 

      Noteworthy, transcriptional is always a multi-step process (depending on the granularity with which the process is described). What do the authors mean by saying that "DNA damage turns transcription into a multi-step process rather than a single-step process"?

      Thank the reviewer for pointing out the lack of precision in our original statement. We agree that the phrasing could be misleading. Transcription is inherently a multi-step process, but most mechanistic studies simplify it to a single-step “telegraph” model for tractability. In the context of DNA damage, however, damage-induced pausing and repair-mediated delays introduce additional intermediary states in the transcription cycle that cannot be approximated by a single step. To capture these damage-specific interruptions, DeepTX explicitly consider a multi-step promoter switching framework rather than combining all transitions into one. What we originally wanted to express was the necessity of multi-step process modeling. We have replaced the original sentence in introduction with: “However, the presence of DNA damage necessitates modeling the transcriptional process as a multistep process, rather than a single-step process, to capture the additional complexity introduced by the damage”.

      It is unclear why the authors have chosen a different definition in Equation (2) rather than the commonly used burst frequency, 1/(k_deg * tau_off), as reported in the literature. Unlike the traditional definition, which is unit-free, the definition in Eq. (2) includes units, raising questions about its interpretability and consistency with established conventions. Clarifying this choice would improve the understanding and consistency of the methodology.

      Thank the reviewer for raising this important point. We acknowledge that there are multiple definitions of burst frequency (BF) in the literature. Here, we provide a detailed explanation, clarifying the differences between these definitions, including the one used and the traditional definition .

      First, the definition of burst frequency we adopt has been widely used in recent literatures, such as Benjamin Zoller et al. (Zoller B et al., 2018), Caroline Hoppe et al. (Hoppe C et al., 2020) and Daniel Ramsköld (Ramsköld D et al., 2024). And its quantity represents the average time it takes for the promoter to complete one full stochastic cycle between its active and inactive states . Secondly, the traditional definition can be regarded as a simplified version of our definition, under the assumptions that τ<sub>on</sub> is negligible and k<sub>deg</sub> =1 (i.e., rate parameters are normalized to be unit-free). Although it is reasonable to neglecting activate time τ<sub>on</sub>, as it is typically much shorter than inactive time under some conditions, we chose a more complete way to define the burst frequency so that it is applicable to more general situations. In addition, by defining the burst frequency as , the mean transcription level can be analytically represented as the product of burst size and burst frequency.

      This explanation has been clarified in the methods 1.2 section.

      The authors mention the need to model "more realistic gene expression processes". How is this exactly being incorporated into the model?

      Thank the reviewer for raising this important question. To incorporate "more realistic gene expression processes" into our model, we considered two critical aspects into DeepTX that are often oversimplified in traditional approaches:

      (1) Integration of gene expression and sequencing processes: Observations from scRNA-seq data are influenced by both the intrinsic gene expression processes and the subsequent sequencing procedure. Traditional models often focus solely on gene expression, neglecting the stochastic effects introduced by the sequencing process. Our model explicitly incorporates both the gene expression and sequencing processes, providing a more comprehensive and realistic representation of the observed data.

      (2) Modeling gene expression as a multi-step process: Gene expression is inherently a multi-step process. However, traditional telegraph models typically simplify gene state switching as a single-step process for tractable analysis, often assuming Markovian dynamics where transition waiting times follow exponential distributions. In contrast, our model accounts for the multi-step nature of gene state transitions by allowing the waiting times to follow non-exponential (non-Markovian) distributions. This model is more suitable for gene expression dynamics that cannot be simplified to a single-step process, such as DNA damage, which may introduce an intermediate state to represent pausing and repair in the transcription process.

      By addressing these factors, our model better reflects the complexity and stochastic nature of gene expression processes, aligning more closely with the data generated from biological systems. We have added detailed explanations after this sentence for clarification in the first result section.

      Better explanation of the previously developed TXmodel, and the assumption of a non-Markovian system. In particular, it isn't clear how using arbitrary distributions for the waiting times implies a non-Markovian process (as the previous state(s) of the system is not used to inform the transition probability, at least as explained in pp. 4). Without a clear discussion of the so-called arbitrary waiting time distribution, it isn't clear how these represent a mechanistic model. In general, a more careful discussion of the "mechanistic" model is needed.

      Thank the reviewer for this thoughtful comment. In this revised version, we provided a more detailed explanation of the relationship between the TXmodel and the non-Markovian system in the revised manuscript. Specifically, we will clarify the following points:

      (1) Why non-Markovian system: In a Markovian system, the waiting times for events are exponentially distributed, meaning that the state transitions depend solely on the current state and are memoryless (Van Kampen NG, 1992). However, when the waiting times follow non-exponential distributions, such as Gamma or Weibull distributions, the state transitions are no longer independent of the system's previous states. This introduces memory into the system, making it non-Markovian.

      (2) Why mechanistic model: First, it is important to clarify that regardless of whether the waiting time is arbitrary or exponential (corresponding to non-Markovian and Markovian systems), our TXmodel is a mechanistic model because it models the dynamic process of transcription bursts with interpretable kinetic parameters. Second, although we introduced arbitrarily distributed waiting times, reasonable selection of waiting time distributions can still make the distribution parameters mechanistically interpretable. For example, in the context of modeling ON and OFF state switching times using a Gamma distribution, the two parameters have clear interpretations: the shape parameter represents the number of sequential exponential (memoryless) steps required for the transition to occur, capturing the complexity or multi-step nature of the switching process, while the scale parameter denotes the average duration of each of these steps. We have added the explanation in methods 1.2 section.

      Include a brief discussion about the metric used to compare distributions (and introduce KL abbreviation).

      Thank the reviewer for this suggestion. In the second result and methods 1.3 section of revised manuscript, we have included a brief discussion to introduce and clarify the metric used to compare distributions. Specifically, we have given more explanation for the Kullback-Leibler (KL) divergence, which is a widely used metric for quantifying the difference between two probability distributions. We also ensured that the abbreviation "KL" is properly introduced when it first appears in the text, along with a concise description of its mathematical definition and interpretation within the context of our analysis. 

      What does the "CTM" model stand for (in supplementary information)? And "TX" model?

      Thank the reviewer for highlighting this point. We revised the supplementary information to explicitly define the "CTM" and "TX" models and clarify their distinctions.

      CTM model: The "CTM" model refers to the classic telegraph model, a widely used model for capturing Markovian gene expression burst kinetics. The CTM describes stochastic gene expression as a sequence of four biochemical reactions involving two gene states (ON and OFF), mRNA transcription and degradation:

      k<sub>off</sub> as the rate at which the gene switches from OFF to ON, k<sub>on</sub>  as the rate at which the gene switches from ON to OFF, k<sub>syn</sub>  as the rate of mRNA synthesis and k<sub>deg</sub>  as the rate of mRNA degradation. In this model, gene switching between active and inactive states is governed by a memoryless Markovian process, where the waiting times for transitions follow exponential distributions (Van Kampen NG, 1992).

      TX model: In contrast, the "TX" model is a more generalized telegraph model for transcriptional processes.

      Different from the CTM, the waiting times for state transitions between ON and OFF in the TX model follow arbitrary waiting time distributions. This implies that the future state of the system depends not only on the current state but may also be influenced by its historical trajectories. Consequently, the TX model exhibits non-Markovian behavior. We have added more detailed description on these two models in section 1.1 of supplementary text.

      Leaky transcription (in the OFF promoter state) is not considered. What would be the implications of its presence in the data?

      Thank the reviewer for pointing out the potential role of leaky transcription in our analysis. We acknowledge that leaky transcription, occurring in the promoter OFF state, was not explicitly considered in our current model. Our decision to exclude it assumed that the leaky transcription rate is relatively small and its impact on the observed data is negligible. This assumption is consistent with previous studies that similarly disregard leaky transcription in gene expression modeling due to its minimal contribution to the overall dynamics (Larsson AJM et al., 2019).

      However, we recognize that the leaky transcription should be considered, particularly in systems where the leaky rate is significant relative to the active transcription rate. In such cases, it may introduce additional variability to the observed expression levels or obscure the distinction between ON and OFF states. We have added relevant statements in the discussion section.

      In the main text, the waiting time for state transitions is described by two parameters, while in the methods/supplementary information only one parameter is considered per distribution (without a clear discussion of the so-called "dwell time distributions").

      Thank the reviewer for this comment. We recognize the need to clarify the discrepancy between the descriptions of waiting times in the main text and supplementary materials.

      Dwell time distribution refers to the probability distribution of the time in which a gene remains in a particular transcriptional state (ON or OFF) before transitioning to the other state. While in Markovian models the dwell time follows an exponential distribution, more complex or non-Markovian regulatory mechanisms may give rise to Gamma, Weibull, or other non-exponential dwell time distributions.

      In our model, we denote the dwell time distributions in the OFF and ON states by and , respectively, where w represents a vector of parameters characterizing the distribution, the dimensionality of which depends on the specific form of the distribution. For example, when an exponential distribution is assumed, w consists of a single rate parameter; in contrast, for distributions such as the Gamma or Weibull, w includes two parameters. In the main text, both and are modeled using Gamma distributions, whereas in the Supplementary Materials, we assume exponential distributions for both, resulting in a single-parameter representation. We have added relevant statements in the methods 1.2 section.

      Related, but more general, across the manuscript there are problems with the consistency in terminology. This is especially problematic with the figures. It makes it incredibly hard to follow the work. Better integration of the information, and consistency with the terminology, would improve the understanding for the reader.

      Thank the reviewer for the valuable feedback. To enhance clarity and readability, we have carefully revised the manuscript to ensure consistent terminology throughout the text and figures e.g., unifying terms such as "untreatment" and "control" under the consistent label "control"—across both the text and figures.

      One of the four main assumptions behind the model is that "the solution of the model can be explained by a mixed negative binomial distribution". The logic and implications of this assumption need to be discussed in the paper. (Methods, pp.13.) All four assumptions need to be carefully argued in the paper. 

      We appreciate the reviewer’s comment regarding the assumptions underlying our model. Here, we would like to clarify the rationale and implications of each assumption. 

      Assumption 1 (The gene expression of cells was in a stationary distribution during sequencing.) has been extensively used in previous studies for the inference and modeling of scRNA-seq data, demonstrating effectiveness in capturing mRNA expression distributions and inferring underlying dynamic parameters (Larsson AJM et al., 2019; Luo S et al., 2023; Ramsköld D et al., 2024; Gupta A et al., 2022).

      For Assumption 2 (Gene expression counts of the same cell type follow the same distribution.) is as follows: cell types are typically defined based on gene expression profiles or functional characteristics. Cells with similar functions often exhibit consistent transcriptional programs, leading to approximately identical gene expression distributions. This assumption has been widely adopted in previous research (Larsson AJM et al., 2019; Gupta A et al., 2022).

      Regarding Assumption 3 (The solution of the model can be approximated by a mixed negative binomial distribution.), in the most general formulation, a chemical master equation (CME) model of biological systems converges to a stationary distribution P(n;θ) over n∈ℕ. And P(n;θ) afford a real Poisson representation (Gardiner CW & Chaturvedi S, 1977): where F is a mixing cumulative distribution function (CDF). If such a Poisson representation exists, we can always write down a finite approximation over K Poisson kernels: , where w<sub>k</sub> are weights on a K-dimensional simplex. Further, as k →∞,QP . More problematically, convergence in the number of kernels in K is typically slow. Negative binomial kernels P<sub>Poisson</sub> (n m<sub> k</sub>,l<sub>k</sub>), which are continuous Poisson mixtures with a gamma mixing density can accelerate convergence in K (Gorin G et al., 2024). Hence, the solution of the TX model can be approximated by a mixed negative binomial distribution. 

      For Assumption 4 (The state space sampled from a sufficiently long single simulation is statistically equivalent to that obtained from multiple simulations at steady state in gene expression models.), when a sample trajectory of the model is simulated for a sufficiently long period, it is assumed to have traversed the entire stationary state space (Kuntz J et al., 2021). Therefore, by performing truncated statistical analysis on the trajectory, the corresponding stationary distribution of the model can be obtained. We have added the explanation in methods 1.1 section.

      The authors propose that the waiting times between promoter states follow a non-exponential distribution, but the choice of gamma distribution and the implications for the method and the biological conclusions need to be discussed.

      We thank the reviewer for this comment. To account for the impact of DNA damage on the transcription process, our model assumes that both the "ON" and "OFF" states of the promoter consist of multiple underlying sub-states. When a promoter switches from the "ON" state to the "OFF" state, the transition is governed by multiple distinct waiting time distributions that follow exponential distributions. Similarly, when a promoter switches from the "OFF" state to the "ON" state, there may be multiple transitions from different "OFF" sub-states. Consequently, the waiting times for the transitions from the "OFF" state to the "ON" state, and vice versa, must account for multiple exponential waiting time distributions associated with each "ON" state transition. We can map a multiple exponential-waiting-times reaction process to a single-step reaction process with a non-exponential waiting time distribution. Therefore, we use a Gamma distribution for dwell time of promoter switching, which can be expressed as the convolution of multiple exponential distributions (corresponding to a sum of multiple exponential variables). Additionally, other non-exponential distributions, such as those discussed in our previous studies (Zhang J & Zhou T, 2019), may also be considered, and we recognize that alternative choices could be made depending on the specific characteristics of the system. We have added the explanation in methods 1.2 section.

      BF - burst frequency; BS - burst size. These terms represent the main data output, but they are only mathematically defined in the methods, and never the intuition of the specific expression explained (e.g., why not using tON/(tON+tOFF) as BF instead of 1/(tON+tOFF), and why not kSYN*tON as BS instead of kSYN*tON).

      We appreciate the reviewer’s comment and agree that clarifying the biological intuition behind the mathematical definitions of burst frequency (BF) and burst size (BS) is important. Below, we provide a more detailed explanation of these definitions.

      BF: The definition of burst frequency we adopt has been widely used in previous literature, such as Benjamin Zoller et al (Zoller B et al., 2018), Caroline Hoppe et al (Hoppe C et al., 2020) and Daniel Ramsköld (Ramsköld D et al., 2024). And its quantity represents the average time it takes for the promoter to complete one full stochastic cycle between its active and inactive states.

      BS: The definition of burst size BS = we adopt is consistent with the definition proposed by the reviewer. Burst size refers to the average number of mRNA transcripts produced during a single transcriptional activation event of a gene. It reflects the quantity of gene product synthesized per activation and is influenced by the rate of transcription and the duration of the active state of the gene. Our definition aligns with this biological interpretation and is mathematically formulated as BS = , where k<sub>syn</sub> is the transcription rate and is the average duration of the active state.

      In addition, the mean transcription level can be analytically represented as the product of burst size and burst frequency. This analytical result has been included in the methods 1.2 section of revised manuscript.

      One can assume from the methods that omegaON and omegaOFF are the vector of (2) parameters describing the distribution, but the reader would benefit from some clarity here. The authors claim that they proved that the distribution moments can be obtained through an iterative process. How much does this rely on the assumption of an underlying binomial distribution?

      Thank the reviewer for this helpful suggestion. To clarify, the vectors omegaON and omegaOFF represent the parameters characterizing the waiting time distributions of the promoter's active and inactive states, respectively. The exact form and interpretation of these vectors depend on the specific distributional choice for the waiting times. For instance, when the waiting time distribution follows a Gamma distribution with shape parameter α>0  and scale parameter β>0 , denoted as , then w<sub>on</sub> = (α,β) . Conversely, when the waiting time distribution follows a Weibull distribution, denoted as , with shape parameter k >0 and scale parameter l>0, then w<sub>on</sub> = (l,k) . We have clarified it in the Methods 1.2 section of the revised manuscript.

      For the question about the binomial distribution, in our work, we use the binomial moment method to compute distributional statistics of chemical master equation (Zhang J et al., 2016). Binomial moments of the mRNA stationary distribution P(m) are defined as , where the symbol represents the combinatorial number. This technique refers to a mathematical tool for moment calculation and is not based on the assumption that the underlying distribution is binomial distribution (Luo S et al., 2023). Hence, our approach is general and does not require the distribution itself to follow a binomial form.

      More details about the parameter sampling are required. For instance, why are the specific ranges chosen and their implications? And is the space explored in logarithmic scale?  

      Thank the reviewer for the insightful comment regarding parameter sampling. In our study, we considered five parameters: . The parameters k<sub>off</sub>  and k<sub>on</sub> represent the number of intermediate reaction steps involved in transcriptional state transitions. These values were sampled uniformly from the range 1 to 15, which aligns with biological evidence indicating that most genes undergo either direct (single-step) transitions or a small number of intermediate steps, typically fewer than ten (Tunnacliffe E & Chubb JR, 2020). This range is sufficient to capture both widely used singlestep models and more detailed multi-step mechanisms without introducing biologically implausible complexity. 

      Among these parameters, r<sub>off</sub> and r<sub>on</sub> denote the rate constants governing stochastic transitions between the OFF and ON transcriptional states, respectively. The mean duration of the OFF state, which corresponds to the time between transcriptional bursts, is given by = k<sub>off</sub> / r<sub>off</sub> , and falls within the range ∈(0.1,150).Experimental measurements report a median value of approximately 3.7 (Gupta A et al., 2022), which is well contained within this range. Similarly, the mean duration of the ON state, referred to as the burst duration, is defined by = k<sub>on</sub> / r<sub>on</sub> , and spans the interval ∈(0.1,1500). The experimentally observed median value of 0.12 (Gupta A et al., 2022) confirms that the parameter range adequately captures biologically realistic dynamics.

      The parameter k<sub>syn</sub>  represents the normalized synthesis rate after accounting for molecular degradation. Its range was chosen based on empirical observations of transcriptional burst sizes, which typically vary from single molecules to several dozen (Gupta A et al., 2022). Considering the relationship BS = k<sub>syn</sub> * , the selected range of k<sub>syn</sub> ensures that the experimentally observed burst sizes are well represented within the defined parameter space. We have added the explanation in methods 1.2 section and supplementary text 4.

      We fully recognize the advantages of logarithmic sampling, particularly when parameters span several orders of magnitude. Logarithmic scaling ensures balanced exploration across wide ranges and prevents sampling bias towards larger values. However, in our work, we applied Sobol sampling directly within the original (linear) parameter space. Although we did not explicitly transform parameters into logarithmic scale, Sobol sequences provide low-discrepancy, quasi-random coverage, which promotes uniform sampling across bounded domains (Sobol IM, 1967). Further, if necessary, we can increase the parameter range adaptively, and perform simulation algorithm to obtain sample and train a new model to solve a larger parameter range. 

      On page 15, the rationale for selecting the parameter space is unclear. This is crucial, as fully connected neural networks typically exhibit poor extrapolation beyond their training parameter space. If the parameter space of an experimental dataset significantly differs from the training range, the inference results may become unreliable. We suggest further clarification on how the alignment between the parameter spaces of the experimental data and the training dataset can be ensured to maintain inference accuracy.

      We appreciate the reviewer’s insightful comment regarding the extrapolation limitations of fully connected neural networks. To address this concern, we have implemented a truncation strategy during inference, which constrains the inferred parameters to remain within the bounds of the training parameter space. This ensures that the neural network operates within a regime where its predictive accuracy has been validated, thereby enhancing the robustness of our results. Additionally, we have carefully selected the training parameter space to be reasonable, based on the characteristics of the experimental data. These ranges have been validated through domain knowledge and data analysis, ensuring that even when the experimental data approaches the boundaries of the training range, the inference results remain reliable and accurate.

      On page 16, it is unclear why the authors chose to incorporate the Fano factor instead of using the coefficient of variation or variance. Clarifying the reasoning behind the selection of the Fano factor over these other statistical measures would provide better insight into its relevance for their analysis.  

      We thank the reviewer for raising this point. Although the loss term is described using the Fano factor, its formulation actually involves both the variance and the mean. Specifically, the loss we use is: . We chose to use the Fano factor because it is particularly well-suited for quantifying transcriptional noise in systems where the mean expression level varies across conditions or parameters. Unlike variance, the Fano factor normalizes variability by the mean, making it more robust for comparing noise levels across genes or regulatory regimes with different expression levels. Compared to the coefficient of variation (CV), which normalizes by the square of the mean, the Fano factor tends to be less sensitive to low expression regimes and is commonly used in stochastic gene expression studies, especially when the distribution is skewed or over dispersed (i.e., variance exceeds the mean). This makes it a more appropriate metric in our context, where transcriptional bursting often leads to over dispersed expression distributions. We have added an explanation in the methods 1.3 of revised manuscript to explain this choice.

      On page 17, the definition of "sample" is unclear. Does it refer to the number of parameters sets or to the simulated trajectories generated by stochastic simulation algorithms?

      Thank reviewers for your valuable feedback. The term "sample" in this context refers to the data points used in the neural network training set. To eliminate any ambiguity, we included a precise mathematical definition of "sample" (θ<sub>i</sub>,P<sub>simulation,i</sub> ) in the methods 1.3 section of revised manuscript.

      Additionally, it is unclear how the authors determined the number of simulated trajectories per parameter set to ensure training accuracy. Furthermore, it would be relevant to address whether including moments during neural network training is beneficial.

      We appreciate the reviewer’s insightful questions regarding the simulation and training process. To clarify, for each parameter set, we did not simulate multiple trajectories to obtain the corresponding distribution. Instead, we simulated the system for a sufficiently long period to ensure that the system reached a steady-state distribution. From this steady-state data, we then used interpolation methods to derive the corresponding distribution for each parameter set.

      On the other hand, the moments were calculated theoretically without any approximations, providing higher accuracy. By incorporating the moments into the training process, we can effectively mitigate potential biases arising from insufficient sampling of the simulated data. Moreover, our experiments on the synthetic dataset demonstrate that introducing the moments as a loss function significantly enhances the model's performance on the test set (Figure 2E).

      What is the intuition behind the choice of alpha_cg? On page 18, the rationale for setting the sampling probability to 0.5 is unclear. Could this parameter be inferred rather than being preset?  

      We thank the reviewer for the insightful comment regarding the choice of α<sub>cg</sub>. We acknowledge that the typical values of this parameter in related literature often fall within a narrower range (e.g., 0.06–0.32) (Zheng GX et al., 2017; Macosko EZ et al., 2015). However, our decision to set α<sub>cg</sub> was based on a trade-off between sampling efficiency and computational tractability in our specific application context. While it is indeed possible to infer α<sub>cg</sub> as a learnable parameter, we opted for a fixed value in this work to reduce model complexity and avoid unidentifiability issues. In addition, we conducted inference under different capture efficiencies (0.5, 0.3, and 0.2), and found that the inferred burst size (BS) and burst frequency (BF) remained strongly correlated across these conditions (Supplementary Figure S12). This indicates that variations in capture efficiency do not significantly impact the outcomes of downstream enrichment analyses. Nevertheless, we agree that adaptively learning α<sub>cg</sub> could be a promising direction, and we plan to explore this in future work. We have added the explanation in methods 1.4 section.

      On page 19, the authors employed gradient descent for parameter inference. However, as this method is sensitive to initial values, it is unclear how the starting points were selected.

      We sincerely thank the reviewer for highlighting the sensitivity of gradient-based optimization methods to initial values. To address this concern, we adopted a black-box optimization strategy in the form of the adaptive differential evolution (DE) algorithm (Das S & Suganthan PN, 2010) to derive robust initial parameters for the parameter inference. The adaptive DE algorithm enables global exploration across a broad parameter space, thereby reducing the risk of convergence to suboptimal local minima. This yielded reasonably good initial estimates, which were subsequently refined using gradient-based optimization to identify high-quality solutions characterized by a vanishing gradient norm. This hybrid strategy, which combines global and local search, is widely adopted in optimization literature to alleviate the risk of entrapment in local optima (Ahandani MA et al., 2014). We have clarified this detail in the third result of the revised manuscript.

      Furthermore, clarification on how the gradients were computed - whether through finite difference approximation or other methods - would offer additional insight into the robustness and accuracy of their approach.

      Thank reviewers for valuable feedback. Regarding the computation of gradients, we use the chain rule in neural networks, and the gradients are computed through backpropagation. Specifically, we rely on automatic differentiation to efficiently calculate the gradients. Unlike finite difference approximation, automatic differentiation directly computes the derivative of the loss function with respect to each parameter, ensuring accurate gradient calculations (Baydin AG et al., 2018). We have clarified this detail in the discussion section of the revised manuscript.

      The paper presents several comparisons between continuous and discrete distributions in Figure 2B and Supplementary Figures S4, S6, and S8, described as a "comparison between mRNA distribution and inferred distribution by DeepTX for scRNA-seq data" or a "comparison between SSA results and DeepTX prediction results." This may lead to confusion for the reader, as the paper focuses on transcriptional bursting, a process where we would typically expect the distributions to be discrete. Clarifying this point would help align the figures with the main topic and enhance the reader's understanding.

      We sincerely thank the reviewer for this insightful comment. We understand the concern that the distributions shown in Figure 2B and Supplementary Figures S4, S6, and S8 may appear to be continuous, which could be confusing given that transcriptional bursting naturally results in discrete mRNA count distributions.

      We have clarified that in all these figures, both the empirical mRNA distributions derived from scRNAseq data and the model-predicted distributions from DeepTX are inherently discrete. To visualize the empirical distributions, we used histograms where the x-axis corresponds to discrete mRNA copy numbers and the y-axis represents the normalized frequency (density). To illustrate the DeepTX-inferred probability mass function, we plotted the predicted probabilities at each integer count as points and connected them with lines for clarity. While the connecting lines give the appearance of continuity, this is a standard graphical convention used to better show trends and model fit in discrete distributions.

      We suggest that Figure 3E could present the error as a percentage of the parameter value, as this would provide a more equitable comparison and better illustrate the relative accuracy of the parameter estimation.

      Thank reviewers for suggestion regarding Figure 3E. We agree that presenting the error as a percentage of the parameter value would offer a more equitable basis for comparison and better highlight the relative accuracy of our parameter estimation. Accordingly, we have revised Figure 3E to include the relative percentage error for each parameter.

      Figure 4A could be improved for better legibility. The contour plots are somewhat confusing, and the light blue points are difficult to distinguish. Additionally, the x-axis label "Untreatment" appears throughout the manuscript-could this term be referring to the control experiment?

      Thank reviewers for constructive feedback. We have revised Figure 4A to improve its clarity and legibility. Specifically, we adjusted the display style of the contour plots and enhanced the visibility of the light green points to make them more distinguishable.

      Additionally, we recognize the potential confusion caused by the term "Untreatment" and have replaced it with "Control" throughout the revised manuscript to ensure consistency and accuracy in terminology.

      Figure 4B was unclear, and further explanation would be helpful for understanding its purpose.

      Thank reviewers for feedback. The purpose of Figure 4B is to illustrate the relationship between bursting kinetics and the mean and variance of the model. In the revised manuscript, we will provide a more detailed explanation of how the figure captures these relationships, highlighting the key insights it offers into the underlying dynamics.

      Figure 4B illustrates the quantitative relationships among BS, BF, and gene expression noise within the framework of the transcriptional model. In this log-log-log 3D space, the mean expression level is constrained on a blue plane defined by the equation log(BS)+log(BF) = log(Mean), highlighting that the product of burst size and burst frequency determines the mean expression level. The orange plane represents a scaling relationship between expression noise and burst kinetics, expressed as log(BS)+log(BF) = klog(Noise), where k is a constant indicating how the burst kinetics co-vary with noise. Notably, the trajectory of the green sphere demonstrates that, under a fixed mean expression level (i.e., remaining on the blue plane), an increase in gene expression noise arises primarily from an increase in burst size. We have revised the caption of Figure 4B.

      In Figure 4D, two of the GO analysis terms are highlighted in red, but the meaning behind this emphasis is not clear. The same question applies to Figure 5E, where the green dots are missing from the plot.

      Clarification on these points would enhance the overall clarity.  

      We appreciate the reviewer’s thoughtful comments. We have added further clarification regarding the enrichment analysis results presented in Figure 4D. Specifically, we highlighted the "cell cycle G2/M phase transition" pathway because a delay in the G2/M phase transition has been shown to increase the probability of cell differentiation, which is a key aspect of our study. In addition, since IdU treatment is known to induce DNA damage, we emphasized the DNA damage-related pathway to support the biological relevance and consistency of our enrichment results. Similarly, in Figure 5E, we highlighted the apoptosis-related pathway. Apoptosis in this context is closely associated with cellular responses to toxic substances and mitochondrial dynamics. The enrichment of pathways related to these processes enables us to hypothesize the underlying mechanisms driving apoptosis in our system. Further, the absence of green dots in Figure 5E was due to an error in the figure caption. We have revised the figure caption accordingly to accurately describe all elements presented in the figure.

      Clarify axis labels in figures, particularly the y-axis in Figure 5A and the x-axis in Figure 6G. In the first case, it isn't clear what this "value" represents. In the second case, the x-label is very confusing. As I understand the figure description, in these plots you are always comparing the G0 arrested genes between control and treated cells. But the x-label says "G0 (0 D)", "Cycle (50 D)".

      Thank reviewers for pointing out the issues with the axis labels. We have made the necessary revisions to eliminate any confusion. In Figure 5A, the label for the y-axis has been changed from "value" to "log2 (value)" for clarity. The “value” in y-axis represents the value of statistical measure indicated at top of each panel. In Figure 6G, the x-axis label "Cycle (50 D)" has been updated to "G0 (50 D)" to accurately reflect the comparison between the G0-arrested genes in control and treated cells. We have revised the text of Figure 5A and Figure 6G.

      Figure 6 uses a QS metric (quality score), but the definition of this metric is not provided. Including a brief explanation of its meaning would be helpful for clarity.  

      Thank reviewers for feedback. In this version, we provided explanation of the QS (Quality Score) metric in the supplementary text 3 for better clarity. The QS is calculated based on the difference in z-scores derived from GSVA (Gene set variation analysis) of gene sets upregulated and downregulated during the quiescent phase, and is defined as QS = z(up genes)− z(down genes) , as described in the literature (Wiecek AJ et al., 2023). z(up genes) represents the standardized enrichment score of the gene set upregulated during quiescence in each sample. A higher value indicates that the quiescenceassociated upregulated genes are actively expressed, suggesting that the sample is more likely to be in a quiescent (G0) state. z(down genes)  corresponds to the standardized enrichment score of genes downregulated during quiescence. A lower value implies effective suppression of these genes, which is also consistent with quiescence. The difference score QS serves as an integrated indicator of the quiescent state: A higher value reflects simultaneous activation of quiescence-associated upregulated genes and repression of downregulated genes, indicating a gene expression profile that strongly aligns with the G0/quiescent state. A lower or negative value suggests a deviation from the quiescent signature, potentially reflecting a proliferative state or failure to enter quiescence. 

      In Figure 6G, light grey lines are shown, but their significance is unclear. It would be useful to specify what these lines represent.

      Thank reviewers for observation. In Figure 6G, each point represents a single gene, and the light grey lines indicate the trend of changes in the corresponding bursting kinetics values, mean and variance for genes. We have added the explanation in the caption of Figure 6G.

      Additionally, the manuscript should include references to the specific pathways used in the GO analysis to provide more context for the reader.

      Thank reviewers for the suggestion. We have included references to the specific pathways used in the GO analysis in the revised manuscript to provide additional context for the readers.

      In the discussion, sentences like "IdU drug treatment-induced BS enhancement delays the cell mitosis phase transition, impacting cell reprogramming and differentiation" are problematic as they imply causality, which I believe cannot be determined through the present analysis. The strength of the conclusions needs to be better argued (or toned down).

      We acknowledge that the original sentence lacked precision and may have conveyed a misleading implication of causality not fully supported by our current analysis. In the discussion section of revised manuscript, we have rephrased the statement to present a more nuanced interpretation: IdU drug treatment-induced BS enhancement of genes may be associated with a delayed transition in the cell mitosis phase, which could potentially influence cell reprogramming and differentiation.  

      Other (minor) comments:

      On pp. 10, "the BS down-regulates differential genes were mainly enriched..." appears to have a grammatical error/typo, "down-regulated"?

      We have made correction. We have revised “down-regulates” to “down-regulated” for grammatical consistency.

      Equation 2 doesn't match Figure 1A.

      We have made correction. The definition of BF = in Equation 2 is correct. We have revised the definition of BF in Figure 1A to ensure consistency with Equation 2.

      Reference

      Zheng, G.X., Terry, J.M., Belgrader, P., Ryvkin, P., Bent, Z.W., Wilson, R., Ziraldo, S.B., Wheeler, T.D., McDermott, G.P., Zhu, J., Gregory, M.T., Shuga, J., Montesclaros, L., Underwood, J.G., Masquelier, D.A., Nishimura, S.Y., Schnall-Levin, M., Wyatt, P.W., Hindson, C.M., Bharadwaj, R., Wong, A., Ness, K.D., Beppu, L.W., Deeg, H.J., McFarland, C., Loeb, K.R., Valente, W.J., Ericson, N.G., Stevens, E.A., Radich, J.P., Mikkelsen, T.S., Hindson, B.J., Bielas, J.H. 2017. Massively parallel digital transcriptional profiling of single cells. Nature Communications 8: 14049. DOI: https://dx.doi.org/10.1038/ncomms14049, PMID: 28091601

      Hagemann-Jensen, M., Ziegenhain, C., Chen, P., Ramsköld, D., Hendriks, G.J., Larsson, A.J.M., Faridani, O.R., Sandberg, R. 2020. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nature Biotechnology 38: 708714. DOI: https://dx.doi.org/10.1038/s41587-020-0497-0, PMID: 32518404

      Larsson, A.J.M., Johnsson, P., Hagemann-Jensen, M., Hartmanis, L., Faridani, O.R., Reinius, B., Segerstolpe, A., Rivera, C.M., Ren, B., Sandberg, R. 2019. Genomic encoding of transcriptional burst kinetics. Nature 565: 251-254. DOI: https://dx.doi.org/10.1038/s41586-018-0836-1, PMID: 30602787

      Ochiai, H., Hayashi, T., Umeda, M., Yoshimura, M., Harada, A., Shimizu, Y., Nakano, K., Saitoh, N., Liu, Z., Yamamoto, T., Okamura, T., Ohkawa, Y., Kimura, H., Nikaido, I. 2020. Genome-wide kinetic properties of transcriptional bursting in mouse embryonic stem cells. Science Advances 6: eaaz6699. DOI: https://dx.doi.org/10.1126/sciadv.aaz6699, PMID: 32596448

      Luo, S., Wang, Z., Zhang, Z., Zhou, T., Zhang, J. 2023. Genome-wide inference reveals that feedback regulations constrain promoter-dependent transcriptional burst kinetics. Nucleic Acids Research 51: 68-83. DOI: https://dx.doi.org/10.1093/nar/gkac1204, PMID: 36583343

      Rodriguez, J., Ren, G., Day, C.R., Zhao, K., Chow, C.C., Larson, D.R. 2019. Intrinsic dynamics of a human gene reveal the basis of expression heterogeneity. Cell 176: 213-226.e218. DOI: https://dx.doi.org/10.1016/j.cell.2018.11.026, PMID: 30554876

      Luo, S., Zhang, Z., Wang, Z., Yang, X., Chen, X., Zhou, T., Zhang, J. 2023. Inferring transcriptional bursting kinetics from single-cell snapshot data using a generalized telegraph model. Royal Society Open Science 10: 221057. DOI: https://dx.doi.org/10.1098/rsos.221057, PMID: 37035293

      Eling, N., Morgan, M.D., Marioni, J.C. 2019. Challenges in measuring and understanding biological noise. Nature Reviews Genetics 20: 536-548. DOI: https://dx.doi.org/10.1038/s41576-019-0130-6, PMID: 31114032

      Tunnacliffe, E., Chubb, J.R. 2020. What is a transcriptional burst? Trends in Genetics 36: 288-297. DOI: https://dx.doi.org/10.1016/j.tig.2020.01.003, PMID: 32035656

      Rodriguez, J., Larson, D.R. 2020. Transcription in living Cells: molecular mechanisms of bursting. Annual Review of Biochemistry 89: 189-212. DOI: https://dx.doi.org/10.1146/annurev-biochem-011520-105250, PMID: 32208766

      Morgan, M.D., Marioni, J.C. 2018. CpG island composition differences are a source of gene expression noise indicative of promoter responsiveness. Genome Biology 19: 81. DOI: https://dx.doi.org/10.1186/s13059-018-1461-x, PMID: 29945659

      Raj, A., van Oudenaarden, A. 2008. Nature, nurture, or chance: stochastic gene expression and its consequences. Cell 135: 216-226. DOI: https://dx.doi.org/10.1016/j.cell.2008.09.050, PMID: 18957198

      Trzaskoma, P., Jung, S., Pękowska, A., Bohrer, C.H., Wang, X., Naz, F., Dell’Orso, S., Dubois, W.D., Olivera, A., Vartak, S.V. 2024. 3D chromatin architecture, BRD4, and Mediator have distinct roles in regulating genome-wide transcriptional bursting and gene network. Science Advances 10: eadl4893. DOI: https://dx.doi.org/https://www.science.org/doi/10.1126/sciadv.adl4893, PMID: 

      Browning, A.P., Warne, D.J., Burrage, K., Baker, R.E., Simpson, M.J. 2020. Identifiability analysis for stochastic differential equation models in systems biology. Journal of the Royal Society Interface 17: 20200652. DOI: https://dx.doi.org/10.1098/rsif.2020.0652, PMID: 33323054

      Zoller, B., Little, S.C., Gregor, T. 2018. Diverse spatial expression patterns emerge from unified kinetics of transcriptional bursting. Cell 175: 835-847.e825. DOI: https://dx.doi.org/10.1016/j.cell.2018.09.056, PMID: 30340044

      Hoppe, C., Bowles, J.R., Minchington, T.G., Sutcliffe, C., Upadhyai, P., Rattray, M., Ashe, H.L. 2020. Modulation of the promoter activation rate dictates the transcriptional response to graded BMP signaling levels in the drosophila embryo. Dev Cell 54: 727-741.e727. DOI: https://dx.doi.org/10.1016/j.devcel.2020.07.007, PMID: 32758422

      Ramsköld, D., Hendriks, G.J., Larsson, A.J.M., Mayr, J.V., Ziegenhain, C., Hagemann-Jensen, M., Hartmanis, L., Sandberg, R. 2024. Single-cell new RNA sequencing reveals principles of transcription at the resolution of individual bursts. Nature Cell Biology 26: 1725-1733. DOI: https://dx.doi.org/10.1038/s41556-024-01486-9, PMID: 39198695 Van Kampen, N.G. 1992. Stochastic Processes in Physics and Chemistry. Elsevier.

      Gupta, A., Martin-Rufino, J.D., Jones, T.R., Subramanian, V., Qiu, X., Grody, E.I., Bloemendal, A., Weng, C., Niu, S.Y., Min, K.H., Mehta, A., Zhang, K., Siraj, L., Al' Khafaji, A., Sankaran, V.G., Raychaudhuri, S., Cleary, B., Grossman, S., Lander, E.S. 2022. Inferring gene regulation from stochastic transcriptional variation across single cells at steady state. Proceedings of the National Academy of Sciences 119: e2207392119. DOI: https://dx.doi.org/10.1073/pnas.2207392119, PMID: 35969771

      Gardiner, C.W., Chaturvedi, S. 1977. The Poisson representation. I. A new technique for chemical master equations. Journal of Statistical Physics 17: 429-468. DOI: https://dx.doi.org/https://doi.org/10.1007/BF01014349, PMID: 

      Gorin, G., Carilli, M., Chari, T., Pachter, L. 2024. Spectral neural approximations for models of transcriptional dynamics. Biophysical Journal 123: 2892-2901. DOI: https://dx.doi.org/10.1016/j.bpj.2024.04.034, PMID: 38715358

      Kuntz, J., Thomas, P., Stan, G.-B., Barahona, M. 2021. Stationary distributions of continuous-time Markov chains: a review of theory and truncation-based approximations. SIAM Review 63: 3-64. DOI: 

      Zhang, J., Zhou, T. 2019. Computation of stationary distributions in stochastic models of cellular processes with molecular memory. bioRxiv: 521575. DOI: https://dx.doi.org/https://doi.org/10.1101/521575, PMID: 

      Zhang, J., Nie, Q., Zhou, T. 2016. A moment-convergence method for stochastic analysis of biochemical reaction networks. The Journal of chemical physics 144. DOI: 

      Sobol, I.M. 1967. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Comput. Math. Math. Phys. 7: 784-802. DOI: https://dx.doi.org/10.1016/0041-5553(67)90144-9, PMID: 

      Macosko, E.Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A.R., Kamitaki, N., Martersteck, E.M., Trombetta, J.J., Weitz, D.A., Sanes, J.R., Shalek, A.K., Regev, A., McCarroll, S.A. 2015. Highly parallel genome-wide expression profiling of individual cells using nanoliter dsroplets. Cell 161: 1202-1214. DOI: https://dx.doi.org/10.1016/j.cell.2015.05.002, PMID: 26000488

      Das, S., Suganthan, P.N. 2010. Differential evolution: A survey of the state-of-the-art. IEEE transactions on evolutionary computation 15: 4-31. DOI: https://dx.doi.org/10.1109/TEVC.2010.2059031, PMID: 

      Ahandani, M.A., Vakil-Baghmisheh, M.-T., Talebi, M. 2014. Hybridizing local search algorithms for global optimization. Computational Optimization and Applications 59: 725-748. DOI: https://dx.doi.org/https://doi.org/10.1007/s10589014-9652-1, PMID: 

      Baydin, A.G., Pearlmutter, B.A., Radul, A.A., Siskind, J.M. 2018. Automatic differentiation in machine learning: a survey. Journal of machine learning research 18: 1-43. DOI: https://dx.doi.org/https://dl.acm.org/doi/abs/10.5555/3122009.3242010, PMID: 

      Wiecek, A.J., Cutty, S.J., Kornai, D., Parreno-Centeno, M., Gourmet, L.E., Tagliazucchi, G.M., Jacobson, D.H., Zhang, P., Xiong, L., Bond, G.L., Barr, A.R., Secrier, M. 2023. Genomic hallmarks and therapeutic implications of G0 cell cycle arrest in cancer. Genome Biology 24: 128. DOI: https://dx.doi.org/10.1186/s13059-023-02963-4, PMID: 37221612

    1. Author response:

      The following is the authors’ response to the original reviews

      We thank the three reviewers for their insightful feedback. We look forward to addressing the raised concerns in a revised version of the manuscript. There were a few common themes among the reviews that we will briefly touch upon now, and we will provide more details in the revised manuscript. 

      First, the reviewers asked for the reasoning behind the task ratios we implemented for the different attentional width conditions. The different ratios were selected to be as similar as possible given the size and spacing of our stimuli (aside from the narrowest cue width of one bin, the ratios for the others were 0.66, .6 and .66). As Figure 1b shows, while the ratios were similar, task difficulty is not constant across cue widths: spreading attention makes the task more difficult generally. But, while the modeled width of the spatial distribution of attention changes monotonically with cue width, task difficulty does not. Furthermore, prior work has indicated that there is a relationship between task difficulty and the overall magnitude of the BOLD response, however we don’t suspect that this will influence the width of the modulation. How task difficulty influences the BOLD response is an important topic, and we hope that future work will investigate this relationship more directly.   

      Second, reviewers raised interest in the distribution of spatial attention in higher visual areas. In our study we focus only on early visual regions (V1-V3). This was primarily driven by pragmatic considerations, in that we only have retinotopic estimates for our participants in these early visual areas. Our modeling approach is dependent on having access to the population receptive field estimates for all voxels, and while the main experiment was scanned using whole brain coverage, retinotopy was measured in a separate session using a field of view only covering the occipital cortex.  

      Lastly, we appreciate the opportunity to clarify the purpose of the temporal interval analysis. The reviewer is correct in assuming we set out to test how much data is needed to recover the cortical modulation and how dynamic a signal the method can capture. This analysis does show that more data provides more reliable estimates, though the model was still able to recover the location and width of the attentional cue at shorter timescales of as few as two TRs. This has implications for future studies that may involve more dynamic tracking of the attentional field.

      Public Reviews

      Reviewer #1 (Public review): 

      The authors conducted an fMRI study to investigate the neural effects of sustaining attention to areas of different sizes. Participants were instructed to attend to alphanumeric characters arranged in a circular array. The size of attention field was manipulated in four levels, ranging from small (18 deg) to large (162 deg). They used a model-based method to visualize attentional modulation in early visual cortex V1 to V3, and found spatially congruent modulations of the BOLD response, i.e., as the attended area increased in size, the neural modulation also increased in size in the visual cortex. They suggest that this result is a neural manifestation of the zoomlens model of attention and that the model-based method can effectively reconstruct the neural modulation in the cortical space. 

      The study is well-designed with sophisticated and comprehensive data analysis. The results are robust and show strong support for a well-known model of spatial attention, the zoom-lens model. Overall, I find the results interesting and useful for the field of visual attention research. I have questions about some aspects of the results and analysis as well as the bigger picture. 

      (1) It appears that the modulation in V1 is weaker than V2 and V3 (Fig 2). In particular, the width modulation in V1 is not statistically significant (Fig 5). This result seems a bit unexpected. Given the known RF properties of neurons in these areas, in particular, smaller RF in V1, one might expect more spatially sensitive modulation in V1 than V2/V3. Some explanations and discussions would be helpful. Relatedly, one would also naturally wonder if this method can be applied to other extrastriate visual areas such as V4 and what the results look like. 

      We agree with the reviewer. It’s very interesting how the spatial resolution within different visual regions contributes to the overall modulation of the attentional field, and how this in turn would influence perception. Our data showed that fits in V1 appeared to be less precise than in V2 and V3. This can be seen in the goodness of fit of the model as well as the gain and absolute angular error estimates. The goodness of fit and gain were lowest in V1 and the absolute angular error was largest in V1 (see Figure 5). We speculate that the finer spatial granularity of V1 RFs was countered by a lower amplitude and SNR of attention-related modulation in V1, resulting in overall lower sensitivity to variation in attentional field width. Prior findings concur that the magnitude of covert spatial attention increases when moving from striate to extrastriate cortex (Bressler & Silver (2010); Buracas & Boynton (2007)). Notably, in our perception condition, V1 showed more spatially sensitive modulation (see Figure 7), consistent with the known RF properties of V1 neurons.

      Regarding the second point: unfortunately, our dataset did not allow us to explore higherorder cortical regions with the model-based approach. While the main experiment was scanned using a sequence with whole brain coverage, the pRF estimates came from a separate scanning session which only had limited occipital coverage. Our modeling approach is dependent on the polar angle estimates from this pRF session. We now explicitly state this limitation in the methods (lines 87-89):

      “In this session, the field of view was restricted to the occipital cortex to maximize SNR, thereby limiting the brain regions for which we had pRF estimates to V1, V2, and V3.”

      (2) I'm a bit confused about the angular error result. Fig 4 shows that the mean angular error is close to zero, but Fig 5 reports these values to be about 30-40 deg. Why the big discrepancy? Is it due to the latter reporting absolute errors? It seems reporting the overall bias is more useful than absolute value. 

      The reviewer’s inference here is exactly right: Figure 4 shows signed error, whereas Figure 5 shows absolute error. We show the signed error for the example participant because, (1) by presenting the full distribution of model estimates for one participant, readers have access to a more direct representation of the data, and (2) at the individual level it is possible to examine potential directional biases in the location estimates (which do not appear to be present). As we don’t suspect a consistent directional bias across the group, we believe the absolute error in location estimates is more informative in depicting the precision in location estimates using the model-based approach. In the revised manuscript, we modified Figure 5 to make the example participant’s data visually distinct for easy comparison. We have clarified this reasoning in the text (results lines 59-64):

      “The angular error distribution across blocks, separated by width condition, is shown in Figure 4 for one example participant to display block-to-block variation. The model reliably captured the location of the attentional field with low angular error and with no systematic directional bias. This result was observed across participants. We next examined the absolute angular error to assess the overall accuracy of our estimates.”

      (3) A significant effect is reported for amplitude in V3 (line 78), but the graph in Fig 5 shows hardly any difference. Please confirm the finding and also explain the directionality of the effect if there is indeed one. 

      We realize that the y-axis scale of Figure 5 was making it difficult to see that gain decreases with cue width in area V3. Instead of keeping the y-axis limits the same across visual regions, we now adapt the y-axis scale of each subplot to the range of data values:  

      We now also add the direction of the effect in the text (results lines 83-86):

      “We observed no significant relationship between gain and cue width in V1 and V2 (V1 t(7)=.54, p=.605; V2 t(7)=-2.19, p=.065), though we did find a significant effect in V3 illustrating that gain decreases with cue width (t(7)=-3.12, p=.017).”

      (4) The purpose of the temporal interval analysis is rather unclear. I assume it has to do with how much data is needed to recover the cortical modulation and hence how dynamic a signal the method can capture. While the results make sense (i.e., more data is better), there is no obvious conclusion and/or interpretation of its meaning. 

      We apologize for not making our reasoning clear. We now emphasize our reasoning in the revised manuscript (results lines 110-112). Our objective was to quantify how much data was needed to recover the dynamic signal. As expected, we found that including more data reduces noise (averaging helps), but importantly, we found that we still obtained meaningful model fits even with limited data. We believe this has important implications for future paradigms that explore more dynamic deployment of spatial attention, where one would not want to average over multiple repetitions of a condition.

      The first paragraph of the Temporal Interval Analysis section in the results now reads: 

      “In the previous analyses, we leveraged the fact that the attentional cue remained constant for 5-trial blocks (spatial profiles were computed by averaging BOLD measurements across a block of 10 TRs). We next examined the degree to which we were able to recover the attentional field on a moment-by-moment (TR-by-TR) basis. To do this, we systematically adjusted the number of TRs that contributed to the averaged spatial response profile. To maintain a constant number of observations across the temporal interval conditions, we randomly sampled a subset of TRs from each block. This allowed us to determine the amount of data needed to recover the attentional field, with a goal of examining the usability of our modeling approach in future paradigms involving more dynamic deployment of spatial attention.”

      (5) I think it would be useful for the authors to make a more explicit connection to previous studies in this literature. In particular, two studies seem particularly relevant. First, how do the present results relate to those in Muller et al (2003, reference 37), which also found a zoom-lens type of neural effects. Second, how does the present method compare with spatial encoding model in Sprague & Serences (2013, reference 56), which also reconstructs the neural modulation of spatial attention. More discussions of these studies will help put the current study in the larger context.

      We now make a more explicit connection to prior work in the discussion section (lines 34-54). 

      “We introduced a novel modeling approach that recovered the location and the size of the attentional field. Our data show that the estimated spatial spread of attentional modulation (as indicated by the recovered FWHM) consistently broadened with the cue width, replicating prior work (Müller et al., 2003; Herrmann et al., 2010). Our results go beyond prior work by linking the spatial profiles to pRF estimates, allowing us to quantify the spread of both attentional and perceptual modulation in degrees of polar angle. Interestingly, the FWHM estimates for the attentional and perceptual spatial profiles were highly similar. Additionally, for area V3 we replicate that the population response magnitude decreased with cue width (Müller et al., 2003; Feldmann-Wüstefeld and Awh, 2020). One innovation of our method is that it directly reconstructs attention-driven modulations of responses in visual cortex, setting it apart from other methods, such as inverted encoding models (e.g. Sprague & Serences, 2013). Finally, we demonstrated that our method has potential to be used in more dynamic settings, in which changes in the attentional field need to be tracked on a shorter timescale.”

      (6) Fig 4b, referenced on line 123, does not exist. 

      We have corrected the text to reference the appropriate figure (Figure 5, results line 136).

      Reviewer #2 (Public review):

      Summary: 

      The study in question utilizes functional magnetic resonance imaging (fMRI) to dynamically estimate the locus and extent of covert spatial attention from visuocortical activity. The authors aim to address an important gap in our understanding of how the size of the attentional field is represented within the visual cortex. They present a novel paradigm that allows for the estimation of the spatial tuning of the attentional field and demonstrate the ability to reliably recover both the location and width of the attentional field based on BOLD responses. 

      Strengths: 

      (1) Innovative Paradigm: The development of a new approach to estimate the spatial tuning of the attentional field is a significant strength of this study. It provides a fresh perspective on how spatial attention modulates visual perception. 

      (2) Refined fMRI Analysis: The use of fMRI to track the spatial tuning of the attentional field across different visual regions is methodologically rigorous and provides valuable insights into the neural mechanisms underlying attentional modulation. 

      (3) Clear Presentation: The manuscript is well-organized, and the results are presented clearly, which aids in the reader's comprehension of the complex data and analyses involved. 

      We thank the reviewer for summarizing the strengths in our work. 

      Weaknesses: 

      (1) Lack of Neutral Cue Condition: The study does not include a neutral cue condition where the cue width spans 360°, which could serve as a valuable baseline for assessing the BOLD response enhancements and diminishments in both attended and non-attended areas. 

      We do not think that the lack of a neutral cue condition substantially limits our ability to address the core questions of interest in the present work. We set out to estimate the locus and the spread of covert spatial attention. By definition, a neutral cue does not have a focus of attention as the whole annulus becomes task relevant. We agree with the reviewer that how spatial attention influences the magnitude of the BOLD response is still not well defined; i.e., does attending a location multiplicatively enhance responses at an attended location or does it instead act to suppress responses outside the focus of attention? A neutral cue condition would be necessary to be able to explore these types of questions. However, our findings don’t rest on any assumptions about this. Instead, we quantify the attentional modulation with a model-based approach and show that we can reliably recover its locus, and reveal a broadening in the attentional modulation with wider cues. 

      We realize that throughout the original manuscript we often used the term ‘attentional enhancement,’ which might inadvertently specify an increase with respect to a neutral condition. To be more agnostic to the directionality of the effect, we have changed this to ‘attentional modulation’ and ‘attentional gain’ throughout the manuscript. Additionally, we have added results and visualizations for the baseline parameter to all results figures (Figures 4-7) to help readers further interpret our findings.  

      (2) Clarity on Task Difficulty Ratios: The explicit reasoning for the chosen letter-to-number ratios for various cue widths is not detailed. Ensuring clarity on these ratios is crucial, as it affects the task difficulty and the comparability of behavioral performance across different cue widths. It is essential that observed differences in behavior and BOLD signals are attributable solely to changes in cue width and not confounded by variations in task difficulty.  

      The ratios were selected to be as similar as possible given the size and spacing of our stimuli (aside from the narrowest cue width of one bin, the proportions for the others were 0.67, 0.60, and 0.67). We have updated the methods section to state this explicitly (methods lines 36-38): 

      “The ratios were selected to be as similar as possible given the size and spacing of our stimuli (aside from the one-bin cue, the proportions for the other cues were 0.67, 0.60, 0.67).”

      As Figure 1b shows, task accuracy showed small and non-monotonic changes across the three larger cue widths, dissociable from the monotonic pattern seen for the modelestimated width of the attentional field. Furthermore, as prior work has indicated that there is a relationship between task difficulty and the overall magnitude of the BOLD response (e.g., Ress, Backus & Heeger, 2000), we would primarily expect effects of task difficulty on the gain or baseline rather than the width. How exactly task difficulty influences the BOLD response and whether this would, in fact, interact with the width of the attentional field is an important topic, and we hope that future work will investigate this relationship more directly.  

      We have clarified these points within the text, and now explicitly motivate future work looking at these important interactions (discussion lines 57-67):

      “The observed effects of attentional field width were unlikely to be directly attributable to variation in task difficulty. Participants' task in our study was to discriminate whether more numbers or more letters were presented within a cued region of an iso-eccentric annulus of white noise. For our different cue widths, the ratios of numbers and letters were selected to be as similar as possible given the size and spacing of our stimuli. Changes in accuracy across the three larger cue widths were small and non-monotonic, implying task difficulty was dissociable from width per se. This dissociation bolsters the interpretability of our model fits; nevertheless, future work should further investigate how task difficulty interacts with the spread of the attentional field and the amplitude of attention-related BOLD effects (cf. Ress, Backus & Heeger, 2000).”

      Reviewer #3 (Public review):

      Summary: 

      In this report, the authors tested how manipulating the contiguous set of stimuli on the screen that should be used to guide behavior - that is, the scope of visual spatial attention - impacts the magnitude and profile of well-established attentional enhancements in visual retinotopic cortex. During fMRI scanning, participants attended to a cued section of the screen for blocks of trials and performed a letter vs digit discrimination task at each attended location (and judged whether the majority of characters were letters/digits). Importantly, the visual stimulus was identical across attention conditions, so any observed response modulations are due to topdown task demands rather than visual input. The authors employ population receptive field (pRF) models, which are used to sort voxel activation with respect to the location and scope of spatial attention and fit a Gaussian-like function to the profile of attentional enhancement from each region and condition. The authors find that attending to a broader region of space expands the profile of attentional enhancement across the cortex (with a larger effect in higher visual areas), but does not strongly impact the magnitude of this enhancement, such that each attended stimulus is enhanced to a similar degree. Interestingly, these modulations, overall, mimic changes in response properties caused by changes to the stimulus itself (increase in contrast matching the attended location in the primary experiment). The finding that attentional enhancement primarily broadens, but does not substantially weaken in most regions, is an important addition to our understanding of the impact of distributed attention on neural responses, and will provide meaningful constraints to neural models of attentional enhancement. 

      Strengths: 

      (1) Well-designed manipulations (changing location and scope of spatial attention), and careful retinotopic/pRF mapping, allow for a robust assay of the spatial profile of attentional enhancement, which has not been carefully measured in previous studies.

      (2) Results are overall clear, especially concerning width of the spatial region of attentional enhancement, and lack of clear and consistent evidence for reduction in the amplitude of enhancement profile.

      (3) Model-fitting to characterize spatial scope of enhancement improves interpretability of findings.

      We thank the reviewer for highlighting the strengths of our study. 

      Weaknesses: 

      (1) Task difficulty seems to vary as a function of spatial scope of attention, with varying ratios of letters/digits across spatial scope conditions, which may complicate interpretations of neural modulation results  

      The reviewer is correct in observing that task accuracy varied across cue widths. Though we selected the task ratios to be as similar as possible given the size and spacing of our stimuli (aside from the narrowest cue width of one bin, the proportions for the others were 0.67, 0.60, and 0.67), behavioral accuracy across the three larger cue widths was not identical. Prior research has shown that there is a relationship between task difficulty and the overall magnitude of the BOLD response (e.g., Ress, Backus & Heeger, 2000). Thus, we would primarily expect effects of task difficulty on gain rather than width. How task difficulty influences the BOLD response and whether this would, in fact, interact with the width of the attentional field is an important topic, and we hope that future work will investigate this relationship more directly.  

      To clarify these points and highlight the potential for future work looking at these important interactions, we added the following text to the discussion section (discussion lines 57-67):

      “The observed effects of attentional field width were unlikely to be directly attributable to variation in task difficulty. Participants' task in our study was to discriminate whether more numbers or more letters were presented within a cued region of an iso-eccentric annulus of white noise. For our different cue widths, the ratios of numbers and letters were selected to be as similar as possible given the size and spacing of our stimuli. Changes in accuracy across the three larger cue widths were small and non-monotonic, implying task difficulty was dissociable from width per se. This dissociation bolsters the interpretability of our model fits; nevertheless, future work should further investigate how task difficulty interacts with the spread of the attentional field and the amplitude of attention-related BOLD effects (cf. Ress, Backus and Heeger, 2000).”

      (2) Some aspects of analysis/data sorting are unclear (e.g., how are voxels selected for analyses?) 

      We apologize for not describing our voxel selection in sufficient detail. Some of the questions raised in the private comments are closely related to this point, we therefore aim to clarify all concerns below:

      - Voxel selection: To select voxels that contribute to the 1D spatial profiles, we relied on the independent pRF dataset. We first defined some general requirements that needed to be met. Specifically, 1) the goodness of fit (R<sup>2</sup>) of the pRF fits needed to be greater than 10%; 2) the estimated eccentricity had to fall within [0.7 9.1] degree eccentricity (to exclude voxels in the fovea and voxels with estimated eccentricities larger than the pRF mapping stimulus); 3) the estimated size must be greater than 0.01 degree visual angle. 

      Next, we included only voxels whose pRF overlapped with the white noise annulus. Estimated eccentricity was used to select all voxels whose eccentricity estimate fell within the annulus bounds. However, here it is also important to take the size of the pRF into account. Some voxels’ estimated eccentricity might fall just outside the annulus, but will still have substantial overlap due to the size of their pRF. Therefore, we further included all voxels whose estimated pRF size resulted in overlap with the annulus. 

      This implies that some voxels with greater eccentricities and larger pRF sizes contribute to the 1D profile, which will influence the spatial specificity of the 1D profiles. However, we want to emphasize that in our view, the exact FWHM value is not so much of interest, as this will always be dependent on the voxel selection and many other data processing steps. Instead, we focus on the relative differences of the FWHM driven by the parametric attentional cue width manipulation. 

      - Data sorting and binning. The reviewer raises an important point about how the FWHM value should be interpreted considering the data processing steps. To generate the 1D spatial profile, we binned voxels based on their estimated polar angle preference into 6degree bins and applied a moving average of 18 degrees to smooth the 1D profiles. Both of these processing steps will influence the spatial specificity of the profile. The binning step facilitates recentering based on cue center and combining across trials.

      To explore the extent to which the moving average substantially impacted our results, we reran our analyses without that smoothing step. The vast majority of the results held. In V1, we found a significant effect of cue width on FWHM where the result was not significant previously (t(7)=2.52, p\=.040). Additionally, when looking at the minimum number of TRs needed to see a significant effect of cue width on FWHM, without the smoothing step in V1 it took 10 TRs (not significant at 10 TRs previously), in V2 it took 5 TRs (10 previously), and in V3 it took 3 TRs (2 previously). The other notable difference is that FWHM was generally a bit larger when the moving average smoothing was performed. We have visualized the group results for the FWHM estimates below to help with comparison. 

      Author response image 1.

      No moving average smoothing:

      Voxel selection methods have been clarified in methods section lines 132-139:

      “Within each ROI, pRF modeling results were used to constrain voxel selection used in the main experiment. We excluded voxels with a preferred eccentricity outside the bounds of the pRF stimulus (<0.7° and >9.1°), with a pRF size smaller than 0.01°, or with poor spatial selectivity as indicated by the pRF model fit (R2 < 10%). Following our 2D visualizations (see below), we further constrained voxel selection by only including voxels whose pRF overlapped with the white noise annulus. We included all voxels with an estimated eccentricity within the annulus bounds, as well as voxels with an estimated pRF size that would overlap the annulus.”

      Data binning methods have been clarified in methods section lines 154-159: 

      “Voxels with pRFs overlapping the white noise annulus were grouped into 60 bins according to their pRF polar angle estimate (6° polar angle bin width). We computed a median BOLD response within each bin. This facilitated the recentering of each profile to align all cue centers for subsequent combining across trials. To improve the signal-to-noise ratio, the resulting profile was smoothed with a moving average filter (width 18° polar angle; see Figure 2b).”

      (3) While the focus of this report is on modulations of visual cortex responses due to attention, the lack of inclusion of results from other retinotopic areas (e.g. V3AB, hV4, IPS regions like IPS0/1) is a weakness 

      We agree with the reviewer that using this approach in other retinotopic areas would be of significant interest. In this case, population receptive field mapping occurred in a separate session with a field of view only covering the occipital cortex (in contrast to the experimental session, which had whole-brain coverage). Because our modeling approach relies on these pRF estimates, we were unable to explore higher visual areas. However, we hope future work will follow up on this.

      We have added the following text to the methods section describing the pRF mapping session (lines 87-89):

      “In this session, the field of view was restricted to the occipital cortex to maximize SNR, thereby limiting the brain regions for which we had pRF estimates to V1, V2, and V3.”

      (4) Additional analyses comparing model fits across amounts of data analyzed suggest the model fitting procedure is biased, with some parameters (e.g., FWHM, error, gain) scaling with noise. 

      In this analysis, we sought to test how much data was needed to recover the attentional field, in view of the need for additional fMRI-based tools for use in tasks that involve more rapid dynamic adaptation of attention. Though we did find that more data reduced noise (and accordingly decreased absolute error and amplitude while increasing FWHM and R<sup>2</sup>), absolute angular error remained low across different temporal intervals (well below the chance level of 90°). With regard to FWHM, we believe that the more important finding is that the model-estimated FWHM was modulated by cue width at shorter timescales of as few as two TRs while maintaining relatively low angular error. We refrain from drawing conclusions here on the basis of the exact FWHM values, both because we don’t have a ground truth for the attentional field and because various processing pipeline steps can impact the values as well. Rather, we are looking at relative value and overall patterns in the estimates. The observed patterns imply that the model recovers meaningful modulation of the attentional field even at shorter time scales.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Additional data reporting and discussion of results are needed as outlined in the public review. 

      Reviewer #2 (Recommendations for the authors):

      (1) The current experimental design effectively captured the impact of varying cue widths on the BOLD response in the visual cortex. However, the inclusion of a neutral cue condition, where the cue width spans 360{degree sign} and all peripheral stimuli are attended, could serve as a valuable baseline. This would enable a quantitative assessment of how much the BOLD response is enhanced in specific spatial regions due to focused cues and, conversely, how much it is diminished in non-attended areas, along with the spatial extent of these effects. 

      Please refer to our response in the public review. 

      (2) While the study provides valuable insights into BOLD signal changes in visual areas corresponding to the focus of attention, it does not extend its analysis to the impact on regions outside the focus of attention. It would be beneficial to explore whether there is a corresponding decrease in BOLD signal in non-attended regions, and if identified, to describe the spatial extent and position of this effect relative to the attended area. Such an analysis could yield deeper insights into how attention influences activity across the visual cortex. 

      We agree with the reviewer that it is very interesting to examine the spread of attention across the whole visual field. Our experiment was designed to focus on width modulations at a fixed eccentricity, but future work should explore how the attentional field changes with eccentricity and interacts with spatial variations across the visual field. This is highlighted in our discussion section (lines 76-81): 

      “Future work can help provide a better understanding of the contribution of spatial attention by considering how the attentional field interacts with these well described spatial variations across the visual field. Measuring the full spatial distribution of the attentional field (across both eccentricity and polar angle) will shed light on how spatial attention guides perception by interacting with the non-uniformity of spatial representations.”

      The addition of figure panels for the estimated baseline parameter in Figures 4-7 provides further information about BOLD effects in unattended regions of the annulus.  

      (3) The rationale behind the selection of task difficulty ratios for different cue widths, specifically the letter-to-number ratios of 1:0, 1:2, 2:3, and 3:6 (or vice versa) for cue widths of 18{degree sign}, 54{degree sign}, 90{degree sign}, and 162{degree sign} respectively, was not explicitly discussed. It would be beneficial to clarify the basis for these ratios, as they may influence the perceived difficulty of the task and thus the comparability of behavioral performance across different cue widths. Ensuring that the task difficulty is consistent across conditions is crucial for attributing differences in behavior and BOLD signals solely to changes in cue width and not confounded by variations in task difficulty. 

      Please refer to our response in the public review. We now clarify why we selected these ratios, and acknowledge more explicitly that behavioral performance differed across width conditions. See also our reply to private comment 1 from Reviewer 3 for some additional analyses examining task related influences.

      Reviewer #3 (Recommendations for the authors):

      (1) Task difficulty: the task seems exceptionally challenging. Stimuli are presented at a relativelyeccentric position for a very brief duration, and a large number of comparisons must be made across a broad region of space. This is reflected in the behavioral performance, which decreases rapidly as the scope of attention increases (Fig. 1). Because trials are blocked, does this change in task difficulty across conditions impact the degree to which neural responses are modulated? How should we consider differences in task difficulty in interpreting the conclusions (especially with respect to the amplitude parameter)? Also, note that the difficulty scales both with number of stimuli - as more need to be compared - but also with the ratio, which differs nonmonotonically across task conditions. One way to dissociate these might be RT: for 54/162, which both employ the same ratio of letter/digits and have similar accuracy, is RT longer for 162, which requires attending more stimuli? 

      In addition to our comments in response to the public review, we emphasize that the reviewer makes an important point that there are differences in task difficulty, though the ratios are as close as they can be given the size and spacing of our stimuli. Behavioral performance varied non-monotonically with cue width, bolstering our confidence that our monotonically increasing model-estimated width is likely not entirely driven by task difficulty. There nevertheless remain open questions related to how task difficulty does impact BOLD attentional modulation, which we hope future work will more directly investigate.

      The reviewer's comments identify two ways our data might preliminarily speak to questions about BOLD attentional modulation and task difficulty. First: how might the amplitude parameter reflect task difficulty? This is an apt question as we agree with the reviewer that it would be a likely candidate in which to observe effects of task difficulty. We do find a small effect of cue width on our amplitude estimates (amplitude decreases with width) in V3. Using the same analysis technique to look at the relationship between task difficulty and amplitude, we find no clear relationship in any of the visual areas (all p >= 0.165, testing whether the slopes differed from zero at the group level using a one-sample t-test). We believe future work using other experimental manipulations should look more systematically at the relationship between task difficulty and amplitude of the attentional BOLD enhancement.

      Second: Does the same ratio at different widths elicit different behavioral responses (namely accuracy and RT)? We followed the reviewer’s suggestion to compare performance between cue widths of three and nine (identical ratios, different widths; see Author response image 2 and Figure 5). We found that, using a paired t-test, behavioral accuracy differed between the two cue widths (mean accuracy of 0.73 versus 0.69, p = 0.008), with better performance for cue width three. RT did not differ significantly between the two conditions (paired t-test, p = 0.729). This could be due to the fact that participants were not incentivized to respond as quickly as possible, they merely needed to respond before the end of the response window (1.25 s) following the stimulus presentation (0.5 s). The comparisons for accuracy and RT (calculated from time of stimulus appearance) are plotted below:

      Author response image 2.

      In summary, with matched stimulus ratios, the wider cue was associated with worse (though not slower) performance. This could be due to the fact that more elements are involved and/or that tasks become more difficult when attending to a broader swath of space. Given these results, we believe that future studies targeting difficulty effects should use direct and independent manipulations of task difficulty and attentional width. 

      (2) Eye movements: while the authors do a good job addressing the average eccentricity of fixation, I'm not sure this fully addresses concerns with eye movements, especially for the character-discrimination task which surely benefits from foveation (and requires a great deal of control to minimize saccades!). Can the authors additionally provide data on, e.g., # of fixations within the attended stimulus annulus, or fixation heatmap, or # of saccades, or some other indicator of likelihood of fixating the letter stimuli for each condition? 

      We agree with the reviewer that this task is surely much easier if one foveated the stimuli, and it did indeed require control to minimize saccades to the annulus. (We appreciate the effort and motivation of our participants!) We are happy to provide additional data to address these reasonable concerns about eye movements. Below, we have visualized the number of fixations to the annulus, separated by participant and width. Though there is variability across participants, there are at most 16 instances of fixations to the annulus for a given participant, combined across all width conditions. The median number of fixations to the annulus per width is zero (shown in red). Considering the amount of time participants engaged in the task (between 8 and 12 runs of the task, each run with 100 trials), this indicates participants were generally successful at maintaining central fixation while the stimuli were presented.

      Author response image 3.

      We added the results of this analysis to the methods section (lines 205-208):

      “Additionally, we examined the number of fixations to the white noise annulus itself. No participant had more than 16 fixations (out of 800-1200 trials) to the annulus during the task, further suggesting that participants successfully maintained fixation.”

      (3) pRF sorting and smoothing: Throughout, the authors are analyzing data binned based on pRF properties with respect to the attended location ("voxels with pRFs overlapping with the white noise annulus", line 243-244) First, what does this mean? Does the pRF center need to be within the annulus? Or is there a threshold based on the pRF size? If so, how is this implemented? Additionally, considering the methods text in lines 242-247, the authors mention that they bin across 6 deg-wide bins and smooth with a moving average (18 deg), which I think will lead to further expansion of the profile of attentional enhancement (see also below) 

      We provide a detailed response in the public review. Furthermore, we have clarified the voxel selection procedure in the Methods (lines 132–139 & 154–159).

      (4) FWHM values: The authors interpret the larger FWHMs estimated from their model-fitting than the actual size of the attended region as a meaningful result. However, depending on details of the sorting procedure above, this may just be due to the data processing itself. One way to identify how much expansion of FWHM occurs due to analysis is by simulating data given estimates of pRF properties for a 'known' shape of modulation (e.g., square wave exactly spanning the attended aperture) and compare the resulting FWHM to that observed for attention and perception conditions (e.g., Fig. 7c). 

      We provide a detailed response in the public review. The essence of our response is to refrain from interpreting the precise recovered FWHM values, which will be influenced by multiple processing steps, and instead to focus on relative differences as a function of the attentional cue width. Accordingly, we did not add simulations to the revised manuscript, although we agree with the reviewer that such simulations could shed light on the underlying spatial resolution, and how binning and smoothing influences the estimated FWHM. We have clarified our interpretation of FWHM results in the manuscript as follows:

      Results lines 137-141:

      “One possibility is that the BOLD-derived FWHM might tend to overestimate the retinotopic extent of the modulation, perhaps driven by binning and smoothing processing steps to create the 1D spatial profiles. If this were the case, we would expect to obtain similar FWHM estimates when modeling the perceptual modulations as well.”

      Results lines 169-175:

      “Mirroring the results from the attentional manipulation, FWHM estimates systematically exceeded the nominal size of the perceptually modulated region of the visual field. Comparing the estimated FWHMs of the perceptual and attentional spatial profiles (Figure 7c) revealed that the estimated widths were highly comparable (Pearson correlation r=0.664 across width conditions and visual regions). Importantly, the relative differences in FWHM show meaningful effects of both cue and contrast width in a similar manner for both attentional and perceptual forms of modulation.”

      Discussion lines 16-22:

      “We also found that the estimated spatial spread of the attentional modulation (as indicated by the recovered FWHM) was consistently wider than the cued region itself. We therefore compared the spread of the attention field with the spatial profile of a perceptually induced width manipulation. The results were comparable in both the attentional and perceptual versions of the task, suggesting that cueing attention to a region results in a similar 1D spatial profile to when the stimulus contrast is simply increased in that region.”

      (5) Baseline parameter: looking at the 'raw' response profiles shown in Fig. 2b, it looks, at first, like the wider attentional window shows substantially lower enhancement. However, this seems to be mitigated by the shift of the curve downwards. Can the authors analyze the baseline parameter in a similar manner as their amplitude analyses throughout? This is especially interesting in contrast to the perception results (Fig. 7), for which the baseline does not seem to scale in a similar way. 

      We agree with the reviewer that the baseline parameter is worth examining, and have therefore added panels displaying the baseline parameter into all results figures (Figures 4-7). There was no significant association between cue width and baseline offset in any of the three visual regions.

      (6) Outlier: Fig. 5, V2, Amplitude result seems to have a substantial outlier - is there any notable difference in e.g. retinotopy in this participant? 

      One participant indeed has a notably larger median amplitude estimate in V2. Below, we plot the spatial coverage from the pRF data for this participant (022), as well as all other participants.

      Author response image 4.

      Each subplot represents a participant's 2D histogram of included voxels for the 1D spatial profiles; the colors indicate the proportion of voxels that fell within a specific x,y coordinate bin. Note that this visualization only shows x and y estimates and does not take into account size of the pRF. While there is variation across participants in the visual field coverage, the overall similarity of the maps indicates that retinotopy is unlikely to be the explanation. 

      To further explore whether this participant might be an outlier, we additionally looked at behavioral performance, angular error and FWHM parameters as well as the goodness of fit of the model. On all these criteria this participant did not appear to be an outlier. We therefore see no reason to exclude this participant from the analyses.  

      (7) Fig. 4 vs Fig. 5: I understand that Fig. 4 shows results from a single participant, showing variability across blocks, while Fig. 5 shows aggregate results across participants. However, the Angular Error figure shows complementary results - Fig. 4 shows the variability of best-fit angular error, while Fig. 5 shows the average deviation (approximately the width of the error distribution). This makes sense I think, but perhaps the abs(error) for the single participant shown in Fig. 4 should be included in the caption so we can easily compare between figures. 

      That's right: the Figure 4 results show the signed error, whereas the Figure 5 results show the absolute error. We agree that reporting the absolute error values for the example participant would facilitate comparison. Rather than add the values to the text, we have made the example participant’s data visually distinct within Figure 5 for easy comparison.  

      (8) Bias in model fits: the analysis shown in Fig. 6 compares the estimated parameters across amounts of data used to compute attentional modulation profiles for fitting those parameters. If the model-fitting procedure were unbiased, my sense is we would likely see no impact of the number of TRs on the parameters (R^2 should improve, abs(error) should improve, but FWHM, amplitude, baseline, etc should be approximately stable, if noisier). However, instead, it looks like more/less data leads to biased estimates, such that FWHM is biased to be smaller with more noise, and amplitude is biased to be larger. This suggests (to me) that the fit is landing on a spiky function that captures a noise wiggle in the profile. I don't think this is a problem for the primary results across the whole block of 10 TRs, which is the main point of the paper. Indeed, I'm not sure what this figure is really adding, since the single-TR result isn't pursued further (see below). 

      Please refer to our response in the public review, comment 4. 

      (9) 'Dynamics': The paper, starting in the title, claims to get at the 'dynamics' of attention fields. At least to me, that word implies something that changes over time (rather than across trials). Maybe I'm misinterpreting the intent of the authors, but at present, I'm not sure the use of the word is justified. That said, if the authors could analyze the temporal evolution of the attention field through each block of trials at 1- or 2-TR resolution, I think that could be a neat addition to the paper and would support the claim that the study assays dynamic attention fields. 

      We thank the reviewer for giving us a chance to speak more directly to the dynamic aspect of our approach. Here, we specifically use the word “dynamic” to refer to trial-to-trial dynamics.  Importantly, our temporal interval analysis suggests that we can recover information about the attentional field at a relatively fine-grained temporal resolution (a few seconds, or 2 TRs). Following this methodological proof-of-concept to dynamically track the attentional field, we are excited about future work that can more directly investigate the manner in which the attentional field evolves through time, especially in comparison to other methods that first require training on large amounts of data.

      (10) Correction for multiple comparisons across ROIs: it seems that it may be necessary to correct statistical tests for multiple comparisons across each ROI (e.g., Fig. 5 regression tests). If this isn't necessary, the authors should include some justification. I'm not sure this changes any conclusions, but is worth considering. 

      We appreciate the opportunity to explain our reasoning regarding multiple comparisons. We thought it appropriate not to correct as we are not comparing across regions and are not treating tests of V1, V2, and V3 as multiple opportunities to support a common hypothesis. Rather, the presence or absence of an effect in each visual region is a separate question. We would typically perform correction for multiple comparisons to control the familywise error rate when conducting a family of tests addressing a common hypothesis. We have added this to the Methods section (lines 192-195): 

      “No multiple comparison correction was applied, as the different tests for each region are treated as separate questions. However, using a threshold of 0.017 for p-values would correct for comparisons across the three brain regions.”

      However, we are happy to provide corrected results. If we use Bonferroni correction across ROIs (i.e. multiply p-values by three), there are some small changes from significant to only trending towards significance, but these changes don’t affect any core results. The changes that go from significant to trending are:

      Associated with Figure 5 – In V3, the relationship of cue width to amplitude goes from a p-value of 0.017 to 0.051.

      Associated with Figure 6 –

      V1: the effect of cue width on FWHM goes from p = 0.043 to 0.128.

      V2: the effect of TR on both FWHM and R2 goes from p = ~0.02 to ~0.06. 

      V3: the effect of cue width on amplitude goes from p = 0.024 to 0.073.

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Expressed concern that FOOOF may not be sensitive to peaks located at the edges of the spectrum and suggested using rhythmicity as an alternative measure of oscillatory activity.

      To address this concern, we first conducted a simulation in which we generated power spectra with a single periodic component while varying its parameters. The results confirmed that FOOOF may indeed have reduced sensitivity to low-frequency periodic components. In such cases, periodic activity can be conflated with aperiodic activity, leading to inflated estimates of the aperiodic component. These simulation results are presented in detail at the end of the Supplement.

      To further investigate whether the low-frequency activity in our datasets may be oscillatory, we employed the phase-autocorrelation function (pACF), a measure of rhythmicity developed by Myrov et al. (2024). We compared pACF and FOOOF-derived parameters using linear mixed models at each channel–frequency– time point (see Methods for details). Our analyses showed that pACF activity closely resembles periodic activity across all three datasets, and is dissimilar to aperiodic parameters (see Figures 5, S4, S5, S21, S22, S34, S35). This supports the interpretation that, in our data, aperiodic activity is not conflated with periodic activity.

      I was concerned that “there were no dedicated analyses in the paper to show that the aperiodic changes account for the theta changes.”

      To address this concern, we used linear mixed models to estimate the association between FOOOF parameters and baseline-corrected time-frequency activity. These models were fitted at each channel-frequency-time point. Our results indicate that aperiodic activity is correlated with low-frequency (theta) baseline-corrected activity, while periodic activity is correlated primarily with activity in the alpha/beta range, but not with theta (see Figures 4, S3, S20, S33). Additionally, the exponent parameter exhibited a negative correlation in the gamma frequency range.

      These findings support the reviewer's hypothesis: “I would also like to note that if the theta effect is only the aperiodic shift in disguise, we should see a concomitant increase in delta activity too – maybe even a decrease at high frequencies.” Overall, the results are consistent with our interpretation that low-frequency baseline-corrected activity reflects changes in aperiodic, rather than periodic, activity.

      “On page 7 it is noted that baseline correction might subtract a significant amount of ongoing periodic activity. I would replace the word "subtract" with "remove" as not all baseline correction procedures are subtractive. Furthermore, while this sentence makes it sound like a problem, this is, to my mind, a feature, not a bug - baseline correction is meant to take away whatever is ongoing, be it oscillatory or not, and emphasise changes compared to that, in response to some event.”

      We thank the reviewer for this helpful clarification. We have revised the sentence accordingly to read: “Our results show that classical baseline correction can remove continuous oscillatory activity that is present both during baseline and after stimulus onset, because it treats all baseline signals as 'background' to be removed without distinguishing between transient and continuous oscillations. While this is consistent with the intended purpose of baseline correction---to highlight changes relative to ongoing activity---it may also lead to unintended consequences, such as misinterpreting aperiodic activity as an increase in poststimulus theta oscillations.”

      In addition, we have made several broader revisions throughout the manuscript to improve clarity and accuracy in response to the reviewer’s feedback:

      (1) We have softened our interpretation of changes in the theta range. We no longer claim that these effects are solely due to aperiodic activity; rather, we now state that our findings suggest a potential contribution of aperiodic activity to signals typically interpreted as theta oscillations.

      (2) We have revised our language to avoid suggesting a direct “interplay” between periodic and aperiodic components. Instead, we emphasize the concurrent presence of both components, using more precise and cautious formulations.

      (3) We have clarified our discussion of baseline normalization approaches, explicitly noting that our findings hold regardless of whether a subtractive or divisive baseline correction was applied.

      (4) Finally, we have restructured the introduction to improve readability and address points of potential confusion. Specifically, we have clarified the definition and role of 1/f activity, refined the discussion linking baseline correction to aperiodic activity, and improved transitions between key concepts.

      Reviewer suggested that “it might be good to show that the findings were not driven by the cognitive-complaint subgroup (although the internal replications suggest they were not).”

      We agree that it is important to demonstrate that our findings are not driven solely by the cognitive-complaint subgroup. While we did not include additional figures in the manuscript due to their limited relevance to the primary research question, we have attached figures that explicitly show the comparison between the clinical and control groups here in the response to reviewers. These figures include non-significant effects.

      Author response image 1.

      Results of the linear mixed model analysis of periodic activity for comparison between conditions, including non-significant effect (see also Figure 7 in the paper)

      Author response image 2.

      Results of the linear mixed model analysis of aperiodic exponent for comparison between conditions, including nonsignificant effects (see also Figure 9 in the paper)

      Author response image 3.

      Results of the linear mixed model analysis of aperiodic offset for comparison between conditions, including non-significant effects (see also Figure S11 in the paper)

      “Were lure trials discarded completely, or were they included in the non-target group?”

      Thank you for the question. As described in the Methods section (EEG data preprocessing), lure trials were discarded entirely from further analysis and were not included in the non-target group.

      “Also, just as a side note, while this time-resolved approach is definitely new, it is not novel to this paper, at least two other groups have tried similar approaches, e.g., Wilson, da Silva Castanheira, & Baillet, 2022; Ameen, Jacobs, et al., 2024.”

      Thank you for drawing our attention to these relevant studies. We have now cited both Wilson et al. (2022) and Ameen et al. (2024) in our manuscript. While these papers did indeed use time-resolved approaches, to our knowledge our study is the first to use such an approach within a task-based paradigm.

      noted that it was unclear how the periodic component was reconstructed: “I understand that a Gaussian was recreated based on these parameters, but were frequencies between and around the Gaussians just zeroed out? Or rather, given a value of 1, so that it would be 0 after taking its log10.”

      The periodic component was reconstructed by summing the Gaussians derived from the FOOOF model parameters. Since the Gaussians asymptotically approach, but never reach, zero, there were no explicit zeros between them. We have included this explanation in the manuscript.

      “If my understanding is correct, the periodic and aperiodic analyses were not run on the singletrial level, but on trial-averaged TF representations. Is that correct? In that case, there was only a single observation per participant for each within-subject cell at each TF point. This means that model (4) on p. 15 just simplifies to a repeated-measures ANOVA, does it not? As hinted at later in this section, the model was run at each time point for aperiodic analyses, and at each TF point for periodic analyses, resulting in a series of p-values or a map of p-values, respectively, is that correct?”

      We thank the reviewer for this careful reading and helpful interpretation. The reviewer is correct that analyses were conducted on trial-averaged time-frequency representations. Model presented in equation 7 (as referred to in the current version of the manuscript) is indeed conceptually similar to a repeated-measures ANOVA in that it tests within-subject effects across conditions. However, due to some missing data (i.e., excluded conditions within subjects), we employed linear mixed-effects models (LMER), which can handle unbalanced data without resorting to listwise deletion. This provides more flexibility and preserves statistical power.

      The reviewer is also correct that the models were run at each channel-time point for the aperiodic analyses, and at each channel-time-frequency point for the periodic analyses, resulting in a series or map of p-values, respectively.

      suggested marking the mean response time and contrasting scalp topographies of response-related ERPs with those of aperiodic components.

      We thank the reviewer for this helpful suggestion. In response, we have now marked the mean response time and associated confidence intervals on the relevant figures (Figures 8 and S8). Additionally, we have included a new figure (Figure S13) presenting both stimulus- and response-locked ERP scalp topographies for comparison with aperiodic activity.

      In the previous version of the manuscript, we assessed the relationship between ERPs and aperiodic parameters by computing correlations between their topographies at each time point. However, to maintain consistency with our other analyses and to provide a more fine-grained view, we revised this approach and now compute correlations at each channel–time point. This updated analysis is presented in Figure S14. The results confirm that the correlation between ERPs and aperiodic activity remains low, and we discuss these findings in the manuscript.

      Regardless of the low correlation, we have added the following statement to the manuscript to clarify our conceptual stance: “While contrasting response-related ERPs with aperiodic components can help address potential confounds, we believe that ERPs are not inherently separate from aperiodic or periodic activity. Instead, ERPs may reflect underlying changes in aperiodic and periodic activity. Therefore, different approaches to studying EEG activity should be seen as providing complementary rather than competing perspectives.”

      “On page 3, it is noted that distinct theta peaks were only observed in 2 participants. Was this through visual inspection?”

      Yes, this observation was based on visual inspection of the individual power spectra. We have included this explanation in the text.

      suggested improving the plots by reducing the number of conditions (e.g., averaging across conditions), increasing the size of the colorbars, and using different color scales for different frequency bands, given their differing value ranges. Additionally, the reviewer noted that the theta and alpha results appeared surprising and lacked their expected topographical patterns, possibly due to the color scale.

      We appreciate these thoughtful suggestions and have implemented all of them to improve the clarity and interpretability of the figures. Specifically, we reduced the number of conditions by averaging across them where appropriate, enlarged the colorbars for better readability, and applied separate color scales for different frequency bands to account for variability in dynamic range.

      In the process, we also identified and corrected an error in the code that had affected the topographies of periodic activity in the previous version of the manuscript. With this correction, the resulting topographical patterns are now more consistent with canonical findings and are easier to interpret. For example, activity in the beta range now shows a clear central distribution (see Figure 6B and Figure S5B), and frontal activity in the theta range is more apparent.

      This correction also directly addresses the reviewer’s concern that the “theta and alpha results (where visible) look surprising – the characteristic mid-frontal and posterior topographies, respectively, are not really present.” These unexpected patterns were primarily due to the aforementioned error.

      “Relatedly, why is the mu parameter used here for correlations? Why not simply the RT mean/median, or one of the other ex-Gaussian parameters? Was this an a priori decision?”

      We appreciate the reviewer's thoughtful question. While mean and median RTs are indeed commonly used as summary measures, we chose the mu parameter because it provides a more principled estimate of central tendency that explicitly accounts for the positive skew typically observed in RT distributions. Although we did not directly compare mu, mean and median in this dataset, our experience with similar datasets suggests that differences between them are typically small. We chose not to include other ex-Gaussian parameters (e.g., sigma, tau) to avoid unnecessary model complexity and potential overfitting, especially since our primary interest was not in modelling the full distribution of response variability. This decision was made a priori, although we note that the study was not pre-registered. We have now added a clarification in the manuscript to reflect this rationale.

      “Relatedly, were (some) analyses of the study preregistered?”

      The analyses were not preregistered. Our initial aim was to investigate differences in phaseamplitude coupling (PAC) between the clinical and control groups. However, we did not observe clear PAC in either group—an outcome consistent with recent concerns about the validity of PAC measures in scalp EEG data (see: https://doi.org/10.3390/a16120540). This unexpected finding prompted us to shift our focus toward examining the presence of theta activity and assessing its periodicity.

      The reviewer suggested examining whether there might be differences between trials preceded by a target versus trials preceded by a non-target, potentially reflecting a CNV-like mechanism.

      We appreciate the reviewer’s insightful suggestion. The idea of investigating differences between trials preceded by a target versus a non-target, possibly reflecting a CNV-like mechanism, is indeed compelling. However, this question falls outside the scope of the current study and was not addressed in our analyses. We agree that this represents an interesting direction for future research.

      Reviewer #2 (Public review):

      “For the spectral parameterization, it is recommended to report goodness-of-fit measures, to demonstrate that the models are well fit and the resulting parameters can be interpreted.”

      We thank the reviewer for this suggestion. We have added reports of goodness-of-fit measures in the supplementary material (Fig. S9, S25, S41). However, we would like to note that our simulation results suggest that high goodness-of-fit values are not always indicative of accurate parameter estimation. For example, in our simulations, the R² values remained high even when the periodic component was not detectable or when it was conflated with the aperiodic component (e.g., compare Fig. S48 with Fig. S47). We now mention this limitation in the revised manuscript to clarify the interpretation of the goodness-of-fit metrics.

      “Relatedly, it is typically recommended to set a maximum number of peaks for spectral parameterization (based on the expected number in the analyzed frequency range). Without doing so, the algorithm can potentially overfit an excessive number of peaks. What is the average number of peaks fit in the parameterized spectra? Does anything change significantly in setting a maximum number of peaks? This is worth evaluating and reporting.”

      We report the average number of peaks, which was 1.9—2 (Figure S10). The results were virtually identical when setting number of peaks to 3.

      “In the main text, I think the analyses of 'periodic power' (e.g. section ‘Periodic activity...’ and Figures 4 & 5 could be a little clearer / more explicit on the measure being analyzed. ‘Periodic’ power could in theory refer to the total power across different frequency bands, the parameterized peaks in the spectral models, the aperiodic-removed power across frequencies, etc. Based on the methods, I believe it is either the aperiodic power or an estimate of the total power in the periodic-only model fit. The methods should be clearer on this point, and the results should specify the measure being used.”

      We thank the reviewer for highlighting this point. In our analyses, “periodic power” (or “periodic activity”) refers specifically to the periodic-only model fit. We have added clarifications under Figure 3 and in the Methods section to make this explicit in the revised manuscript.

      “The aperiodic component was further separated into the slope (exponent) and offset components". These two parameters describe the aperiodic component but are not a further decomposition per se - could be rephrased.”

      We thank the reviewer for alerting us to this potential misunderstanding. We have now rephrased the sentence to read: “The aperiodic component was characterised by the aperiodic slope (the negative counterpart of the exponent parameter) and the offset, which together describe the underlying broadband spectral shape.”

      “In the figures (e.g. Figure 5), the channel positions do not appear to be aligned with the head layout (for example - there are channels that extend out in front of the eyes).”

      Corrected.

      “Page 2: aperiodic activity 'can be described by a linear slope when plotted in semi-logarithmic space'. This is incorrect. A 1/f distributed power spectrum has a linear slope in log-log space, not semi-log.”

      Corrected.

      Page 7: "Our results clearly indicate that the classical baseline correction can subtract a significant amount of continuous periodic activity". I am unclear on what this means - it could be rephrased.

      We thank the reviewer to pointing out that the statement is not clear. We have now rephrased is to read: “Our results show that classical baseline correction can remove continuous oscillatory activity that is present both during baseline and after stimulus onset, because it treats all baseline signals as 'background' to be removed without distinguishing between transient and continuous oscillations.”

      ”Page 14: 'the FOOOF algorithm estimates the frequency spectrum in a semi-log space'. This is not quite correct - the algorithm parameterizes the spectrum in semi-log but does not itself estimate the spectrum.”

      Again, we thank the reviewer for alerting us to imprecise description. We have now changed the sentence to: “The FOOOF algorithm parameterises the frequency spectrum in a semi-logarithmic space”.

      We have made refinements to improve clarity, consistency, and flow of the main text. First, we streamlined the introduction by removing redundancies and ensuring a more concise presentation of key concepts. We also clarified our use of terminology, consistently referring to the ‘aperiodic slope’ throughout the manuscript, except where methodological descriptions necessitate the term ‘exponent.’ Additionally, we revised the final section of the introduction to better integrate the discussion of generalisability, ensuring that the inclusion of additional datasets feels more seamlessly connected to the study’s main objectives rather than appearing as an addendum. Finally, we carefully reviewed the entire manuscript to enhance coherence, particularly ensuring that discussions of periodic and aperiodic activity remain precise and do not imply an assumed interplay between the two components. We believe these revisions align with the reviewer’s suggestions and improve the overall readability and logical structure of the manuscript.

      Reviewer #3 (Public review):

      Raised concerns regarding the task's effectiveness in evoking theta power and the ability of our spectral parameterization method (specparam) to adequately quantify background activity around theta bursts.

      We thank Reviewer #3 for their constructive feedback. To address the concerns regarding the task’s effectiveness in evoking theta power and the adequacy of our spectral parameterization method, we have added additional visualizations using a log-y axis ****(Figures S1, S19, S32). These figures demonstrate that, in baseline-corrected data, low-frequency activity during working memory tasks appears as both theta and delta activity. Additionally, we have marked the borders between frequency ranges with dotted lines to facilitate clearer visual differentiation between these bands. We believe these additions help clarify the results and address the reviewer’s concerns.

      The reviewer noted that “aperiodic activity seems specifically ~1–2 Hz.”

      In our data baseline-corrected low-frequency post-stimulus increase in EEG activity spans from approximately 3 to 7 Hz, with no prominent peak observed in the canonical theta band (4–7 Hz). While we did not analyze frequencies below 3 Hz, we agree with the reviewer that some of this activity could potentially fall within the delta range.

      Nonetheless, we would like to emphasize that similar patterns of activity have often been interpreted as theta in the literature,  even  in  the  absence  of a distinct spectral  peak (see: https://doi.org/10.1016/j.neulet.2012.03.076;    https://doi.org/10.1016/j.brainres.2006.12.076; https://doi.org/10.1111/psyp.12500; https://doi.org/10.1038/s42003-023-05448-z — particularly, see the interpretation of State 1 as a “theta prefrontal state”).

      To accommodate both interpretations, we have opted to use the more neutral term “low-frequency activity” where appropriate. However, we also clarify that such activity is frequently referred to as “theta” in prior studies, even in the absence of a clear oscillatory peak.

      “Figure 4 [now Figure 6]: there is no representation of periodic theta.”

      Yes, this is one of the main findings of our study - periodic theta is absent in the vast majority of participants. A similar finding was found in a recent preprint on a working memory task (https://doi.org/10.1101/2024.12.16.628786), which further supports our results.

      “Figure 5 [now Figure 7]: there is some theta here, but it isn't clear that this is different from baseline corrected status-quo activity.”

      This figure shows comparisons of periodic activity between conditions. Although there are differences between conditions in the theta band, this does not indicate the presence of theta oscillations. Instead, the differences between the conditions in the theta band are most likely due to alpha components extending into the theta band (see Figure S6). This is further supported by the large overlap of significant channels between theta and alpha in Figure 7.

      “Figure 8: On the item-recognition task, there appears to be a short-lived burst in the high delta / low theta band, for about 500 ms. This is a short phenomenon, and there is no evidence that specparam techniques can resolve such time-limited activity.”

      We thank the reviewer for their comment. As we noted in our preliminary response, specparam, in the form we used, does not incorporate temporal information; it can be applied to any power spectral density (PSD), regardless of how the PSD is derived. Therefore, the ability of specparam to resolve temporal activity depends on the time-frequency decomposition method used. In particular, the performance of specparam is limited by the underlying time-frequency decomposition method and the data available for it. In fact, Wilson et al. (2022, https://doi.org/10.7554/eLife.77348), who have developed an approach for timeresolved estimation of aperiodic parameters, actually compare two approaches that differ only in their underlying time-frequency estimation method, while the specparam algorithm is the same in both cases. For the time-frequency decomposition we used superlets (https://doi.org/10.1038/s41467-020-20539-9), which have been shown to resolve short bursts of activity more effectively than other methods. To our knowledge, superlets provide the highest resolution in both time and frequency compared to wavelets or STFT.

      To improve the stability of the estimates, we performed spectral parameterisation on trial-averaged power rather than on individual trials (unlike the approach in Wilson et al., 2022). In contrast, Gyurkovics et al. (2022) who also investigated task-related changes in aperiodic activity, estimated power spectra at the single-trial level, but stabilised their estimates by averaging over 1-second time windows; however, this approach reduced their temporal resolution. We have now clarified this point in the manuscript.

      “The authors note in the introduction that ‘We hypothesised that the aperiodic slope would be modulated by the processing demands of the n-back task, and that this modulation would vary according to differences in load and stimulus type.’. This type of parametric variation would be a compelling test of the hypothesis, but these analyses only included alpha and beta power (Main text & Figure 4)”

      We appreciate the reviewer's comment, but would like to clarify that the comparison between conditions was performed separately for both periodic power and aperiodic parameters. The periodic power analyses included all frequencies from 3 to 50 Hz (or 35 Hz in the case of the second dataset). All factors were included in the linear model (see LMM formula in equation 7 - subsection Methods / Comparisons between experimental conditions), but the figures only include fixed effects that were statistically significant. For example, Figure 7 shows the periodic activity and Figure 9 shows the exponent, with further details provided in other supplementary figures.

      “Figure 5 does show some plots with some theta activity, but it is unclear how this representation of periodic activity has anything to do with the major hypothesis that aperiodic slope accounts for taskevoked theta.” /…/ In particular, specparam is a multi-step model fitting procedure and it isn't impressively reliable even in ideal conditions (PMID: 38100367, 36094163, 39017780). To achieve the aim stated in the title, abstract, and discussion, the authors would have to first demonstrate the robustness of this technique applied to these data.

      We acknowledge these concerns and have taken several steps to clarify the relationship between the aperiodic slope and low-frequency activity, and to assess the robustness of the specparam (FOOOF) approach in our data.

      First, we directly compared baseline-corrected activity with periodic and aperiodic components in all three data sets. These analyses showed that low-frequency increases in baseline-corrected signals consistently tracked aperiodic parameters - in particular the aperiodic exponent - rather than periodic theta activity (see Figs 4, S3, S20, S33). Periodic components, on the other hand, were primarily associated with baseline corrected activity in the alpha and beta bands. The aperiodic exponent also showed negative correlations with high beta/gamma baseline-corrected activity, which is exactly what would be expected in the case of a shift in the aperiodic slope (rather than delta/theta oscillations). See also examples at https://doi.org/10.1038/s41593-020-00744-x (Figures 1c-iv) or https://doi.org/10.1111/ejn.15361 (Figures 3c,d).

      Next, because reviewer #1 was concerned that FOOOF might be insensitive to peaks at the edges of the spectrum, we ran a simulation that confirmed this concern. We then applied an alternative phase-based measure of oscillatory activity: the phase-autocorrelation function (pACF; Myrov et al., 2024). This method does not rely on spectral fitting and is sensitive to phase rather than amplitude. Across all datasets, pACF results were in close agreement with periodic estimates from FOOOF and were not correlated with aperiodic parameter estimates (Figs 5, S4, S5, S21, S22, S34, S35).

      Taken together, these complementary analyses suggest that the apparent low-frequency (delta, theta) activity observed in the baseline-corrected data is better explained by changes in the aperiodic slope than by true low-frequency oscillations. While we acknowledge the limitations of any single method, the convergence between the techniques increases our confidence in this interpretation.

      “How did the authors derive time-varying changes in aperiodic slope and exponent in Figure 6 [now Figure 8]?”

      We thank the reviewer for this question. As explained in the Methods section, we first performed a time-frequency decomposition, averaged across trials, and then applied a spectral decomposition to each time point.

      “While these methodological details may seem trivial and surmountable, even if successfully addressed the findings would have to be very strong in order to support the rather profound conclusions that the authors made from these analyses, which I consider unsupported at this time:

      (a) ‘In particular, the similarities observed in the modulation of theta-like activity attributed to aperiodic shifts provide a crucial validation of our conclusions regarding the nature of theta activity and the aperiodic component.’

      (b) ‘where traditional baseline subtraction can obscure significant neural dynamics by misrepresenting aperiodic activity as theta band oscillatory activity’

      (d) ‘our findings suggest that theta dynamics, as measured with scalp EEG, are predominantly a result of aperiodic shifts.’

      (e)  ‘a considerable proportion of the theta activity commonly observed in scalp EEG may actually be due to shifts in the aperiodic slope’.

      (f) ‘It is therefore essential to independently verify whether the observed theta activity is genuinely oscillatory or primarily aperiodic’

      [this would be great, but first we need to know that specparam is capable of reliably doing this].”

      We believe that our claims are now supported by the aforementioned analyses, namely associations between baseline-corrected time-frequency activity and FOOOF parameters and associations between FOOOF parameters and PACF.

      The reviewer found it unclear what low-frequency phase has to do with 1/f spectral changes: ‘Finally, our findings challenge the established methodologies and interpretations of EEG-measured crossfrequency coupling, particularly phase-amplitude coupling’

      We thank the reviewer for their comment. To address this concern, we have added further clarification in the Discussion section. Our results are particularly relevant for phase-amplitude coupling (PAC) based on theta, such as theta-gamma coupling. PAC relies on the assumption that there are distinct oscillations at both frequencies. However, if no clear oscillations are present at these frequencies— specifically, if theta oscillations are absent—then the computation of PAC becomes problematic.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Point-by-point reply in response to the Reviewer’s comments

      Reviewer #1

      Public review:

      [1] (a) Given that only a fraction of the FAPs express BDNF after injury, the authors need to demonstrate the specificity of the Prrx1-Cre for FAPs. This is particularly important because muscle stem cell also express GDNF receptors (Fig. 3C & D) and myogenic progenitors/satellite cells produce BDNF after nerve injury (Griesbeck et al., 1995 (PMID 8531223); Omura et al., 2005 (PMID 16221288)). (b) Moreover, as the authors point out, there are multipotent mesenchymal precursor cells in the nerve that migrate into the surrounding tissue following nerve injury and contribute to regeneration (Carr et al, PMID 30503141). Therefore, there are multiple possible sources of BDNF, highlighting the need to clearly demonstrate that FAP-derived BDNF is essential.

      - (a) As the Reviewer noted, both GDNF receptor expression and increased BDNF expression in response to nerve injury are detectable in both FAPs and muscle stem cells (MuSCs). Therefore, we agree with the Reviewer that demonstrating the specificity of Prrx1-Cre in FAPs is crucial to support our claim. In our previous publication (Kim et al., 2022), using Prrx1-Cre; Rosa-eYFP mice, we showed that while most of the CD31-CD45-Vcam1-Sca1+ FAPs are eYFP+, CD31-CD45-Vcam1+Sca1- MuSCs do not express eYFP (Liu et al., 2015; Kim et al., 2022) (Attached Figure 1). Additionally, genomic DNA PCR using mononuclear cells sorted from our Prrx1Cre; Bdnffl/fl mice showed that DNA recombination in the floxed Bdnf gene could only be detected in FAPs and CD31-CD45-Vcam1-Sca1- cells, but not in MuSCs (Author response image 2). This is consistent with a previous report that showed Prrx1-Cre activity in FAPs, pericytes, vascular smooth muscle cells (vSMCs) and tenocytes (Leinroth et al.,

      2022), where pericytes, vSMCs and tenocytes are included the CD31-CD45-Vcam1Sca1- population (Giordani et al., 2019). Together, these results demonstrate that while Prrx1-Cre is active in FAPs, it is absent in MuSCs.

      Author response image 1.

      Expression of eYFP in muscle-resident, lineage-negative, live mononuclear cells isolated from Prrx1Cre;RosaeYFP mice. Supplemental Figure 3A from Kim et al., 2022. Lin-: lineage-negative (CD31-CD45-); Neg.: Vcam1-Sca1-.

      Author response image 2.

      Recombination of the floxed Bdnf gene in the mononuclear cells sorted from muscles of Prrx1Cre; Bdnffl/fl or Bdnffl/fl mice. Genotypes and cell types sampled for each lane is specified. P4, P5, and P6 indicate primers used for each PCR. Lin+: lineage(CD31/CD45)-positive; DN: CD31-CD45-Vcam1-Sca1-.

      - (b) We appreciate and agree with the Reviewer’s comment that additional experiments are needed to confirm that FAP-derived BDNF is indeed essential for nerve regeneration, considering other potential cellular sources of BDNF, such as nerve-resident mesenchymal precursor cells. One possible experiment that could demonstrate the requirement of FAP-derived BDNF in nerve regeneration would be the transplantation of wild-type FAPs into our Prrx1Cre; Bdnf fl/fl mice and to see if the delay in nerve regeneration and remyelination is recovered, making the process similar to that in control mice. Unfortunately, since the genetic background of our Prrx1Cre; Bdnffl/fl mice is a mixture of B6, 129S4, and BALB/c, immune rejection of the transplanted cells may occur, which makes the experiment technically difficult. Another experimental approach could involve the use of FAP-specific Cre mouse line, as we have mentioned in the Discussion of our original manuscript. However, such a line does not yet exist due to the lack of a marker gene that is expressed specifically in FAPs, but not in nerve-resident mesenchymal precursor cells. Overcoming such technical challenges and demonstrating the requirement of FAP-derived BDNF in nerve regeneration would significantly strengthen our report, though we regret that these methods are currently unavailable.

      [2] Similarly, the authors should provide some evidence that BDNF protein is produced by FAPs. All of their data for BDNF expression is based on mRNA expression and that appears to only be increased in a small subset of FAPs. Perhaps an immunostaining could be done to demonstrate up-regulation of BDNF in FAPs after injury.

      - We appreciate the Reviewer’s constructive comment. To demonstrate that BDNF protein is produced by FAPs upon nerve injury, we performed western blot analysis. FAPs were isolated from either sciatic nerve crush injury-affected muscles at 7 days post injury (dpi) or from the contralateral, uninjured muscles, and protein samples were prepared for SDS-PAGE and western blot using anti-BDNF, anti-PDGFRα and antiGAPDH antibodies. As a result, while both nerve injury-affected and uninjured musclederived FAPs expressed PDGFRα, the mature from of BDNF protein was only detected in nerve injury-affected FAPs, showing that BDNF is indeed expressed in FAPs at the protein level after injury. We have added this new result as Figure 4F in the New Figure 4 with the experimental scheme as New Figure 4—figure supplement 1, and revised the Results section (lines 364-374) and the Materials and Methods section (lines 687-705) in our manuscript to include the new results in detail.

      [3] The suggestion that Schwann cell-derived GDNF is responsible for upregulation of BDNF in the FAPs is indirect, based largely on the data showing that injection of GDNF into the muscle is sufficient to up-regulate BDNF (Fig. 4F & G). However, to more directly connect the 2 observations in a causal way, the authors should inject a Ret/GDNF antagonist, such as a Ret-Fc construct, then measure the BDNF levels.

      - We appreciate the Reviewer’s constructive comment, and we agree that testing the necessity of GDNF/RET signaling in BDNF upregulation is crucial to link the expression of the two neurotrophic factors in a causal way. As a means to antagonize GDNF/RET signaling, we injected anti-GDNF antibodies into the tibialis anterior and gastrocnemius muscles following sciatic nerve crush injury to block the activity of intramuscular GDNF protein. As a result, although the differences were not statistically significant, we observed a tendancy towards decreased Bdnf mRNA expression upon anti-GDNF injection compared to IgG controls. We have added this new result as New Figure 4—figure supplement 2, and revised our manuscript to include the details in both the Results section (lines 381-390) and the Materials and Methods section (lines 611-616). We have also changed the title of New Figure 4 (line 332) to encompass the new results. We are aware that further experiments that may involve increasing the number of animals tested, increasing the antibody injection dosage or frequency, or implementation of genetic models such as Plp1CreER; Gdnffl/fl should be carried out to validate our hypothesis with statistical significance. Unfortunately, due to limited time, resources, and research funds, we were unable to perform such additional experiments. We hope that the Reviewer understands these limitations.

      [4] (a) In assessing the regeneration after nerve crush, the authors focus on remyelination, for example, assessing CMAP and g-ratios. However, they should also quantify axon regeneration, which can be done distal to the crush injury at earlier time points, before the 6 weeks scored in their study. Evaluating axon regeneration, which occurs prior to remyelination, would be especially useful because BDNF can act on both Schwann cells, to promote myelination, and axons, enhancing survival and growth. (b) They could also evaluate the stability of the neuromuscular junctions, particularly if a denervation was done with the conditional knock outs, although that may be a bit beyond the scope of this study.

      - (a) As the Reviewer mentioned, BDNF is known to act on both Schwann cells and axons, where it promotes myelination and axonal growth, respectively (Oudega and

      Hagg, 1998; Zhang et al., 2000; Chan et al., 2001; Xiao et al., 2009; English et al.,

      2013). We fully agree with the Reviewer’s comment that quantification of axon regeneration, which could be achieved through immunostaining of the distal part of the sciatic nerve at earlier time points after injury, would shed light on whether FAPderived BDNF can also contribute to axon regeneration in addition to remyelination. Unfortunately, we could not perform such additional experiments within the limited time frame, since preparing enough numbers of control and conditional knockout mice that match the age groups used in this study (3-4 months old), followed by waiting for additional 2-4 weeks after nerve crush injury for sample collection, and subsequent immunostaining for quantification could take almost 6 months in total. We hope that the Reviewer understands this limitation.

      - (b) We appreciate the Reviewer’s constructive comment. Although the number of animals used for neuromuscular junction (NMJ) analyses was not sufficient, we had briefly examined the structure of NMJs at 4 weeks post nerve crush injury in control (Ctrl) and conditional knockout (cKO) mice as a preliminary experiment. As a result, no significant differences were observed between Ctrl and cKO mice in terms of NMJ morphology and innervation (Author response image 3). 

      Author response image 3.

      Structures of neuromuscular junctions from Ctrl vs cKO mice at 4 weeks post nerve crush injury. Whole-mount immunostaining was done using the exterior digitorum longus muscles that were affected by sciatic nerve crush injury. Samples were stained with α-bungarotoxin (green), neurofilament (red), and synaptophysin (blue). Scale bar: 50 μm. 

      Going back to part (a) of this Reviewer’s comment, considering the data presented in Author response image 3, where innervation of axons into acetylcholine receptor clusters was not significantly different between Ctrl versus cKO mice, FAP-derived BDNF may not be critical for the axonal growth upon nerve injury. Although we acknowledge that additional experiments are required to draw a meaningful conclusion on this point, we could not perform such additional experiments due to insufficient time and resources.

      We hope that the Reviewer understands our limitation.

      Recommendations for the authors:

      [1] In citing the ability of BDNF to promote Schwann cell myelination the authors should include Chan et al., 2001 (PMID 11717413) in addition to the Zhang et al, 2000 and Xiao et al, 2009 references.

      - We apologize for missing out the reference mentioned by the Reviewer. We have added the suggested reference in our revised manuscript (lines 395, 425, and 517).

      Reviewer #2

      Public review:

      [1] Although, I find the data the authors generated enough for their claims. I do see them as relatively poor, and (a) a complementary analysis of protein expression would strengthen the paper through immunostaining of the different genes mentioned for FAPs and Schwann cells. The model is entirely supported by measuring mRNA levels and negative regulation of gene expression in specific cells. Additionally, (b) what happens to the structure of the neuromuscular junction after regeneration when GDNF or BDNF expression is reduced? (c) The determination of decreasing levels of FAPs BDNF mRNA during aging is interesting; is the gain of BDNF expression in FAPs reverting the phenotype?

      - (a) We appreciate and agree with the Reviewer’s comment that validation of BDNF protein expression in FAPs and GDNF protein expression in Schwann cells upon nerve injury would strengthen this paper. Regarding GDNF protein expression in Schwann cells upon nerve injury, it has already been demonstrated by previous studies (Höke et al., 2002; Xu et al., 2013). For BDNF protein expression in FAPs upon nerve injury, we performed western blot analysis for validation, as mentioned in the response to Reviewer #1 Public review [2]. The results showed that while the mature form of BDNF protein could not be readily detected in FAPs isolated from uninjured muscles, it could be detected in FAPs isolated from sciatic nerve crush injury-affected muscles at 7 days post injury. We have added the new result as Figure 4F in the New Figure 4 with the experimental scheme as New Figure 4—figure supplement 1, and revised the Results section (lines 364-374) and the Materials and Methods section (lines 687-705) in our manuscript to include the new results in detail.

      - (b) Though the data is preliminary, we examined the structures of neuromuscular junctions (NMJs) from control and Prrx1Cre; Bdnf fl/fl mice at 4 weeks post injury in the exterior digitorum longus muscles, as mentioned in the response to Reviewer #1 Publilc review [4](b). As a result, we could not identify significant differences between control versus Prrx1Cre; Bdnf fl/fl mice, where BDNF expression is reduced specifically in Prrx1-expressing cells, including FAPs (Attached Figure 3). Since other cellular sources of BDNF, such as Schwann cells, exist, regeneration of the NMJs may not have been as significantly affected as remyelination in our Prrx1Cre; Bdnf fl/fl mice. However, further experiments with a sufficient number of mice and more observation time points are required to statistically validate this hypothesis in detail. Unfortunately, preparing samples for such additional analyses would take more than four months, as we need to produce sufficient numbers of control and Prrx1Cre; Bdnf fl/fl mice that match the age groups used in this study. We hope that the Reviewer understands our limitation.

      Regarding analyzing NMJ structures after regeneration affected by reduced GDNF levels, using genetic models such as Plp1CreER; Gdnffl/fl mice would be appropriate, as we have used the Prrx1Cre; Bdnffl/fl mice in this study to reduce BDNF levels produced by FAPs. Unfortunately, we do not have the Gdnffl mice, and obtaining these mice to produce Plp1CreER; Gdnffl/fl mice and performing the additional experiment would take too much time for this current revision. In a further study, we will try to perform the additional experiment by obtaining the required mouse line. We hope that the Reviewer understands our limitation.

      - (c) We appreciate the Reviewer for highlighting this point. In this paper, we have shown that BDNF expression upon nerve injury is decreased in aged FAPs compared to young adult FAPs, and suggested that this may be one of the causes of the delayed nerve regeneration phenotype in aged mice. Previously, it has been reported that while intramuscular injection of BDNF accelerates nerve regeneration, intramuscular injection of anti-BDNF antibodies delays the regeneration process (Zheng et al., 2016). This implies that intramuscular levels of active BDNF can significantly influence the speed of nerve regeneration. Therefore, the gain of BDNF expression in aged FAPs may contribute to reversing the delayed nerve regeneration phenotype in aged mice, since it would result in additional supply of active, intramuscular BDNF, which has previously been shown to accelerate nerve regeneration. Though experimental validation is required to support such claim, we could not obtain sufficient numbers of aged mice within the limited time frame. We hope that the Reviewer understands our limitation.

      Recommendations for the authors:

      [1] The authors should include the experimental design and several drawings in the leading figures indicating, for example, how remyelination after injury was quantified and how the response of regenerated sciatic nerve to a depolarizing stimulus was studied.

      - We apologize for any confusion caused by insufficient information provided in the leading figures. Unfortunately, due to limited space, we could not add experimental designs or drawings in the leading figures. Instead, to do our best to comply with the

      Reviewer’s comment, we have revised the figure legends in the leading figures so that the experimental designs or diagrams can be referred to in the figure supplements.

      We hope that the Reviewer understands this limitation.

      Reviewer #3

      Public review:

      [1] In Fig. 1 and 2 authors provide data on scRNA seq and this is important information reporting the finding of RET and GFRa1 transcripts in the subpopulation of FAP cells. However, authors provide no data on the expression of RET and GFRa1 proteins in FAP cells.

      - Reply for this comment by the Reviewer is in the Recommendations for the authors section below ([2]), as the same comment is repeated.

      [2] Another problem is the lack of information showing that GDNF secreted by Schwann cells can activate RET and its down-stream signaling in FAP cells. There is no direct experimental proof that GDNF activating GFRa1-RET signaling triggers BDNF upregulation In FAP cells. The data that GDNF signaling is inducing the synthesis and secretion of BDNF is also not conclusive.

      - Reply for this comment by the Reviewer is in the Recommendations for the authors section below ([3]), as the same comment is repeated.

      Recommendations for the authors:

      [1] Although this is a novel study and contains very well-performed parts, the GDNF section is preliminary and requires additional experimentation. In the introduction authors describe well FAPs but even do not mention how GDNF is signaling. Moreover, the reader may get an impression that Ras-MAPK pathway is the only or at least the main GDNF signaling pathway. In fact, for neurons Akt and Src signaling pathways play also crucial role.

      - We apologize for the missing content in the Introduction section of our manuscript and for any confusion caused by our misleading description of the GDNF signaling pathway. We have revised our manuscript to include the GDNF signaling pathway in the Introduction section, along with a description of other downstream signaling pathways of GDNF that are known to play crucial roles, as mentioned by the Reviewer (lines 115-130). Additionally, we changed the expression in the Results section to avoid making any misleading impressions (lines 318-319).

      [2] In Fig. 1 and 2 authors provide data on scRNA seq and this is important information reporting the finding of RET and GFRa1 transcripts in the subpopulation of FAP cells. However, authors provide no data on the expression of RET and GFRa1 proteins in FAP cells.

      - We appreciate the Reviewer for the constructive comment. Though we fully agree with the Reviewer that validating the expression of RET and GFRα1 proteins in FAPs is needed, we were unable to obtain the antibodies required for such experiments within the limited time frame for this revision. We hope that the Reviewer understands our limitation. Although we could not directly show the expression of those GDNF receptor genes at the protein level in FAPs, based on the result where intramuscular GDNF injection could sufficiently induce Bdnf expression in FAPs compared to PBS control in the absence of nerve damage, it is likely that GDNF receptors are indeed expressed at the protein level in FAPs, since if otherwise, FAPs would not have been able to respond to the injected GDNF protein. Nevertheless, in a future study, we will try to validate the protein-level expression of GDNF receptors in FAPs to comply with the Reviewer’s suggestion and to further support this study.

      [3] Another problem is the lack of information showing that GDNF secreted by Schwann cells can activate RET and its down-stream signaling in FAP cells. Authors can monitor activation of MAPK pathway by detecting phospho-Erk and PI3 kinase-Akt pathway measuring phospho-S6 using immunohistochemistry. We can recommend to use the following antibodies: pErk1/2 (1:300, Cell Signaling, Cat# 4370L RRID:AB_2297462), pS6 (1:300, Cell Signaling, Cat# 4858L RRID:AB_1031194). These experiments are crucial because RET and GFRa1 proteins maybe not expressed at the sufficient level on the cell surface.

      - We sincerely appreciate the Reviewer’s constructive comment. In this study, we suggested that the GDNF-BDNF axis within FAPs would signal through the MAPK pathway based on the bioinformatic analysis of our single cell RNA-seq data and matching the results with the previously known pathways. We fully agree that monitoring the activation of the MAPK pathway and the PI3K-Akt pathway by immunohistochemistry would experimentally demostrate whether GDNF can activate those pathways within FAPs through GFRα1/RET activation. Unfortunately, we could not obtain the antibodies suggested by the Reviewer for this revision due to insufficient research funds and limited time frame. We hope that the Reviewer understands our limitation. In future studies, we will try to validate the detailed molecular pathway that mediates the GDNF-BDNF axis in FAPs by incorporating the methodology suggested by the Reviewer, along with implementation of genetic models such as Plp1CreER; Gdnffl/fl, Prrx1Cre; Retfl/fl or Prrx1Cre; Gfra1fl/fl to validate whether Schwann cell-derived

      GDNF can actually signal through its canonical receptor RET/GFRα1 expressed in FAPs to induce expression of BDNF upon nerve injury.

      [4] (a) There is no direct experimental proof that GDNF activating GFRa1-RET signaling triggers BDNF upregulation in FAP cells. Authors can use GDNF blocking antibodies, siRNA or use RET or GFRa1 cKO mice to delete them from FAP cells. (b) The data that GDNF signaling is inducing the synthesis and secretion of BDNF is also not conclusive. Authors should show that GDNF injection is increasing BDNF protein levels in FAPs. To get sufficient material for ELISA detection of BDNF is perhaps problematic. However, authors can use BDNF antibodies from Icosagen company and use IHC.

      - (a) We appreciate the Reviewer for the critical comment. As mentioned in the reply for Reviewer #1 Public review [3], we used GDNF blocking antibodies to reduce GDNF signaling within the tibialis anterior and gastrocnemius muscles by intramuscular injection after sciatic nerve crush injury, and included the result as a new figure supplement in our revised manuscript (New Figure 4—figure supplement 2) with its details in both the Results section (lines 381-390) and the Materials and Methods section (lines 611-616). Though the results were not statistically significant, intramuscular injection of anti-GDNF antibodies showed a tendency toward reduced Bdnf expression in FAPs, compared to IgG controls. As mentioned in the reply for Reviewer #1 Public review [3], and as suggested by the Reviewer, using cKO mice such as Plp1CreER; Gdnffl/fl, Prrx1Cre; Retfl/fl, or Prrx1Cre; Gfra1fl/fl mice would further validate the GDNF-BDNF axis suggested in this study, likely with statistical significance. Unfortunately, obtaining these genetic models within the limited time frame of this current revision is not feasible. We will try to adopt such models in our future study to validate the role of Schwann cell-derived GDNF in inducing BDNF expression in FAPs via activation of RET/GFRα1.  

      - (b) We appreciate the Reviewer for the constructive comment. Though we fully agree that the experiment suggested by the Reviewer would validate the synthesis and secretion of BDNF protein by GDNF signaling in FAPs, we were not able to perform it due to lack of research funds to obtain enough amount of the GDNF protein. We hope that the Reviewer understands our limitation. Still, combining the results from New Figure 4H in this study with the New Figure 4F, where GDNF injection induced Bdnf mRNA expression in FAPs, and BDNF protein expression in FAPs in response to nerve injury was demonstrated via western blot, we anticipate that GDNF injection would increase BDNF protein levels in FAPs, though direct validation of this statement would require conducting the additional experiments mentioned by the Reviewer.

      References

      Chan JR, Cosgaya JM, Wu YJ, and Shooter EM (2001). Neurotrophins are key mediators of the myelination program in the peripheral nervous system. Proceedings of the National Academy of Sciences 98:14661-14668.

      English AW, Liu K, Nicolini JM, Mulligan AM, and Ye K (2013). Small-molecule trkB agonists promote axon regeneration in cut peripheral nerves. Proc Natl Acad Sci U S A 110:16217-22.10.1073/pnas.1303646110

      Giordani L, He GJ, Negroni E, Sakai H, Law JY, Siu MM, Wan R, Corneau A, Tajbakhsh S, and Cheung TH (2019). High-dimensional single-cell cartography reveals novel skeletal muscle-resident cell populations. Molecular Cell 74:609-621. e6.

      Höke A, Gordon T, Zochodne D, and Sulaiman O (2002). A decline in glial cell-linederived neurotrophic factor expression is associated with impaired regeneration after long-term Schwann cell denervation. Experimental neurology 173:77-85.

      Kim J-H, Kang J-S, Yoo K, Jeong J, Park I, Park JH, Rhee J, Jeon S, Jo Y-W, and Hann S-H (2022). Bap1/SMN axis in Dpp4+ skeletal muscle mesenchymal cells regulates the neuromuscular system. JCI Insight 7:

      Leinroth AP, Mirando AJ, Rouse D, Kobayahsi Y, Tata PR, Rueckert HE, Liao Y, Long JT, Chakkalakal JV, and Hilton MJ (2022). Identification of distinct non-myogenic skeletal-muscle-resident mesenchymal cell populations. Cell Reports 39:

      Liu L, Cheung TH, Charville GW, and Rando TA (2015). Isolation of skeletal muscle stem cells by fluorescence-activated cell sorting. Nature protocols 10:1612-1624.

      Oudega M, and Hagg T (1998). Neurotrophins promote regeneration of sensory axons in the adult rat spinal cord. Brain Research 818:431-438.10.1016/S0006-8993(98)01314-6

      Xiao J, Wong AW, Willingham MM, Kaasinen SK, Hendry IA, Howitt J, Putz U, Barrett GL, Kilpatrick TJ, and Murray SS (2009). BDNF exerts contrasting effects on peripheral myelination of NGF-dependent and BDNF-dependent DRG neurons. J Neurosci 29:4016-22.10.1523/JNEUROSCI.3811-08.2009

      Xu P, Rosen KM, Hedstrom K, Rey O, Guha S, Hart C, and Corfas G (2013). Nerve injury induces glial cell linederived neurotrophic factor (gdnf) expression in schwann cells through purinergic signaling and the pkcpkd pathway. Glia 61:1029-1040.

      Zhang JY, Luo XG, Xian CJ, Liu ZH, and Zhou XF (2000). Endogenous BDNF is required for myelination and regeneration of injured sciatic nerve in rodents. European Journal of Neuroscience 12:4171-4180.10.1111/j.1460-9568.2000.01312.x

      Zheng J, Sun J, Lu X, Zhao P, Li K, and Li L (2016). BDNF promotes the axonal regrowth after sciatic nerve crush through intrinsic neuronal capability upregulation and distal portion protection. Neuroscience letters 621:1-8.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This manuscript reports valuable findings on the role of the Srs2 protein in turning off the DNA damage signaling response initiated by Mec1 (human ATR) kinase. The data provide solid evidence that Srs2 interaction with PCNA and ensuing SUMO modification is required for checkpoint downregulation. However, experimental evidence with regard to the model that Srs2 acts at gaps after camptothecin-induced DNA damage is currently lacking. The work will be of interest to cell biologists studying genome integrity but would be strengthened by considering the possible role of Rad51 and its removal. 

      We thank editors and reviewers for their constructive comments and address their main criticisms below. 

      (1)  Srs2 action sites. Our data provide support to the model that Srs2 removal of RPA is favored at ssDNA regions with proximal PCNA, but not at ssDNA regions lacking proximal PCNA. A prominent example of the former type of ssDNA regions is an ssDNA gap with a 3’ DNA end permissive for PCNA loading. Examples of the latter type of ssDNA sites include those within R-loops and negatively supercoiled regions, both lacking 3’ DNA end required for PCNA loading. The former type of ssDNA regions can recruit other DNA damage checkpoint proteins, such as 9-1-1, which requires a 5’ DNA end for loading; thus, these ssDNA regions are ideal for Srs2’s action in checkpoint dampening. In contrast, ssDNA within supercoiled and Rloop regions, both of which can be induced by CPT treatment (Pommier et al, 2022), lacks the DNA ends required for checkpoint activation. RPA loaded at these sites plays important roles, such as recruiting Rloop removal factors (Feng and Manley, 2021; Li et al, 2024; Nguyen et al, 2017), and they are not ideal sites for Srs2’s checkpoint dampening functions. Based on the above rationale and our data, we suggest that Srs2 removal of RPA is favored only at a subset of ssDNA regions prone to checkpoint activation and can be avoided at other ssDNA regions where RPA mainly helps DNA protection and repair. We have modified the text and model drawing to better articulate the implications of our work, that is, Srs2 can distinguish between two types of ssDNA regions by using PCNA proximity as a guide for RPA removal_._ We noted that the precise sites of Srs2 actions in the genome remain to be determined. 

      (2)  Rad51 in the Srs2-RPA antagonism. In our previous report (Dhingra et al, 2021), we provided several lines of evidence to support the conclusion that Rad51 is not relevant to the Srs2-RPA antagonism, despite it being the best-studied protein that is regulated by Srs2. For example, while rad51∆ rescues the hyperrecombination phenotype of srs2∆ cells as shown by others, we found that rad51∆ did not affect the hypercheckpoint phenotype of srs2∆. In contrast, rfa1-zm1/zm2 have the opposite effects. The differential effects of rad51∆ and rfa1-zm1/zm2 were also seen for the srs2-ATPase dead allele (srs2-K41A). For example, rfa1-zm2 rescued the hyper-checkpoint defect and the CPT sensitivity of srs2-K41A, while rad51∆ had neither effect. These and other data described by Dhingra et al (2021) suggest that Srs2’s effects on checkpoint vs. recombination can be separated and that Rad51 removal by Srs2 is distinct from the Srs2RPA antagonism in checkpoint regulation. Given the functional separation summarized above, in our current work investigating which Srs2 features affect the Srs2-RPA antagonism, we did not focus on the role of Rad51. However, we did examine all known features of Srs2, including its Rad51 binding domain. Consistent with our conclusion summarized above, deleting the Rad51 binding domain in Srs2 (srs2∆Rad51BD) has no effect on rfa1-zm2 phenotype in CPT (Figure 2D). This data provides yet another evidence that Srs2 regulation of Rad51 is separable from the Srs2-RPA antagonism. Our work provides a foundation for future examination of how Srs2 regulates RPA and Rad51 in different manners and if there is a crosstalk between them in specific contexts. We have added this point to the revised text.

      Public Reviews: 

      Reviewer #1.

      Overall, the data presented in this manuscript is of good quality. Understanding how cells control RPA loading on ssDNA is crucial to understanding DNA damage responses and genome maintenance mechanisms. The authors used genetic approaches to show that disrupting PCNA binding and SUMOylation of Srs2 can rescue the CPT sensitivity of rfa1 mutants with reduced affinity for ssDNA. In addition, the authors find that SUMOylation of Srs2 depends on binding to PCNA and the presence of Mec1. Noted weaknesses include the lack of evidence supporting that Srs2 binding to PCNA and its SUMOylation occur at ssDNA gaps, as proposed by the authors. Also, the mutants of Srs2 with impaired binding to PCNA or impaired SUMOylation showed no clear defects in checkpoint dampening, and in some contexts, even resulted in decreased Rad53 activation. Therefore, key parts of the paper would benefit from further experimentation and/or clarification. 

      We thank the reviewer for the positive comments, and we address her/his remark regarding ssDNA gaps below. In addition, we provide evidence that redundant pathways can mask checkpoint dampening phenotype of the srs2-∆PIM and -3KR alleles.

      Major Comments 

      (1) The central model proposed by the authors relies on the loading of PCNA at the 3' junction of an ssDNA gap, which then mediates Srs2 recruitment and RPA removal. While several aspects of the model are consistent with the data, the evidence that it is occurring at ssDNA gaps is not strong. The experiments mainly used CPT, which generates mostly DSBs. The few experiments using MMS, which mostly generates ssDNA gaps, show that Srs2 mutants lead to weaker rescue in this context (Figure S1). How do the authors explain this discrepancy? In the context of DSBs, are the authors proposing that Srs2 is engaging at later steps of HRdriven DSB repair where PCNA gets loaded to promote fill-in synthesis? If so, is RPA removal at that step important for checkpoint dampening? These issues need to be addressed and the final model adjusted. 

      Our data provide supports to the model that Srs2 removal of RPA is favored at ssDNA regions with proximal PCNA, but not at ssDNA regions lacking proximal PCNA (Figure 7). A prominent example of the former type is ssDNA gap with 3’ DNA end permissive for PCNA loading. Examples of the latter type of ssDNA sites are present within R-loops and negatively supercoiled regions, and these ssDNA sites lack 3’ DNA ends required for PCNA loading. In principle, the former can recruit other DNA damage checkpoint proteins, such as 9-1-1, which requires 5’ DNA end for loading, thus it is ideal for Srs2’s action in checkpoint dampening. In contrast, ssDNA within supercoiled and R-loop regions, which can be induced by CPT treatment (Pommier et al., 2022), lacks DNA ends required for checkpoint activation. RPA loaded at these sites plays important roles such as recruiting R-loop removal factors (Feng and Manley, 2021; Li et al., 2024; Nguyen et al., 2017), and these are not ideal sites for Srs2 removal of RPA to achieve checkpoint dampening. Our work suggests that Srs2 removal of RPA is favored only at a subset of ssDNA regions prone to checkpoint activation and can be avoided at other ssDNA regions where RPA mainly helps DNA protection and repair. We have modified the text and the model to clarify our conclusions and emphasized that Srs2 can distinguish between two types of ssDNA regions using PCNA proximity as a guide for RPA removal. 

      We note that in addition to DSBs, CPT also induces both types of ssDNA mentioned above. For example, CPT can lead to ssDNA gap formation upon excision repair or DNA-protein crosslink repair of trapped Top1 (Sun et al, 2020). The resultant ssDNA regions contain 3’ DNA end for PCNA loading, thus favoring Srs2 removal of RPA. CPT treatment also depletes the functional pool of Top1, thus causing topological stress and increased levels of DNA supercoiling and R-loops (Petermann et al, 2022; Pommier et al., 2022). As mentioned above, R-loops and supercoiled regions do not favor Srs2 removal of RPA due to a lack of PCNA loading. We have now adjusted the text to clarify that CPT can lead to the generation of two types of ssDNA regions as stated above. We have also adjusted the model drawing to indicate that while ssDNA gaps can be logical Srs2 action sites, other types of ssDNA regions with proximal PCNA (e.g., resected ssDNA tails) could also be targeted by Srs2. Our work paves the way to determine the precise ssDNA regions for Srs2’s action. 

      Multiple possibilities should be considered in explaining the less potent suppression of rfa1 mutants by srs2 alleles in MMS compared to CPT conditions. For example, MMS and CPT affect checkpoints differently. While CPT only activates the DNA damage checkpoint, MMS additionally induces DNA replication checkpoint (Menin et al, 2018; Redon et al, 2003; Tercero et al, 2003). It is possible that the Srs2-RPA antagonism is more relevant to the DNA damage checkpoint compared with the DNA replication checkpoint. Further investigation of this possibility among other scenarios will shed light on differential suppression seen here. We have included this discussion in the revised text.

      (2) The data in Figure 3 showing that Srs2 mutants reduce Rad53 activation in the rfa1-zm2 mutant are confusing, especially given the claim of an anti-checkpoint function for Srs2 (in which case Srs2 mutants should result in increased Rad53 activation). The authors propose that Rad53 is hyperactivated in rfa1-zm2 mutant because of compromised ssDNA protection and consequential DNA lesions, however, the effects sharply contrast with the central model. Are the authors proposing that in the rfa1-zm2 mutant, the compromised protection of ssDNA supersedes the checkpoint-dampening effect?  Perhaps a schematic should be included in Figure 3 to depict these complexities and help the reader. The schematic could also include the compensatory dampening mechanisms like Slx4 (on that note, why not move Figure S2 to a main figure?... and even expand experiments to better characterize the compensatory mechanisms, which seem important to help understand the lack of checkpoint dampening effect in the Srs2 mutants) 

      Partially defective alleles often do not manifest null phenotype. In this case, while srs2∆ increases Rad53 activation (Dhingra et al., 2021), srs2-∆PIM and -3KR did not (Figure 3A-3B). However, srs2-∆PIM did increase Rad53 activation when combined with another checkpoint dampening mutant slx4<sup>RIM</sup> (now Figure 4B-4C). This result suggests that defects of partially defective srs2 alleles can be masked by Slx4. Further, srs2-∆PIM and 3KR rescued rfa1-zm2’s checkpoint abnormality (now Figure 3B-3C), suggesting that Srs2 binding to PCNA and its sumoylation contribute to the Srs2-RPA antagonism in the DNA damage checkpoint response.

      Partially defective alleles that impair specific features of a protein without producing null phenotype have been used widely to reveal biological mechanisms. For example, a partially defective allele of the checkpoint protein Rad9 perturbing binding to gamma-H2A (rad9-K1088M) does not cause DNA damage sensitivity on its own, due to the compensation from other checkpoint factors (Hammet et al, 2007). However_, rad9-K1088M_ rescues the DNA damage sensitivity and persistent G2/M checkpoint of slx4 mutants, providing strong evidence for the notion that Slx4 dampens checkpoint via regulating Rad9 (Ohouo et al, 2013).

      We have now indicated that our model highlights the checkpoint recovery process and does not depict another consequence of the Srs2-RPA antagonism, that is, rfa1 DNA binding mutants can lead to increased levels of DNA lesions and consequently stronger checkpoint activation, which are rescued by lessening Srs2’s ability to strip RPA from DNA (Dhingra et al., 2021). We have stated these points more clearly in the text and added a schematic (Figure 3A) to outline the genetic relationship and interpretations. We also moved Figure S2 to the main figures (Figure 4), as suggested by the reviewer. Better characterizing the compensatory mechanisms among the multiple checkpoint dampening pathways requires substantial amounts of work that will be pursued in the future.

      (3) The authors should demarcate the region used for quantifying the G1 population in Figure 3B and explain the following discrepancy: By inspection of the cell cycle graph, all mutants have lower G1 peak height compared to WT (CPT 2h). However, in the quantification bar graph at the bottom, ΔPIM has higher G1 population than the WT. 

      We now describe how the G1 region of the FACS histogram was selected to derive the percentage of G1 cells in Figure 3B (now Figure 3C). Briefly, the G1 region from the “G1 sample” was used to demarcate the G1 region of the “CPT 2h” sample. We noticed that a mutant panel was mistakenly put in the place of wild-type, and this error is now corrected. The conclusion remains that srs2-∆PIM and srs2-3KR improved rfa1-zm2 cells’ ability to exit G2/M, while they themselves do not show difference from the wild-type control for the percentage of G1 cells after 2hr CPT treatment. We have added statistics in Figure 3C that support this conclusion.

      Reviewer #2:

      This is an interesting paper that delves into the post-translational modifications of the yeast Srs2 helicase and proteins with which it interacts in coping with DNA damage. The authors use mutants in some interaction domains with RPA and Srs2 to argue for a model in which there is a balance between RPA binding to ssDNA and Srs2's removal of RPA. The idea that a checkpoint is being regulated is based on observing Rad53 and Rad9 phosphorylation (so there are the attributes of a checkpoint), but evidence of cell cycle arrest is lacking. The only apparent delay in the cell cycle is the re-entry into the second S phase (but it could be an exit from G2/M); but in any case, the wild-type cells enter the next cell cycle most rapidly. No direct measurement of RPA residence is presented. 

      We thank the reviewer for the helpful comments. Previous studies have shown that CPT does not induce the DNA replication checkpoint, and thus does not slow down or arrest S phase progression; however, CPT does induce the DNA damage checkpoint, which causes a delay (not arrest) in G2/M phase and re-entering into the second G1 (Menin et al., 2018; Redon et al., 2003). Our result is consistent with these findings, showing that CPT induces G2/M delay but not arrest. We have now made this point clearer in the text.

      We have previously reported chromatin-bound RPA levels in rfa1-zm2, srs2, and their double mutants, as well as in vitro ssDNA binding by wild-type and mutant RPA complexes (Dhingra et al., 2021). These data showed that Srs2 loss or its ATPase dead mutant led to 4-6-fold increase of RPA levels on chromatin, which was rescued by rfa1-zm2 (Dhingra et al., 2021). On its own, rfa1-zm2 did not cause defective chromatin association, despite modestly reducing ssDNA binding in vitro (Dhingra et al., 2021). This discrepancy could be due to a lack of sensitivity of the chromatin fractionation assay in revealing moderate changes of RPA residence on DNA in vivo. Our functional assays (Figure 2-3) were more effective in identifying the Srs2 features pertaining to RPA regulation. 

      Strengths:

      Data concern viability assays in the presence of camptothecin and in the post-translational modifications of Srs2 and other proteins.  

      Weaknesses:

      There are a couple of overriding questions about the results, which appear technically excellent. Clearly, there is an Srs2-dependent repair process here, in the presence of camptothecin, but is it a consequence of replication fork stalling or chromosome breakage? Is repair Rad51-dependent, and if so, is Srs2 displacing RPA or removing Rad51 or both? If RPA is removed quickly what takes its place, and will the removal of RPA result in lower DDC1-MEC1 signaling? 

      Srs2 can affect both the checkpoint response and DNA repair processes in CPT conditions. However, rfa1zm2 mainly affects the former role of Srs2; this allows us to gain a deeper understanding of this role, which is critical for cell survival in CPT (Dhingra et al., 2021). Building on this understanding, our current study identified two Srs2 features that could afford spatial and temporal regulation of RPA removal from DNA, providing a rationale for how cells can properly utilize an activity that can be beneficial yet also dangerous if it were to lack regulation. Study of Srs2-mediated DNA repair in CPT conditions, either in Rad51-dependent or -independent manner, to deal with replication fork stalling or DNA breaks will require studies in the future.

      Moreover, it is worth noting that in single-strand annealing, which is ostensibly Rad51 independent, a defect in completing repair and assuring viability is Srs2-dependent, but this defect is suppressed by deleting Rad51. Does deleting Rad51 have an effect here? 

      We have previously shown that rad51∆ did not rescue the hyper-checkpoint phenotype of srs2∆ cells in CPT conditions, while rfa1-zm1 and -zm2 did (Dhingra et al., 2021). This differential effect was also seen for the srs2 ATPase-dead allele (Dhingra et al., 2021). These and other data described by Dhingra et al (2021) suggest that Srs2’s effects on checkpoint vs. recombination are separable at least in CPT condition, and that the Srs2-RPA antagonism in checkpoint regulation is not affected by Rad51 removal (unlike in SSA).

      Neither this paper nor the preceding one makes clear what really is the consequence of having a weakerbinding Rfa1 mutant. Is DSB repair altered? Neither CPT nor MMS are necessarily good substitutes for some true DSB assay. 

      We have previously showed that rfa1-zm1/zm2 did not affect the frequencies of rDNA recombination, gene conversation, or direct repeat repair (Dhingra et al., 2021). Further, rfa1-zm1/zm2 did not suppress the hyperrecombination phenotype of srs2∆, while rad51∆ did (Dhingra et al., 2021). In a DSB system, wherein the DNA repeats flanking the break were placed 30 kb away from each other, srs2∆ led to hyper-checkpoint and lethality, both of which were rescued by rfa1-zm mutants (Dhingra et al., 2021). In this assay, rfa1-zm1/zm2 did not show sensitivity, suggesting largely proficient DNA repair. Collectively, these data suggest that moderately weakening DNA binding of Rfa1 does not lead to detectable effect on the recombinational repair examined thus far, rather it affects Srs2-mediated checkpoint downregulation. In-depth studies of rfa1-zm mutations in the context of various DSB repair steps will be interesting to pursue in the future.

      With camptothecin, in the absence of site-specific damage, it is difficult to test these questions directly. (Perhaps there is a way to assess the total amount of RPA bound, but ongoing replication may obscure such a measurement). It should be possible to assess how CPT treatment in various genetic backgrounds affects the duration of Mec1/Rad53-dependent checkpoint arrest, but more than a FACS profile would be required. 

      Quantitative measurement of RPA residence time on DNA in cellular context and the duration of the

      Mec1/Rad53-mediated cell cycle delay/arrest will be informative but requires further technology development. Our current work provides a foundation for such quantitative assessment.

      It is also notable that MMS treatment does not seem to yield similar results (Fig. S1). 

      Figure S1 showed that srs2-∆PIM and srs2-3KR had weaker suppression of rfa1-zm2 growth on MMS plates than on CPT plates. Multiple possibilities should be considered in explaining the less potent suppression of rfa1 mutants by srs2 in MMS compared with CPT conditions. For example, MMS and CPT affect checkpoints differently. While CPT only activates the DNA damage checkpoint, MMS additionally induces DNA replication checkpoint (Menin et al., 2018; Redon et al., 2003; Tercero et al., 2003). It is therefore possible that the Srs2RPA antagonism is more relevant for the DNA damage checkpoint control compared with the DNA replication checkpoint. Further investigation of this possibility will shed light on differential suppression seen here. We have included this discussion in the revised text.

      Reviewer #3:

      The superfamily I 3'-5' DNA helicase Srs2 is well known for its role as an anti-recombinase, stripping Rad51 from ssDNA, as well as an anti-crossover factor, dissociating extended D-loops and favoring non-crossover outcome during recombination. In addition, Srs2 plays a key role in ribonucleotide excision repair. Besides DNA repair defects, srs2 mutants also show a reduced recovery after DNA damage that is related to its role in downregulating the DNA damage signaling or checkpoint response. Recent work from the Zhao laboratory (PMID: 33602817) identified a role of Srs2 in downregulating the DNA damage signaling response by removing RPA from ssDNA. This manuscript reports further mechanistic insights into the signaling downregulation function of Srs2. 

      Using the genetic interaction with mutations in RPA1, mainly rfa1-zm2, the authors test a panel of mutations in Srs2 that affect CDK sites (srs2-7AV), potential Mec1 sites (srs2-2SA), known sumoylation sites (srs2-3KR), Rad51 binding (delta 875-902), PCNA interaction (delta 1159-1163), and SUMO interaction (srs2SIMmut). All mutants were generated by genomic replacement and the expression level of the mutant proteins was found to be unchanged. This alleviates some concern about the use of deletion mutants compared to point mutations. The double mutant analysis identified that PCNA interaction and SUMO sites were required for the Srs2 checkpoint dampening function, at least in the context of the rfa1-zm2 mutant. There was no effect of these mutants in a RFA1 wild-type background. This latter result is likely explained by the activity of the parallel pathway of checkpoint dampening mediated by Slx4, and genetic data with an Slx4 point mutation affecting Rtt107 interaction and checkpoint downregulation support this notion. Further analysis of Srs2 sumoylation showed that Srs2 sumoylation depended on PCNA interaction, suggesting sequential events of Srs2 recruitment by PCNA and subsequent sumoylation. Kinetic analysis showed that sumoylation peaks after maximal Mec1 induction by DNA damage (using the Top1 poison camptothecin (CPT)) and depended on Mec1. These data are consistent with a model that Mec1 hyperactivation is ultimately leading to signaling downregulation by Srs2 through Srs2 sumoylation. Mec1-S1964 phosphorylation, a marker for Mec1 hyperactivation and a site found to be needed for checkpoint downregulation after DSB induction did not appear to be involved in checkpoint downregulation after CPT damage. The data are in support of the model that Mec1 hyperactivation when targeted to RPA-covered ssDNA by its Ddc2 (human ATRIP) targeting factor, favors Srs2 sumoylation after Srs2 recruitment to PCNA to disrupt the RPA-Ddc2-Mec1 signaling complex. Presumably, this allows gap filling and disappearance of long-lived ssDNA as the initiator of checkpoint signaling, although the study does not extend to this step. 

      Strengths 

      (1) The manuscript focuses on the novel function of Srs2 to downregulate the DNA damage signaling response and provide new mechanistic insights. 

      (2) The conclusions that PCNA interaction and ensuing Srs2-sumoylation are involved in checkpoint downregulation are well supported by the data. 

      We thank the reviewer for carefully reading our work and for his/her positive comments. 

      Weaknesses 

      (1) Additional mutants of interest could have been tested, such as the recently reported Pin mutant, srs2Y775A (PMID: 38065943), and the Rad51 interaction point mutant, srs2-F891A (PMID: 31142613). 

      Residue Y775 of Srs2 was shown to serve as a separation pin in unwinding D-loops and dsDNA with 3’ overhang in vitro; however, srs2-Y775A lacks cellular phenotype in assays for gene conversion, crossover, and genetic interactions. As such, the biological role of this residue has not been clear. In addressing reviewer’s comment, we obtained srs2-Y775A, and the control strains as described in the recent publication (Meir et al, 2023). While srs2-Y775A on its own did not affect CPT sensitivity, it improved rfa1-zm_2 mutant growth on media containing CPT. This result suggests that Y775 can influence RPA regulation during in checkpoint dampening. Given that truncated Srs2 (∆Cter 276 a.a.) containing Y775A showed normal RPA stripping activity _in vitro, it is possible that cellular assay using rfa1-zm2 is more sensitive for revealing defect of this activity or full-length protein is required for manifest Y775A effect. Future experiments distinguishing these possibilities can provide more clarity. Nevertheless, our result reveals the first phenotype of Srs2 separation pin mutant. We have added this new result (Figure S4) and our interpretation.

      We have already included data showing that a srs2 mutant lacking the Rad51 binding domain (srs2∆Rad51BD, ∆875-902) did not affect rfa1-zm2 growth in CPT nor caused defects in CPT on its own (Figure 2D). This data suggest that Rad51 binding is not relevant to the Srs2-RPA antagonism in CPT, a conclusion fully supported by data in our previous study (Dhingra et al., 2021). Collectively, these findings do not provide a strong rationale to test a point mutation within the Rad51BD region. 

      (2) The use of deletion mutants for PCNA and RAD51 interaction is inferior to using specific point mutants, as done for the SUMO interaction and the sites for post-translational modifications. 

      We generally agree with this view. However, it is less of a concern in the context of the Rad51 binding site mutant (srs2-∆Rad51BD) since it behaved as the wild-type allele in our assays. The srs2-∆PIM mutant (lacking 4 amino acids) has been examined for PCNA binding in vitro and in vivo (Kolesar et al, 2016; Kolesar et al, 2012); to our knowledge no detectable defect was reported. Thus, we believe that this allele is suitable for testing whether Srs2’s ability to bind PCNA is relevant to RPA regulation.

      (3) Figure 4D and Figure 5A report data with standard deviations, which is unusual for n=2. Maybe the individual data points could be plotted with a color for each independent experiment to allow the reader to evaluate the reproducibility of the results. 

      We have included individual data points as suggested and corrected figure legend to indicate that three independent biological samples per genotype were examined in both panels.

      References:

      Dhingra N, Kuppa S, Wei L, Pokhrel N, Baburyan S, Meng X, Antony E, Zhao X (2021) The Srs2 helicase dampens DNA damage checkpoint by recycling RPA from chromatin. Proc Natl Acad Sci U S A 118: e2020185118.

      Feng S, Manley JL (2021) Replication Protein A associates with nucleolar R loops and regulates rRNA transcription and nucleolar morphology. Genes Dev 35: 1579-1594.

      Fiorani S, Mimun G, Caleca L, Piccini D, Pellicioli A (2008) Characterization of the activation domain of the Rad53 checkpoint kinase. Cell Cycle 7: 493-499.

      Hammet A, Magill C, Heierhorst J, Jackson SP (2007) Rad9 BRCT domain interaction with phosphorylated H2AX regulates the G1 checkpoint in budding yeast. EMBO Rep 8: 851-857.

      Kolesar P, Altmannova V, Silva S, Lisby M, Krejci L (2016) Pro-recombination role of Srs2 protein requires SUMO (Small Ubiquitin-like Modifier) but is independent of PCNA (Proliferating Cell Nuclear Antigen) interaction. J Biol Chem 291: 7594-7607.

      Kolesar P, Sarangi P, Altmannova V, Zhao X, Krejci L (2012) Dual roles of the SUMO-interacting motif in the regulation of Srs2 sumoylation. Nucleic Acids Res 40: 7831-7843.

      Li Y, Liu C, Jia X, Bi L, Ren Z, Zhao Y, Zhang X, Guo L, Bao Y, Liu C et al (2024) RPA transforms RNase H1 to a bidirectional exoribonuclease for processive RNA-DNA hybrid cleavage. Nat Commun 15: 7464.

      Meir A, Raina VB, Rivera CE, Marie L, Symington LS, Greene EC (2023) The separation pin distinguishes the pro- and anti-recombinogenic functions of Saccharomyces cerevisiae Srs2. Nat Commun 14: 8144.

      Memisoglu G, Lanz MC, Eapen VV, Jordan JM, Lee K, Smolka MB, Haber JE (2019) Mec1(ATR) autophosphorylation and Ddc2(ATRIP) phosphorylation regulates dna damage checkpoint signaling. Cell Rep 28: 1090-1102 e1093.

      Menin L, Ursich S, Trovesi C, Zellweger R, Lopes M, Longhese MP, Clerici M (2018) Tel1/ATM prevents degradation of replication forks that reverse after Topoisomerase poisoning. EMBO Rep 19: e45535.

      Nguyen HD, Yadav T, Giri S, Saez B, Graubert TA, Zou L (2017) Functions of Replication Protein A as a sensor of R loops and a regulator of RNaseH1. Mol Cell 65: 832-847 e834.

      Ohouo PY, Bastos de Oliveira FM, Liu Y, Ma CJ, Smolka MB (2013) DNA-repair scaffolds dampen checkpoint signalling by counteracting the adaptor Rad9. Nature 493: 120-124.

      Papouli E, Chen S, Davies AA, Huttner D, Krejci L, Sung P, Ulrich HD (2005) Crosstalk between SUMO and ubiquitin on PCNA is mediated by recruitment of the helicase Srs2p. Mol Cell 19: 123-133.

      Petermann E, Lan L, Zou L (2022) Sources, resolution and physiological relevance of R-loops and RNA-DNA hybrids. Nat Rev Mol Cell Biol 23: 521-540.

      Pommier Y, Nussenzweig A, Takeda S, Austin C (2022) Human topoisomerases and their roles in genome stability and organization. Nat Rev Mol Cell Biol 23: 407-427.

      Redon C, Pilch DR, Rogakou EP, Orr AH, Lowndes NF, Bonner WM (2003) Yeast histone 2A serine 129 is essential for the efficient repair of checkpoint-blind DNA damage. EMBO Rep 4: 678-684.

      Sun Y, Saha S, Wang W, Saha LK, Huang SN, Pommier Y (2020) Excision repair of topoisomerase DNAprotein crosslinks (TOP-DPC). DNA Repair (Amst) 89: 102837.

      Tercero JA, Longhese MP, Diffley JFX (2003) A central role for DNA replication forks in checkpoint activation and response. Mol Cell 11: 1323-1336.

      Reviewer #1 (Recommendations For The Authors): 

      (1) "the srs2-ΔPIM (Δ1159-1163 amino acids)". "11" should not be italic.

      Corrected.

      (2) "the srs2-SIMmut (1170 IIVID 1173 to 1170 AAAAD 1173)". "1173" should be 1174.

      Corrected.

      (3) Can Slx4-RIM mutant rescue rfa1-zm2 CPT sensitivity?  

      We found that unlike srs2∆, slx4∆ failed to rescue rfa1-zm2 CPT sensitivity (picture on the right). On the other hand, slx4∆ counteracts Rad9-dependent Rad53 activation as shown by Ohouo et al (2013). 

      Author response image 1.

      (4) One genotype (rfa1-zm2 srs2-3KR) is missing in Figure 5B.

      Corrected.

      (5) In Fig. S2C, FACS plots do not match the bar graph (see major concern 3). 

      Corrected and is described in more detail in Major Concern #3.

      Reviewer #2 (Recommendations For The Authors): 

      Figure 1. The colors in A are not well-conserved in B.

      Colors for srs2-7AV and -2SA in panel B are now matched with those in panel A.

      Figure 2. Is srs2-SIMmut the same as srs2-sim? 

      This mutant allele is now referred to as srs2-SIM<sup>mut</sup> throughout the text and figures.

      The suppression of rfa1-zm2 and (less strongly) rfa-t33 by the Srs2 mutants is interesting. Based on previous data, the suppression is apparently mutual, though it isn't shown here, unless we misunderstand. 

      We have previously shown that rfa1-zm2 and srs2∆ showed mutual suppression (Dhingra et al 2021 PNAS) and have included an example in Figure S1A. Unlike srs2∆, srs2-∆PIM and -3KR showed little damage sensitivity and DDC defects, likely due to the compensation by the Slx4-mediated checkpoint dampening (detailed in the Public Review section). Suppression is not applicable toward mutants lacking a phenotype, though the mutants could confer suppression when there is a functional relationship with another mutant, as we see here toward rfa1-zm2.

      Is Srs2 interaction with PCNA dependent on its ubiquitylation or SUMO? Does PCNA mutant K164R mimic this mutation? (this may well be known; our ignorance). 

      It was known that Srs2 can bind unmodified PCNA, though SUMO enhances this interaction; however, a very small percentage of PCNA is sumoylated in cells and PCNA sumoylation affects both Srs2-dependent and independent processes (e.g., (Papouli et al, 2005). As such, the genetic interaction of K164R with rfa1-zm2 can be difficult to interpret.

      Why srs2-7AV or srs2-sim make rfa1-zm2 even more sensitive is also not obvious. The authors take refuge in the statement that Srs2 "has multiple roles in cellular survival of genotoxic stress" but don't attempt to be more precise. 

      Our understanding of srs2-7AV and -sim is limited; thus, more specific speculation cannot be made at this time.

      Figure 3. It is striking (Figure 3A) that all the cells have reached G2 an hour after releasing from alpha-factor arrest, even though presumably CPT treatment must impair replication. It is even more striking that there is apparently no G2/M arrest in the presumably damaged cells as the WT (Figure 3B) has the most rapid progression through the cell cycle. How does this compare with cells in the absence of CPT? The idea that CPT is triggering Rad53-mediated response is hard to understand if there is in fact no delay in the cell cycle. Instead, the several mutants appear to delay re-entry into S... Or maybe it is actually an exit from G2/M? 

      This phenomenon needs a better explanation. 

      CPT does not induce the DNA replication checkpoint nor S phase delay, explaining apparent G2 content by the one hour time point; however, CPT does induce the DNA damage checkpoint, and a delay (not arrest) in G2/M (Menin et al., 2018; Redon et al., 2003; Tercero et al., 2003). We confirmed these findings. In our hand, wildtype G1 cells released into the cell cycle in the absence of CPT complete the first cell cycle within 80 minutes, such that most cells are in the second G1 phase by 90 min. In contrast, when wild-type cells were treated with CPT, G2/M exit was only partial at 120min (e.g., Figure 3B). These features differentiate CPT treatment from MMS treatment, which induces both types of checkpoints and lengthening the time that cells reach G2. We have highlighted this unique feature of CPT in checkpoint induction.

      What is "active Rad53"? If the authors mean they are using a phospho-specific Ab versus Rad53, they should explain this. It's impossible to know if total Rad53 is altered from Figure 3A. A blot with an antibody that detects both phosphorylated and nonphosphorylated Rad53 would help. 

      The F9 antibody used here detects phosphorylated Rad53 forms induced by Mec1 activation and does not detect unphosphorylated Rad53 (Fiorani et al, 2008). We changed “active Rad53” to “phosphorylated Rad53”. We used Pgk1 as a loading control to ensure equal loading, which help to quantify the relative amount of “active Rad53” in cells. This method has been used widely in the field.

      Also is there a doublet of Rad53 in the right two lanes and in WT? Rad53 often shows more than one slowmigrating species, so this isn't necessarily a surprise. Were both forms used in quantitation? 

      Both forms are used for quantification. 

      Figure 4A. Is there a di-SUMO form above the band marked Srs2-Su? Is this known? Is it counted? 

      Mono-sumoylated form of Srs2 is the most abundant form of sumoylated Srs2, though we detected a sumoylated Srs2 band that can represent its di-sumo form. We did quantify both forms in the plot.

      B. The dip at 1.5 h in Rad9-P is curious. It would be useful to know what % of Rad9 is phosphorylated in a repair-defective (rad52?) background with CPT treatment. And would such rad52 cells show a long arrest? 

      This dip is reproducible and may reflect that a population of cells escape G2/M delay at this timepoint.  

      Figure 5. It seems clear that the autophosphorylation site of Mec1, which was implicated in turning off a longdelayed G2/M arrest has no effect here, but presumably, a kinase-dead Mec1 (or deletion) does? The idea that a checkpoint is being regulated seems to come more from an assumption than from any direct data; as noted above, the only apparent delay in the cell cycle is the re-entry into S. There clearly is Rad53 and Rad9 phosphorylation so there are the attributes of a checkpoint.  If PI3KK phosphorylation is important, can this be accomplished by Tel1 as well as Mec1? 

      A mec1 helicase dead or null would not activate the checkpoint at the first place, therefore will not be useful to address whether Mec1 autophosphorylation is implicated in turning off checkpoint. A recent study from the Haber lab provided evidence that Mec1 autophosphorylation at S1964 helps to turn off the checkpoint in a DSB situation (Memisoglu et al, 2019). The role of Tel1 in checkpoint dampening will be interesting to examine in the future.  

      Figure 6. Two Rfa1 phospho-sites don't appear to be important, but do the known multiple phosphorylations of Rfa2 play a role?  

      Figure 6D examined three Rfa2 phosphorylation sites and found no genetic interaction with srs2∆.   

      Summary:  There are a lot of interesting data here, but they don't strongly support the author's model in the absence of a more direct way to monitor RPA binding and removal. This could be done using some sitespecific damage, but hard to do with CPT or MMS (which themselves don't appear to have the same effect).  The abstract suggests Srs2 is "temporally and spatially regulated to both allow timely checkpoint termination and to prevent superfluous RPA removal." But where is the checkpoint termination if there's no evident checkpoint? And "superfluous" is probably not the right word (= unnecessary); probably the authors intend "excessive"? As noted above, it also isn't clear if the displacement is of RPA or of Rad51, which normally replaces RPA and which is well-known to be itself displaced by Srs2. Again, if CPT is causing enough damage to kill orders of magnitudes of cells (are the plate and liquid concentrations comparable, we suddenly wonder) then why isn't there some stronger evidence for a cell cycle response to the DDC? 

      As described in the Public Review section, we have previously shown that a lack of Srs2-mediated checkpoint downregulation leads to a 4-6 fold increase of RPA on chromatin, which was rescued by rfa1-zm2 (Dhingra et al., 2021). On its own, rfa1-zm2 did not cause defective chromatin association in our assays, despite modestly reducing ssDNA binding in vitro (Dhingra et al., 2021). This discrepancy could be due to a lack of sensitivity of chromatin fractionation assay in revealing moderate changes of RPA residence on DNA. Considering this, we decided to employ functional assays (Figure 2-3) that are more effective in identifying the specific Srs2 features pertaining to RPA regulation. 

      We respectfully disagree with the reviewer’s point that there is “no evident checkpoint” in CPT.  Previous studies have shown that CPT induces the DNA damage checkpoint as evidenced by Mec1 activation and phosphorylation of Rad53 and Rad9, and delaying exit from G2/M (Dhingra et al., 2021; Menin et al., 2018; Redon et al., 2003). Our data are fully consistent with these reports. It is important to note that DNA damage checkpoint can manifest at a range of strengths depending on the genotoxic conditions and treatment, but the fundamental principles are the same. For example, we found that the Srs2-RPA antagonism not only affects the checkpoint downregulation in CPT, but also does so in MMS treatment and in a DSB system. We focused on CPT condition in this work, since CPT only induces the DNA damage checkpoint but not DNA replication checkpoint while MMS induces both. Further investigating the Srs2-RPA antagonism in a DSB system can be interesting to pursue in the future.  

      We believe that “superfluous removal” is appropriately used when discussing RPA regulation at genomic sites wherein it supports ssDNA protection and DNA repair, rather than DDC. Examples of these sites include R-loops and negatively supercoiled regions. These sites lack 3’ and 5’ DNA ends at the ss-dsDNA junctions for loading PCNA and the 9-1-1 checkpoint factors, and thus are not designated for checkpoint regulation.

      We addressed the reviewer’s point regarding Rad51 in the Public Review section. We disagree with reviewer’s view that “Rad51 normally replaces RPA”. RPA is involved in many more processes than Rad51 wherein it is not replaced by Rad51.  

      Regarding toxicity of CPT, our view is that it stems from a combination of checkpoint regulation and other processes that also involve the Srs2-RPA antagonism. While this work focused on the checkpoint aspect of this antagonism, future studies will be conducted to address the latter.

      One reference is entered as Lee Zhou and Stephen J. Elledge as opposed to "Zhou and Elledge."

      Corrected.  

      Reviewer #3 (Recommendations For The Authors): 

      (1) It would be nice to see the additional point mutants (srs2-Y775A, srs2-F891A) be tested, as they showed little to no phenotypes in the previously reported analyses, which did not specifically test the function surveyed here. 

      This point is addressed in the Public Reviews section.

      (2) Maybe the caveat of using deletion versus point mutations could be discussed. 

      This point is addressed in the Public Reviews section.

      (3) Please plot individual data points of the two independent experiments in Figures 4D and 5A so that the reader can evaluate reproducibility. N=2 does not really allow deriving SD.

      This point is addressed in the Public Reviews section and three individual data points are now included in both panels.

      (4) It will help the reader to have the exact strains used in each experiment listed in each figure legend.  Minor point.

      The strain table is now updated to address this point.

      (5) Page 7 middle paragraph: The reference to Figure 4A in line 11 should probably be Figure S3A. 

      Corrected.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The authors use the innovative CRISPRi method to uncover regulators of cell density and volume in neutrophils. The results show that cells require NHE activity during chemoattractant-driven cell migration. Before migration occurs, cells also undergo a rapid cell volume increase. These results indicate that water flux, driven by ion channels, appears to play a central role in neutrophil migration. The paper is very well written and clear. I suggest adding some discussion about the role of actin in the process, but this is not essential.

      Strengths

      The novel use of CRIPSPi to uncover cell density regulators is very novel. Some of the uncovered molecules were known before, e.g. discussed in Li & Sun, Frontiers in Cell and Developmental Biology, 2021. Others are more interesting, for example PI3K-gamma. The use of caged fMLP is also nice.

      We thank the reviewer for their positive appraisal of our work and have pursued their suggestions for improving our paper in this revision.

      Weaknesses

      One area of investigation that seems to be absent is mentioned in the introduction. I.e., actin is expected to play a role in regulating cell volume increase. Did the authors perform any experiments with LatA? What was seen there? Do cells still migrate with LatA, or is a different interplay seen? The role of PI3K is interesting, and maybe somewhat related to actin. But this may be a different line of inquiry for the future.

      We agree that we could have done a better job explicitly investigating the role of actin dynamics in volume changes. Towards this end, by using Latrunculin B to depolymerize actin, we find that the volume increase in suspension is not affected (Figure 1 – supplemental figure 2A). In our FxM single cell volume measurements of adherent cells, we similarly observed unhindered swelling following latrunculin treatment. These data indicate that actin is dispensable for chemoattractant-induced cell swelling (Figure 1 – supplemental figure 2B) . There was a minor apparent reduction in the final volume reached with the Latrunculin-treated cells as measured by FxM, but this likely reflects minor uptake of the excluded dye following Latrunculin treatment rather than an actual change in final volume. This conclusion is reinforced by the change in 2D footprint area being well modeled by the 2D projection of an isotropically expanding sphere (Figure 1 – supplemental figure 2C) . Latrunculin treatment completely abolishes migration, as is expected for unconfined migration on fibronectin (Figure 1 – supplemental figure 2D-E) . The second Reviewer also wanted us to dig deeper on the role of PI3K-gamma, so we expanded our analysis of this hit (Figure 3 – supplemental figure 1B-D; Figure 4 – supplemental figure 1D-G) .

      Author response image 1.

      Chemoattractant-induced swelling, but not motility, is independent of actin polymerization. (A) Human primary neutrophils were incubated with DMSO or Latrunculin B, activated with 20 nM fMLP, and then volume responses were measured using electronic sizing via a Coulter counter. Latrunculin treatment did not alter cell swelling, indicating that actin polymerization is dispensable for the chemoattractant-induced volume increase. (B) Similar results were obtained using the FxM assay, showing that Latrunculin-treated cells are capable of swelling after stimulation. (C) The Latrunculin-treated cells also increase their footprints, albeit less so than control cells, but this is within the range of what would be expected for this degree of chemoattractant-induced volume increase (modeled by a sphere expanding an equivalent volume). (D) Single cell tracks of primary human neutrophils responding to acute chemoattractant stimulation. Both panels show 15 minutes of tracks with the tracks prior (left) and the 15 minutes post (right) uncaging the chemoattractant. The scale bar is 50 microns. The top panels show the large increase in motility displayed by control cells, while the Latrunculin-treated cells (bottom panels) fail to move. (E) Latrunculin-treated cells consistently fail to move in response to chemoattractant-stimulation. (F) Representative single cell volume traces show that Latrunculin-treated cells (black) lack short-term volume fluctuations but persistently maintain an elevated volume following chemoattractant stimulation. Control cells (blue) exhibit short-term volume fluctuations. (G) The lack of short-term volume fluctuations following latrunculin treatment is borne out across the population, with the coefficient of variation in the volume for single cells (post-swelling) being dramatically lower in Latrunculin-treated cells, suggesting that these short term volume fluctuations depend on actin-based motility.

      Author response image 2.

      Additional validation of swelling screen hits. (A) Mixed WT and CRISPR KO dHL-60 populations post-stimulation show that CA2 (black) and PI3Ky (green) KO both fail to decrease their densities as much as the WT (cyan) population following chemoattractant stimulation. Cells with negative control guides (light gray) have normal volume responses. All tubes were fractionated and aligned on the fraction containing the median of the WT population. Negative values indicate a fraction with a higher density than WT. (B) To validate the perturbations to cell swelling observed with FxM, primary human neutrophils were stimulated in suspension, and their volumes were measured using a Coulter counter. 20 nM fMLP was added at the 0 minute mark. Shaded regions represent the 95% confidence intervals. (C) PI3Kγ inhibition blocks the chemoattractant-induced volume change in primary human neutrophils, as assayed by FxM. (D) PI3Kγ inhibition also blocked the chemoattractant-drive shape change in human primary neutrophils, as measured by the change in footprint area in FxM (E) The coefficient of variation in volume for control (cyan) and iNHE1 (gold) inhibited human primary neutrophils undergoing chemokinesis are comparable, suggesting that the volume fluctuations are unchanged in moving cells upon NHE1 and PI3Kγ inhibition despite the different baseline volumes.

      Author response image 3.

      Additional validation of motility phenotypes. (A-D) Single cell tracks of primary human neutrophils responding to acute chemoattractant stimulation. Both panels show tracks of cells 15 minutes prior (left) versus 15 minutes post (right) uncaging the chemoattractant. The scale bar is 50 microns. Color saturation indicates time with tracks progressing from gray to full color. (A) Control cells show a large increase in movement upon uncaging, (B) NHE1 inhibited cells also initiate movement but to a lesser degree, (C) hypo-osmotic shock rescues the NHE1 motility defect. (D) PI3Kγ leads to a large fraction of cells failing to initiate movement. (E) PI3Kγ inhibition showed near complete blockage of the chemoattractant-induced motility increase in primary human neutrophils. (F) Control neutrophils (blue) show an increased angular alignment upon stimulation as their motility becomes directional. NHE1-inhibition (gold, iNHE1) has very little effect on this process, while PI3Kγ inhibition (green) leads to a reduction in this alignment at the population level. (G) For the PI3Kγ inhibited cells that start migrating, the migration-induced volume fluctuations are comparable to iNHE1 and control cells. The top panel shows the track of a representative migrating PI3Kγ inhibited cell and the bottom panel, its corresponding volume normalized to the pre-stimulation volume. The scale bar is 50 microns.

      Reviewer #2 (Public Review):

      Nagy et al investigated the role of volume increase and swelling in neutrophils in response to the chemoattractant. Authors show that following chemoattractant response cells lose their volume slightly owing to the cell spreading phase and then have a relatively rapid increase in the cell volume that is concomitant with cell migration. The authors performed an impressive genome-wide CRISPR screen and buoyant density assay to identify the regulators of neutrophil swelling. This assay showed that stimulating cells with chemoattractant fMLP led to an increase in the cell volume that was abrogated with the FPR1 receptor knockout. The screen revealed a cascade that could potentially be involved in cell swelling including NHE1 (sodium-proton antiporter) and PI3K. NHE1 and PI3K are required for chemoattractant-induced swelling in human primary neutrophils. Authors also suggest slightly different functions of NHE1 and PI3K activity where PI3K is also required to maintain chemoattractant-induced cell shape changes. The authors convincingly show that chemoattractant-induced cell swelling is linked to cell migration and NHE1 is required for swelling at the later stages of swelling since the cells at the early point work on low-volume and low-velocity regime. Interestingly, the authors also show that lack of swelling in NHE1-inhibited cells could be rescued by mild hypo-osmotic swelling strengthening the argument that water influx followed chemoattractant stimulation is important for potentiation for migration.

      The conclusions of this paper are mostly well supported by data and are pretty convincing, but some aspects of image acquisition and data analysis need to be clarified and extended.

      We thank the reviewer for their positive appraisal of our work and pursued their suggestions for improving our paper in this revision.

      Weaknesses

      (1) It would really help if the authors could add the missing graph for the footprint area when cells are treated with Latranculin. Graph S1F for volume changes with Lat treatment should be compared with DMSO-treated controls.

      We agree that the Latrunculin condition merits more thorough investigation. To this end, we compared the volume response of human primary neutrophils to chemoattractant addition for Latrunculin B treated cells versus DMSO controls in suspension and show that there is no difference in swelling (Figure 1 – supplemental figure 2A) . This is additionally confirmed with FxM measurements with a slight undershooting of the final volume likely due to minor uptake of the excluded dye by Latrunculin treated cells (Figure 1 – supplemental figure 2B) . We have also included the requested footprint area changes in the Latrunculin treated cells as compared to controls (Figure 1 – supplemental figure 2C) . The treated cell footprints increase much less than the controls, and this is likely due to a lack of active cell spreading in the Latrunculin treated cells. The increase in footprint area observed following latrunculin treatment is within the range of what would be expected for the 2D projection of an isotropically expanding sphere fitted to the Latrunculin volume data (salmon line).

      Author response image 4.

      Chemoattractant-induced swelling, but not motility, is independent of actin polymerization. (A) Human primary eutrophils were incubated with DMSO or Latrunculin B, activated with 20 nM fMLP, and then volume responses were measured using electronic sizing via a Coulter counter. Latrunculin treatment did not alter cell swelling, indicating that actin polymerization is dispensable for the chemoattractant-induced volume increase. (B) Similar results were obtained using the FxM assay, showing that Latrunculin-treated cells are capable of swelling after stimulation. (C) The Latrunculin-treated cells also increase their footprints, albeit less so than control cells, but this is within the range of what would be expected for this degree of chemoattractant-induced volume increase (modeled by a sphere expanding an equivalent volume).

      (2) The authors show inhibition of NHE1 blocked cell swelling using Coulter counter, a similar experiment should be done with PI3K inhibitions especially since they see PI3K inhibition impact chemoattractant-induced cell shape change.

      Good idea. PI3Ky inhibition led to a substantial reduction in the chemoattractant-driven swelling in suspension showing the critical role of PI3K in the swelling of human primary neutrophils (Figure 3 – supplemental figure 1B) .

      Author response image 5.

      Additional validation of swelling screen hits. (B) To validate the perturbations to cell swelling observed with FxM, primary human neutrophils were stimulated in suspension, and their volumes were measured using a Coulter counter. 20 nM fMLP was added at the 0 minute mark. Shaded regions represent the 95% confidence intervals.

      (3) It would be more convincing visually if the authors could also include the movie of cell spreading (footprint) and then mobility with PI3K inhibition.

      Included as suggested. We agree this is a more compelling way to present the data (Figure 4 – supplemental figure 1A-D,G)

      Author response image 6.

      Additional validation of motility phenotypes. (A-D) Single cell tracks of primary human neutrophils responding to acute chemoattractant stimulation. Both panels show tracks of cells 15 minutes prior (left) versus 15 minutes post (right) uncaging the chemoattractant. The scale bar is 50 microns. Color saturation indicates time with tracks progressing from gray to full color. (A) Control cells show a large increase in movement upon uncaging. (D) PI3Kγ leads to a large fraction of cells failing to initiate movement. (E) PI3Kγ inhibition showed near complete blockage of the chemoattractant-induced motility increase in primary human neutrophils. (G) For the PI3Kγ inhibited cells that start migrating, the migration-induced volume fluctuations are comparable to iNHE1 and control cells. The top panel shows the track of a representative migrating PI3Kγ inhibited cell and the bottom panel, its corresponding volume normalized to the pre-stimulation volume. The scale bar is 50 microns.

      (4) It is not clear how cell spreading and later volume increase are linked to overall mobility of neutrophils. Are authors suggesting that cell spreading is not required for cell mobility in neutrophils?

      We did not mean to imply that cell spreading is not required for neutrophil motility. We take advantage of the fact that we can inhibit cell swelling without inhibiting spreading to investigate the specific role of swelling on migration ( Figure 4) . Conversely, cell spreading on a substrate is not required for chemoattractant-induced cell swelling, as chemoattractant-induced swelling occurs in latrunculin-treated cells (Figure 1 – supplemental figure 2A-C) . However, these latrunculin-treated cells are not able to migrate, at least not in the context studied here (Figure 1 – supplemental figure 2 D-E) . Cell spreading and swelling are likely both critical contributors to neutrophil motility, but their relative importance is dependent on the migratory context. The single cell volume fluctuation analysis indicates that migration-associated spreading and shape changes have large impacts on cell volume ( Figure 1 F) . These fluctuations are asynchronous, obscuring their observation at the population level, but the single cell traces clearly demonstrate them and their correlation with movement.

      ( 5) Volume fluctuations associated with motility were impacted by NHE1 inhibition at the baselines, what about PI3K inhibitions? Does that impact the actual fluctuations?

      PI3K inhibition causes a significant fraction of cells to stop migrating (Figure 4 – supplemental figure 1D) , but among those that do move, they are still able to fluctuate in volume (Figure 4 – supplemental figure 1G) .

      Author response image 7.

      Additional validation of motility phenotypes. (G) For the PI3Kγ inhibited cells that start migrating, the migration-induced volume fluctuations are comparable to iNHE1 and control cells. The top panel shows the track of a representative migrating PI3Kγ inhibited cell and the bottom panel, its corresponding volume normalized to the pre-stimulation volume. The scale bar is 50 microns.

      In contrast, latrunculin abolishes the volume fluctuations that normally accompany migration (Figure 1 – supplemental figure 2F-G) . These data suggest that movement/spreading itself is the driver of the rapid volume fluctuations. In contrast, the sustained volume increase following chemoattractant stimulation is independent of shape change and still occurs in latrunculin-treated cells.

      Author response image 8.

      Chemoattractant-induced swelling, but not motility, is independent of actin polymerization. (F) Representative single cell volume traces show that Latrunculin-treated cells (black) lack short-term volume fluctuations but persistently maintain an elevated volume following chemoattractant stimulation. Control cells (blue) exhibit short-term volume fluctuations. (G) The lack of short-term volume fluctuations following latrunculin treatment is borne out across the population, with the coefficient of variation in the volume for single cells (post-swelling) being dramatically lower in Latrunculin-treated cells, suggesting that these short term volume fluctuations depend on actin-based motility.

      (6) It would really help if the authors compared similar analyses and drew conclusions from that, for example, it is unclear what the authors mean by they found no change in the angular persistence of WT and NHE1 inhibited cells which is in contrast to PI3K inhibition since they do not really have an analysis for angular persistence in PI3K inhibited cells. (S4A and S4B).

      Thanks for catching this oversight in these experiments that we previously performed but neglected to include in the initial submission. We now include plots for angular persistence, velocity, and footprint size for the PI3K-gamma-inhibited cells. The results show that PI3K-gamma inhibition interferes both with swelling (Figure 3 – supplemental figure 1B-D) and motility (Figure 4 – supplemental figure 1D-F) , which aligns with its role upstream of the other hits identified in our screen.

      Author response image 9.

      Additional validation of motility phenotypes. (A-D) Single cell tracks of primary human neutrophils responding to acute chemoattractant stimulation. Both panels show tracks of cells 15 minutes prior (left) versus 15 minutes post (right) uncaging the chemoattractant. The scale bar is 50 microns. Color saturation indicates time with tracks progressing from gray to full color. (A) Control cells show a large increase in movement upon uncaging, (B) NHE1 inhibited cells also initiate movement but to a lesser degree, (C) hypo-osmotic shock rescues the NHE1 motility defect. (D) PI3Kγ leads to a large fraction of cells failing to initiate movement. (E) PI3Kγ inhibition showed near complete blockage of the chemoattractant-induced motility increase in primary human neutrophils. (F) Control neutrophils (blue) show an increased angular alignment upon stimulation as their motility becomes directional. NHE1-inhibition (gold, iNHE1) has very little effect on this process, while PI3Kγ inhibition (green) leads to a reduction in this alignment at the population level. (G) For the PI3Kγ inhibited cells that start migrating, the migration-induced volume fluctuations are comparable to iNHE1 and control cells. The top panel shows the track of a representative migrating PI3Kγ inhibited cell and the bottom panel, its corresponding volume normalized to the pre-stimulation volume. The scale bar is 50 microns.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Weaknesses to be addressed: 

      (1) More detail is required to understand the effects of genetic and drug manipulations on heart rate as these are important experiments. At the very least, a discussion on the limitations of these manipulations is needed. 

      - For example, how does one separate the pulsatile versus nutritive effects of blood flow/heartrate reduction? 

      - The conclusion that arterial SMC differentiation is driven by pulsatile blood flow needs to be toned down. Indeed, this conclusion is mainly supported by in vitro cell co-cultures exposed to laminar versus pulsatile flow. In vivo, reducing Tnnt2a expression affects cardiac contractility and blood flow does not selectively affect pulsatility. To make this conclusion, the authors would need an experimental means to selectively dampen the pulsatility of blood flow.

      We understand this concern and we toned down the statements related to the pulsatile flow of our conclusion by using 'flow' instead of 'pulsatile flow' in all text except for the in vitro co-cultures part. We also added a paragraph to discuss the limited capability of qualitatively reduce blood flow in vivo, and acknowledge that the effects of nutrients and flow reduction could not be uncoupled in live zebrafish embryos. We proposed that in the future, in vitro 3D vascular culture models may be combined with microfluidics to precisely calibrate nutrient composition in culture media, flow velocity and pulse; these methods would help address these questions more thoroughly. See page 11-12 line 312-322.

      (2) Since mural cells are sensitive to transmural pressure, could the authors elaborate on the potential role of raised intravascular pressure in SMC differentiation? This would better parallel rodents and humans. 

      We thank you for this suggestion. We added a paragraph to discuss the potential role of raised intravascular pressure in VSMC differentiation in the discussion section (see page 11 line 296-311).

      (3) The authors use nifedipine to reduce blood flow. Nifedipine is a specific and potent inhibitor of voltage-dependent calcium channels (VDCC) which are expressed in SMCs. Prior studies (PMID: 35588738) showed that VDCC blockers increased rather than inhibited SMC differentiation. Nifedipine is also likely to act upon VSMC calcium handling in the circle of Willis, which may in turn affect cell maturation. Could the authors comment on this seeming discrepancy?

      It is possible that off-target or indirect effects of Nifedipine decrease smooth muscle cell proliferation, or that altered cardiac contractility fundamentally alters aspects of vascular development other than blood flow. 

      - Additionally, it would be helpful to report the quantitative heart rate reduction achieved with Nifedipine. This would clear up concerns that the heart rate reduction is too large for normal vascular development to occur, and thus decrease proliferation rate independent of changes in blood flow pulsatility. 

      We concur with these comments, which is why our experimentation with Nifedipine is reinforced by employing an alternative, non-pharmacological strategy to inhibit blood flow: the use of morpholino against tnnt2a gene. The results with either Nifedipine or tnnt2a support the lack of VSMCs maturation. In addition, we provided the quantitative heart rate reduction achieved with Nifedipine shown in new Figure S2A-S2C, suggesting that the drug is not completely halting the heart rate but decreasing it. Nevertheless, we report that Zebrafish embryos can survive and develop a normal blood vascular system without any heartbeat. Hence, we exclude that the effect on VSMCs maturation is linked non-specifical effects caused by the loss of heartbeat. Nevertheless, we now acknowledged in our discussion the limitation of nifedipine, as it may affect VSMC through VDCCs (page 12, line 323-334).

      We also added a paragraph in the discussion section to compare nifedipine, an L-type VDCC blocker, and ML218, a T-type VDCC selective inhibitor from the previous study (Ando et al., 2022). We noted that in this previous study, the increase in VSMC differentiation only occur on anterior metencephalic central arteries (AMCtAs) that are more than 40 mm away from the BCA; these AMCtAs are much smaller than CoW arteries and have different geometry hence possible different kinetics of VSMC maturation (Ando et al., 2022) as our manuscript discovery would suggest.

      (4) The authors should provide more information on how blood flow velocity and wall shear stress are calculated from the Circle of Willis vascular structure. It is presumed that these values are dependent upon the 3-D morphology of the vessel network, as labeled by intravenous dextran dye, but this is not clear. (a second reviewer similarly comments: I was unclear how flow velocity values were obtained in Fig. 3E. Are they based on computational simulation, or are they experimentally calculated following the dextran injection?) Small local differences in vessel diameter and shape will influence blood flow velocity, but these morphological changes are not clearly articulated. Further, it is unclear how flow input levels to the CaDI and basilar arteries are decided across time points. For instance, is it possible to measure the blood flow speed empirically with line-scanning or high-speed tracking of labeled blood cells or particles? This would provide validation of the modeling results. 

      The computational fluid dynamic simulation was performed according to previous study from our lab (Barak et al., 2021). Blood flow velocity and wall shear stress are dependent upon the 3D morphology of the vessel network labeled by intravascular dextran. Details on how the computational fluid dynamic simulation was performed are added in method section page 17 line 433-449.

      Moreover, to address this reviewer concern we have now provided new experimental measurement of blood flow using the red blood cell (RBC) velocity via axial line scanning microscopy in Tg(kdrl:gfp;gata1:DsRed)zn1/sd2 zebrafish embryos at 54 hpf, 3 dpf, and 4 dpf. By using the experimental RBC velocity, we re-simulated the computational fluid dynamic. The new findings align with our conclusion and are further elaborated upon in response to this reviewer comment listed as point 6. Details on how RBC velocity calculated is added in method section page 16 line 414-431.

      (5) Does the cardiac injection of dextran itself affect the diameter of the arteries, given the invasiveness of the procedure? This could be examined in fish with a transgenic endothelial label with and without dextran. 

      Here, we performed an experiment on wildtype zebrafish at 5 days post-fertilization (dpf) with and without Dextran injection, examining the effects of Dextran injection on vessel diameters. As shown in the representative image below, the XZ panel clearly illustrates a Dextran-filled PCS vessel with no alteration in vessel size. Dextran microangiography, a technique employed to obtain vessel geometry with fluorescent microsphere, has been well established in zebrafish (Kamei et al., 2010). Our findings, demonstrating that Dextran does not affect vessel size, are consistent with previous studies utilizing Dextran microangiography.

      Author response image 1.

      (6) The data from the microangiography experiment in Figure 3 does not fully support the stated results. The authors report that the CaDI had the highest blood flow speed starting from 54 hpf, but it does not appear to be higher than the other arteries at this time point. Additionally, there is not sufficient evidence that wall shear stress coincides with smooth muscle cell differentiation in the CaDI. Wall shear stress appears to be similar between 54 hpf and 3 dpf in the CaDI, only increasing between 3 dpf and 4 dpf, while differentiation is shown to begin at 3 dpf. The authors need to address this and/or soften conclusions. 

      First, In response to this specific reviewer concern, we measured red blood cell (RBC) velocity by used axial line scanning microscopy to analyze Tg(kdrl:gfp;gata1:DsRed)zn1/sd2 zebrafish embryos (the detailed method was added in Method section in the manuscript). We replaced the computational simulated blood flow velocity by RBC velocity in new Figure 3E-3G, and re-run the computational simulated wall shear stress (WSS) using the RBC velocity in new Figure 3I-3K. We compared RBC velocity and WSS among different vessels at each time point. We confirmed that CaDI has the highest RBC velocity starting from 54 hpf to 4 dpf (new Figure 3A-3C, and 3E-3G) and found an overall increase in average WSS from 54 hpf to 4 dpf (new Figure 3A-3C, and 3H). Further, WSS in CaDI was significantly higher than BCA and PCS at 54 hpf, 3 dpf, and 4 dpf (new Figure 3A-3C, 3I-3K). Altogether, the CFD simulation suggests that CoW arteries experience different hemodynamic WSS that is associated with spatiotemporal pattern of VSMC differentiation on CoW arteries.”.  (Page 6, line 153-162)

      Second, to identify the correlation of WSS and VSMC differentiation in CaDI, we performed Pearson correlation analysis. In the image provided here, we plotted a linear regression with normalized # of acta2+ cells in CaDI and WSS with developmental stages (54 hpf, 3 and 4 dpf), and performed Pearson correlation coefficient analysis by using GraphPad Prism 10.0.3. The correlation coefficient r = 0.595, suggesting that the two variables (acta2+ cells and WSS) tend to increase together with developmental stages (54 hpf, 3 and 4 dpf).

      Author response image 2.

      Third, we softened our conclusion as the RBC velocity across CoW arteries was differentially distributed while VSMC differentiation occurred in these vessels.

      (7) It is unclear if acta2 expression is conferring vascular tone, as would be expected if the cells are behaving as mature VSMCs. Does arterial diameter decrease with an increase in acta2 expression? Are acta2-positive mural cells associated with more dynamic changes in arteriole diameter under basal or stimulated conditions? 

      Thanks for this interesting question. VSMC maturation and its vasoactivity could be further investigated in the future. Our study focused on early stage of VSMC differentiation, in which pdgfrb+ progenitors started to express VSMC marker acta2. We discussed the onset of transgelin expression and loss of abcc9 expression as markers of VSMC maturation. In addition, a previous study found that VSMC covered vessels in zebrafish brain dilate as early as 4 dpf and constrict at 6 dpf (Bahrami & Childs, 2020). Future study may focus on the association between expression of different VSMC markers and VSMC functional maturation. (page 10, line 272-279)

      (8) The authors argue that CoW vessels transition from venous to arterial identity (Fig. 1). However, kdrl is not an ideal arterial marker for this experiment as it is expressed in both arteries and veins. While it is true that many arterial beds have stronger kdrl expression than the veins, its expression in both arteries and veins changes with developmental stage, and its expression level may vary depending on the type of vessel. Therefore, showing that kdrl increases from 32 hpf - 4 dpf in CoW vessels is not convincing because its expression may increase in both venous or arterial vasculature as the vessels mature. In addition, flt4 expression is not exclusively venous; for example, it has noticeable expression in the dorsal aorta at 24-32 hpf stages. It would be helpful to confirm this transition by analyzing additional arterial and venous markers. 

      We acknowledge this and we added a paragraph to discuss the limitation. We combined loss of flt4 and increase in kdrl to establish the temporal sequence of circle of Willis morphogenesis, arterial specification, and VSMC differentiation. We acknowledge that additional arterial and venous markers need to be analyzed for a more thorough characterization of arterial specification in vertebrate brain vascular development. See page 12 line 335-341.

      (9) The authors show that acta2+ VSMCs are absent in tnnt2a MO embryos, concluding that blood flow is required for their differentiation from pericytes. However, there is no data showing that pericytes are still present in tnnt2a MO embryos. Although this has been previously shown by Ando et al 2016, it would be beneficial to confirm in the current study as this is a critical piece of evidence needed for this conclusion. 

      To determine if blood flow is dispensable for pdgfrb+ progenitor recruitment, we performed tnnt2a MO (0.35 ng/embryo) injection in Tg(pdgrb:egfp, kdrl:ras-mcherry) ncv22/s896. Loss of blood flow did not affect pdgfrb+ progenitor emergence around the CoW (new Figure S2G-S2H) at 3 days post fertilization (dpf). This is consistent with previous observation in Ando et al 2016 Figure S2C (Ando et al., 2016).

      (10) The authors show that klf2a MO injected embryos have a reduced number of VSMCs at 3 dpf but a normal number at 4 dpf (Fig. 6), concluding that klf2a is only important to initiate CaDI muscularization. If this is true, it would raise important questions about how VSMCs differentiate at a later stage in the absence of klf2a. For instance, is blood flow not required to differentiate at a later stage, or is there another factor that compensates in the absence of klf2a? The alternative explanation/ caveat is that klf2a MO loses efficacy with development, leading to the recovery of VSMCs at this stage. Therefore, it would be important to confirm this result using a genetic klf2a mutant. 

      Thank you for pointing this out.  We note that based on the klf2a reporter line, klf2a activity in CoW arterial endothelial cells is highly correlated with the number of acta2+ VSMCs in CaDI, BCA and PCS at 3 dpf (r = 0.974, new Figure S5J). Interestingly however, klf2a activity remained stable from 3 dpf to 4 dpf, well beyond initiation of VSMC differentiation. Thus, we speculate sustained klf2a expression may support further maturation of VSMCs, as acta2+ VSMCs showed distinct morphology at 4 dpf compared with 3 dpf. (Page 10, line 268-272). As for the observation that klf2a morphants have normal number of VSMCs at 4 dpf, we think that in addition to the temporary effect of morpholino, a proximal explanation is compensation by paralogous klf2b in zebrafish. We acknowledge that further characterization of CoW VSMC development in klf2a and klf2b double genetic mutants (Rasouli et al., 2018; Steed et al., 2016) may help determine whether klf2b compensates klf2a in CoW VSMC differentiation beyond 4 dpf. See page 10-11 line 292-295.

      (11) A large part of the discussion focuses on Notch and Wnt signaling, as downstream Klf2 effectors. While these are reasonable hypotheses to propose, there is no data on the involvement of these pathways in the current study. It seems excessive to speculate on detailed mechanisms of how Klf2 activates Notch and Wnt signaling in the absence of data showing that these pathways are affected in CoW vessels. Therefore, the discussion could be shortened here unless additional data can be obtained to demonstrate the involvement of these pathways in VSMCs in CoW.

      We concur and have condensed the discussion on Notch and Wnt signaling as downstream klf2 effectors.

      Minor comments: 

      (1) Line 138 "CaDI is the only vessels in the CoW receiving pulsatile arterial blood low ... ". Adding a reference to support this statement would be useful. 

      We agree and revised this sentence into ‘CaDI receive proximal arterial feed through lateral dorsal aorta from cardiac outflow tract (Isogai et al., 2001)’. It was also based on our general observation of zebrafish vascular anatomy and blood flow under a confocal microscope.

      (2) The image insets in Figs. 1A, 2A, 4E-L, 5A, 6A are quite small. Please make them larger to help the reader interpret the findings. 

      We agree. We maximized the image size to help the reader interpret the finding, and to visualize confocal images and schematics side-by-side.

      (3) The schematics in Figs. 1-2, and 4-6 are helpful, but the different cell types are difficult to see because they are small and their colors/shapes are not very distinct. 

      We agree. We increased the size and color contrast to provide better visualization of the schematics in new schematic Figures. 1-2 and 4-6.

      (4) It is stated that there are no diameter differences between different arteries, but statistics are not reported. 

      The statistics in Figure 3D were performed by ordinary two-way ANOVA followed by Tukey’s multiple comparisons test, with a single pooled variance. Here we added pairwise comparisons among vessels in the CoW. Hence when non indicated the difference are non-significant.

      (5) Figure 3F would be better visualized on a log scale, as it is difficult to see the differences between each post-fertilization timepoint. 

      We agree. In the new Figure 3H, the average wall shear stress (WSS) in CoW arteries is presented on log scale in y axis to see the differences between each post-fertilization timepoint.

      (6) Please provide more background and validation on the pericyte cell line, and their use for the questions in this study. 

      Thank you for the question, TgBAC(pdgfrb:egfp)ncv22 was generated and described by Ando et al 2016 to clarify mural cell coverage of vascular endothelium in zebrafish (Ando et al., 2016). We added a describe in the method section to provide background and validation on this pericyte line (see page 13 line 368-372).

      (7) Flow velocity and WSS changes are shown in each vessel in Figs. 3E,G. However, the comparison should be made between different types of vessels to see if there is a statistical difference and PCS, for example, which would explain differences in VSMC coverage. 

      We agreed. We compared the difference among arteries in the CoW at each developmental timepoint and performed ordinary one-way ANOVA with Tukey’s multiple comparisons test. Figure. 3E is replaced by new Figure. 3E-G and Figure. 3G is replaced by new Figure. 3I-K.

      (8) Similarly, between CaDI, the number of klf2a cells in Fig. 5B should be compared between different vessels, not between different stages of the same vessel. 

      We agree. In new Figure 5B-E, the number of klf2a+ cells per 100 μm vessel length are compared among different vessels at each developmental stage and analyzed by ordinary one-way ANOVA with Tukey’s multiple comparisons test.

      (9) When quantifying klf2+ cells in Fig. 5, it would be helpful to quantify klf2 expression level between cells in different vessels. This could be done by quantifying GFP expression in existing images. The difference in expression level may explain the variation between CaDI and PCS more accurately than just the difference in cell number. 

      The GFP expression reflect the stability of GFP protein expression and labels discrete nuclei with active klf2a expression. Hence the quantification of GFP level might not give an accurate readout of klf2a expression per se but rather of its activity. For this reason we don’t think that this experiment will add accurate measurement of klf2a expression.

      (10) Do data points in Figure 4D correspond to different cells in the same chamber experiment? If so, they cannot be treated as independent replicates. Each data point should correspond to an independent replicate experiment. 

      We agree. Now in the figure legend, we report the number of cells analyzed.

      (11) Graph placement is confusing in Figs. 4I, M. An adjacent Fig. 4G shows Nifedipine treated embryos, while the graph next to (Fig. 4I) shows acta+ cell number from tnnt2a 4 dpf experiment. Similarly, the bottom Fig. 4K tnn2a 4 dpf MO experiment has an adjacent graph Fig. 4M, which shows nifedipine treatment quantification, which makes it very confusing. 

      We agreed. We rearranged Figure 4E (representative images of control embryos at 3 dpf and 4 dpf), Figure 4F (tnnt2a MO embryos at 3 dpf and 4 dpf), Figure 4G (nifedipine treated embryos at 3 dpf and 4 dpf).

      Reference:

      Ando, K., Fukuhara, S., Izumi, N., Nakajima, H., Fukui, H., Kelsh, R. N., & Mochizuki, N. (2016). Clarification of mural cell coverage of vascular endothelial cells by live imaging of zebrafish. Development, 143(8), 1328-1339. https://doi.org/10.1242/dev.132654

      Ando, K., Tong, L., Peng, D., Vazquez-Liebanas, E., Chiyoda, H., He, L., Liu, J., Kawakami, K., Mochizuki, N., Fukuhara, S., Grutzendler, J., & Betsholtz, C. (2022). KCNJ8/ABCC9-containing K-ATP channel modulates brain vascular smooth muscle development and neurovascular coupling. Dev Cell, 57(11), 1383-1399 e1387. https://doi.org/10.1016/j.devcel.2022.04.019

      Bahrami, N., & Childs, S. J. (2020). Development of vascular regulation in the zebrafish embryo. Development, 147(10). https://doi.org/10.1242/dev.183061

      Barak, T., Ristori, E., Ercan-Sencicek, A. G., Miyagishima, D. F., Nelson-Williams, C., Dong, W., Jin, S. C., Prendergast, A., Armero, W., Henegariu, O., Erson-Omay, E. Z., Harmanci, A. S., Guy, M., Gultekin, B., Kilic, D., Rai, D. K., Goc, N., Aguilera, S. M., Gulez, B., . . . Gunel, M. (2021). PPIL4 is essential for brain angiogenesis and implicated in intracranial aneurysms in humans. Nat Med, 27(12), 2165-2175. https://doi.org/10.1038/s41591-021-01572-7

      Isogai, S., Horiguchi, M., & Weinstein, B. M. (2001). The vascular anatomy of the developing zebrafish: an atlas of embryonic and early larval development. Dev Biol, 230(2), 278-301. https://doi.org/10.1006/dbio.2000.9995

      Kamei, M., Isogai, S., Pan, W., & Weinstein, B. M. (2010). Imaging blood vessels in the zebrafish. In Methods in cell biology (Vol. 100, pp. 27-54). Elsevier.

      Rasouli, S. J., El-Brolosy, M., Tsedeke, A. T., Bensimon-Brito, A., Ghanbari, P., Maischein, H. M., Kuenne, C., & Stainier, D. Y. (2018). The flow responsive transcription factor Klf2 is required for myocardial wall integrity by modulating Fgf signaling. Elife, 7. https://doi.org/10.7554/eLife.38889

      Steed, E., Faggianelli, N., Roth, S., Ramspacher, C., Concordet, J. P., & Vermot, J. (2016). klf2a couples mechanotransduction and zebrafish valve morphogenesis through fibronectin synthesis. Nat Commun, 7, 11646. https://doi.org/10.1038/ncomms11646

    1. Author response:

      The following is the authors’ response to the original reviews.

      New Experiments

      (1) Activation-dependent dynamics of PKA with the RIα regulatory subunit, adding to the answer to Reviewers 1 and 2. To determine the dynamics of all PKA isoforms, we have added experiments that used PKA-RIα as the regulatory subunit. We found differential translocation between PKA-C (co-expressed with PKA-RIα) and PKA-RIα (Figure 1–figure supplement 3), similar to the results when PKA-RIIα or PKA-RIβ was used.

      (2) PKA-C dynamics elicited by a low concentration of norepinephrine, addressing Reviewer 3’s comment. We have found that PKA-C (co-expressed with RIIα) exhibited similar translocation into dendritic spines in the presence of a 5x lowered concentration (2 μM) of norepinephrine, suggesting that the translocation occurs over a wide range of stimulus strengths (Figure 1-figure supplement 2).

      Reviewer #1 (Public Review):

      Summary:

      This is a short self-contained study with a straightforward and interesting message. The paper focuses on settling whether PKA activation requires dissociation of the catalytic and regulatory subunits. This debate has been ongoing for ~ 30 years, with renewed interest in the question following a publication in Science, 2017 (Smith et al.). Here, Xiong et al demonstrate that fusing the R and C subunits together (in the same way as Smith et al) prevents the proper function of PKA in neurons. This provides further support for the dissociative activation model - it is imperative that researchers have clarity on this topic since it is so fundamental to building accurate models of localised cAMP signalling in all cell types. Furthermore, their experiments highlight that C subunit dissociation into spines is essential for structural LTP, which is an interesting finding in itself. They also show that preventing C subunit dissociation reduces basal AMPA receptor currents to the same extent as knocking down the C subunit. Overall, the paper will interest both cAMP researchers and scientists interested in fundamental mechanisms of synaptic regulation.

      Strengths:

      The experiments are technically challenging and well executed. Good use of control conditions e.g untransfected controls in Figure 4.

      We thank the reviewer for their accurate summarization of the position of the study in the field and for the positive evaluation of our study.

      Weaknesses:

      The novelty is lessened given the same team has shown dissociation of the C subunit into dendritic spines from RIIbeta subunits localised to dendritic shafts before (Tillo et al., 2017). Nevertheless, the experiments with RII-C fusion proteins are novel and an important addition.

      We thank the reviewer for noticing our earlier work. The first part of the current work is indeed an extension of previous work, as we have articulated in the manuscript. However, this extension is important because recent studies suggested that the majority of PKA-RIIβ are axonal localized. The primary PKA subtypes in the soma and dendrite are likely PKA-RIβ or PKA-RIIα. Although it is conceivable that the results from PKA-RIIβ can be extended to the other subunits, given the current debate in the field regarding PKA dissociation (or not), it remains important to conclusively demonstrate that these other regulatory subunit types also support PKA dissociation within intact cells in response to a physiological stimulant. To complete the survey for all PKA-R isoforms, we have now added data for PKA-RIα (New Experiment #1), as they are also expressed in the brain (e.g., https://www.ncbi.nlm.nih.gov/gene/5573). Additionally, as the reviewer points out, our second part is a novel addition to the literature.

      Reviewer #2 (Public Review):

      Summary:

      PKA is a major signaling protein that has been long studied and is vital for synaptic plasticity. Here, the authors examine the mechanism of PKA activity and specifically focus on addressing the question of PKA dissociation as a major mode of its activation in dendritic spines. This would potentially allow us to determine the precise mechanisms of PKA activation and address how it maintains spatial and temporal signaling specificity.

      Strengths:

      The results convincingly show that PKA activity is governed by the subcellular localization in dendrites and spines and is mediated via subunit dissociation. The authors make use of organotypic hippocampal slice cultures, where they use pharmacology, glutamate uncaging, and electrophysiological recordings.

      Overall, the experiments and data presented are well executed. The experiments all show that at least in the case of synaptic activity, the distribution of PKA-C to dendritic spines is necessary and sufficient for PKA-mediated functional and structural plasticity.

      The authors were able to persuasively support their claim that PKA subunit dissociation is necessary for its function and localization in dendritic spines. This conclusion is important to better understand the mechanisms of PKA activity and its role in synaptic plasticity.

      We thank the reviewer for their positive evaluation of our study.

      Weaknesses:

      While the experiments are indeed convincing and well executed, the data presented is similar to previously published work from the Zhong lab (Tillo et al., 2017, Zhong et al 2009). This reduces the novelty of the findings in terms of re-distribution of PKA subunits, which was already established. A few alternative approaches for addressing this question: targeting localization of endogenous PKA, addressing its synaptic distribution, or even impairing within intact neuronal circuits, would highly strengthen their findings. This would allow us to further substantiate the synaptic localization and re-distribution mechanism of PKA as a critical regulator of synaptic structure, function, and plasticity.

      We thank the reviewer for noticing our earlier work. The first part of the current work is indeed an extension of previous work, as we have articulated in the manuscript. However, this extension is important because recent studies suggested that the majority of PKA-RIIβ are axonal localized. The primary PKA subtypes in the soma and dendrite are likely PKA-RIβ or PKA-RIIα. Although it is conceivable that the results from PKA-RIIβ can be extended to the other subunits, given the current debate in the field regarding PKA dissociation (or not), it remains important to conclusively demonstrate that these other regulatory subunit types also support PKA dissociation within intact cells in response to a physiological stimulant. To complete the survey for all PKA-R isoforms, we have now added data for PKA-RIα (New Experiment #1), as they are also expressed in the brain (e.g., https://www.ncbi.nlm.nih.gov/gene/5573). Additionally, as Reviewer 1 points out, our second part is a novel addition to the literature.

      We also thank the reviewer for suggesting the experiments to examine PKA’s synaptic localization and dynamics as a key mechanism underlying synaptic structure and function. We agree that this is a very interesting topic. At the same time, we feel that this mechanistic direction is open ended at this time and beyond what we try to conclude within this manuscript: prevention of PKA dissociation in neurons affects synaptic function. Therefore, we will save the suggested direction for future studies. We hope the reviewer understand.

      Reviewer #3 (Public Review):

      Summary:

      Xiong et al. investigated the debated mechanism of PKA activation using hippocampal CA1 neurons under pharmacological and synaptic stimulations. Examining the two PKA major isoforms in these neurons, they found that a portion of PKA-C dissociates from PKA-R and translocates into dendritic spines following norepinephrine bath application. Additionally, their use of a non-dissociable form of PKC demonstrates its essential role in structural long-term potentiation (LTP) induced by two-photon glutamate uncaging, as well as in maintaining normal synaptic transmission, as verified by electrophysiology. This study presents a valuable finding on the activation-dependent re-distribution of PKA catalytic subunits in CA1 neurons, a process vital for synaptic functionality. The robust evidence provided by the authors makes this work particularly relevant for biologists seeking to understand PKA activation and its downstream effects essential for synaptic plasticity.

      Strengths:

      The study is methodologically robust, particularly in the application of two-photon imaging and electrophysiology. The experiments are well-designed with effective controls and a comprehensive analysis. The credibility of the data is further enhanced by the research team's previous works in related experiments. The conclusions of this paper are mostly well supported by data. The research fills a significant gap in our understanding of PKA activation mechanisms in synaptic functioning, presenting valuable insights backed by empirical evidence.

      We thank the reviewer for their positive evaluation of our study.

      Weaknesses:

      The physiological relevance of the findings regarding PKA dissociation is somewhat weakened by the use of norepinephrine (10 µM) in bath applications, which might not accurately reflect physiological conditions. Furthermore, the study does not address the impact of glutamate uncaging, a well-characterized physiologically relevant stimulation, on the redistribution of PKA catalytic subunits, leaving some questions unanswered.

      We agreed with the Reviewer that testing under physiological conditions is critical especially given the current debate in the literature. That is why we tested PKA dynamics induced by the physiological stimulant, norepinephrine. It has been suggested that, near the release site, local norepinephrine concentrations can be as high as tens of micromolar (Courtney and Ford, 2014). Based on this study, we have chosen a mid-range concentration (10 μM). At the same time, in light of the Reviewer’s suggestion, we have now also tested PKA-RIIα dissociation at a 5x lower concentration of norepinephrine (2 μM; New Experiment #2). The activation and translocation of PKA-C is also readily detectible under this condition to a degree comparable to when 10 μM norepinephrine was used.

      Regarding the suggested glutamate uncaging experiment, it is extremely challenging because of finite signal-to-noise ratios in our experiments. From our past studies, we know that activated PKA-C can diffuse three dimensionally, with a fraction as membrane-associated proteins and the other as cytosolic proteins. Although we have evidence that its membrane affinity allows it to become enriched in dendritic spines, it is not known (and is unlikely) that activated PKA-C is selectively targeted to a particular spine. Glutamate uncaging of a single spine presumably would locally activate a small number of PKA-C. It will be very difficult to trace the 3D diffusion of these small number of molecules in the presence of surrounding resting-state PKA-C molecules. Finally, we hope the reviewer agrees that, regardless of the result of the glutamate uncaging experiment, the above new experiment (New Experiment #2) already indicate that certain physiologically relevant stimuli can drive PKA-C dissociation from PKA-R and translocation to spines, supporting our conclusion.

      Reviewer #2 (Recommendations For The Authors):

      It was a pleasure reading your paper, and the results are well-executed and well-presented.

      My main and only recommendations are two ways to further expand the scope of the findings.

      First, I believe addressing the endogenous localization of PKA-C subunit before and after PKA activation would be highly important to validate these claims. Overexpression of tagged proteins often shows vastly different subcellular distribution than their endogenous counterparts. Recent technological advances with CRISPR/Cas9 gene editing (Suzuki et al Nature 2016 and Gao et al Neuron 2019 for example) which the Zhong lab recently contributed to (Zhong et al 2021 eLife) allow us to tag endogenous proteins and image them in fixed or live neurons. Any experiments targeting endogenous PKA subunits that support dissociation and synaptic localization following activation would be very informative and greatly increase the novelty and impact of their findings.

      We agreed that addressing the endogenous PKA dynamics is important. However, despite recent progress, endogenous labeling using CRISPR-based methods remains challenging and requires extensive optimization. This is especially true for signaling proteins whose endogenous abundance is often low. We have tried to label PKA catalytic subunits and regulatory subunits using both the homologous recombination-based method SLENDR and our own non-homologous end joining-based method CRISPIE. We did not succeed, in part because it is very difficult to see any signal under wide-field fluorescence conditions, which makes it difficult to screen different constructs for optimizing parameters. It is also possible that, at the endogenous abundance, the label is just not bright enough to be seen. Nevertheless, for both PKA type Iβ and type IIα that we studied in this manuscript, we have correlated the measured parameters (specifically, Spine Enrichment Index or SEI) with the overexpression level (Figure 1-figure supplement 1). We found that they are not strongly correlated with the expression level under our conditions. By extrapolating to non-overexpression conditions, our conclusion remains valid.

      To overcome the inability to label endogenous PKA subunits using CRISPR-based methods, we have also attempted a conditional knock-in method call ENABLED that we previously developed to label PKA-Cα. In preliminary results, we found that endogenously label PKA were very dim. However, in a subset of cells that are bright enough to be quantified, the PKA catalytic subunit indeed translocated to dendritic spines upon stimulation (see Additional Fig. 1 in the next page), corroborating our results using overexpression. These results, however, are not ready to be published because characterization of the mouse line takes time and, at this moment, the signal-to-noise ratio remains low. We hope that the reviewer can understand.

      Author response image 1.

      Endogeneous PKA-Cα translocate to dendritic spines upon activation.

      Second, experiments which would advance and validate these findings in vivo would be highly valuable. This could be achieved in a number of ways - one would be overexpression of tagged PKA versions and examining sub-cellular distribution before and after physiological activation in vivo. Another possibility is in vivo perturbation - one would speculate that disruption or tethering of PKA subunits to the dendrite would lead to cell-specific functional and structural impairments. This could be achieved in a similar manner to the in vitro experiments, with a PKA KO and replacement strategy of the tethered C-R plasmid, followed by structural or functional examination of neurons.

      I would like to state that these experiments are not essential in my opinion, but any improvements in one of these directions would greatly improve and extend the impact and findings of this paper.

      We thank the reviewer for the suggestion and the understanding. The suggested in vivo experiments are fascinating. However, in vivo imaging of dendritic spine morphology is already in itself challenging. The difficulty greatly increases when trying to detect partial, likely transient translocation of a signaling protein. It is also very difficult to knock down endogenous PKA while simultaneously expressing the R-C construct in a large number of cells to achieve detectable circuit or behavioral effect (and hope that compensation does not happen over weeks). We hope the reviewer agrees that these experiments would be their own project and go beyond the time and scope of the current study.

      Reviewer #3 (Recommendations For The Authors):

      Please elaborate on the methods used to visualize PKA-RIIα and PKA-RIβ subunits.

      As suggested, we have now included additional details for visualizing PKA-Rs in the text. Specifically, we write (pg. 5): “…, as visualized using expressed PKA-R-mEGFP in separate experiments (Figs. 1A-1C).”.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      This manuscript by Vuong and colleagues reports a study that pooled data from 3 separate longitudinal studies that collectively spanned an observation period of over 15 years. The authors examined for correlation between viraemia measured at various days from illness onset with thrombocytopaenia and severe dengue, according to the WHO 2009 classification scheme. The motivation for this study is both to support the use of viraemia measurement as a prognostic indicator of dengue and also when an antiviral drug becomes licensed for use, to guide the selection of patients for antiviral therapy. They found that the four DENVs show differences in peak and duration of viraemia and that viraemia levels before day 5 but not those after from illness onset correlated with platelet count and plasma leakage at day 7 onwards. They concluded that the viraemia kinetics call for early measurement of viraemia levels in the early febrile phase of illness.

      Strengths:

      This is a unique study due to the large sample size and longitudinal viraemia measurements in the study subjects. The data addresses a gap in information in the literature, where although it has been widely indicated that viraemia levels are useful when collected early in the course of illness, this is the first time anyone has systematically examined this notion.

      Weaknesses:

      The study only analysed data from dengue patients in Vietnam. Moreover, the majority of these patients had DENV-1 infection; few had DENV-4 infection. The data could thus be skewed by the imbalance in the prevalence of the different types of DENV during the period of observation. The use of patient-reported time of symptom onset as a reference point for viraemia measurement is pragmatic although there is subjectivity and thus noise in the data.

      We acknowledge and appreciate your comments regarding the limitations of our study, including the pooled data from Vietnam and the use of symptom onset as a reference point for viremia kinetics. These points have been incorporated into the “Limitations” section.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript highlights very important findings in the field, especially in designing clinical trials for the evaluation of antivirals.

      Strengths:

      The study shows significant differences between the kinetics of viral loads between serotypes, which is very interesting and should be taken into account when designing trials for antivirals.

      Weaknesses:

      The kinetics of the viral loads based on disease severity throughout the illness are not described, and it would be important if this could be analyzed.

      In response to your suggestion, we have expanded our analysis to investigate the relationship between the rate of viremia decline and clinical outcomes. Our findings demonstrate that a faster rate of viremia decline is associated with a reduced risk of severe clinical outcomes. We have incorporated this new analysis into the revised manuscript, providing further details in the “Statistical Analysis” section (page 7) and presenting the results on pages 15 and in Figure 6.

      Reviewer #1 (Recommendations For The Authors):

      Several areas require additional attention. I have limited my comments on the findings as I am not a mathematician and cannot knowledgeably comment on the statistical modelling methods.

      Comment #1: Lines 83-84. Although viraemia level shows declining trends from illness onset and thus lessens its prognostic value, it remains unknown if a more rapid rate of decline in viraemia is associated with a reduced risk of severe dengue. This is the fundamental premise of antiviral drug development for the treatment of dengue. The authors are uniquely poised to show if this logic that underpins antiviral development is likely correct and perhaps even estimate the extent to which a decline in viraemia needs to occur for a measurable reduction in the risk of severe dengue. Could the authors consider such an analysis?

      We appreciate your valuable suggestion. In response, we have expanded our analysis to investigate the relationship between the rate of viremia decline and clinical outcomes Utilizing a model of viremia kinetics with the assumption of a linear log-10 viremia decrease over time, we calculated the rate of decline for each patient. Our findings demonstrate that a faster rate of viremia decline is associated with a significantly reduced risk of severe clinical outcomes. We have incorporated this new analysis into the revised manuscript, providing further details in the “Statistical Analysis” section (page 7) and presenting the results on pages 15 and in Figure 6.

      Comment #2: Lines 101-102. Studies A and B were conducted in parallel, and several patients enrolled in study A from primary healthcare clinics were eventually also enrolled in study B upon hospitalization. It would be helpful to know how many patients from study A were included in study B. It would also be useful for the authors to indicate if such inclusion would constitute double-counting at any point in their analyses.

      To address potential confusion regarding patient overlap between studies A and B, we have provided further clarification in the revised manuscript’s Legend of Figure 1. Among confirmed dengue patients, 31 individuals enrolled in study A were later included in study B upon hospitalization. Of these, 9 had viremia measurements available in both studies and were consequently analysed in study A only. The remaining 22 lacked viremia data in study A but had measurements in study B, leading to their inclusion in study B in the analysis. We have taken meticulous care to ensure no patient data is double-counted.

      Comment #3: Lines 126-127. The definition of probable primary and secondary dengue from IgG measurements needs more detail. How was the anti-DENV IgG ELISA data from paired sera interpreted?

      To ensure clarity, we have moved the definitions of probable primary and secondary infections from the supplementary file (Appendix 2) to the main text of the revised manuscript (Methods section – Plasma viremia measurement, dengue diagnostics, and clinical endpoints – page 6): “A probable primary infection was defined by two negative/equivocal IgG results on separate samples taken at least two days apart within the first ten days of symptom onset, with at least one sample during the convalescent phase (days 6-10). A probable secondary infection was defined by at least one positive IgG result during the first ten days. Cases without time-appropriate IgG results were classified as indeterminate.”

      Comment #4: Lines 230-232 and Figure 4. The findings reported in Figure 4 are curious. Why is the platelet count highest (significantly?) for DENV-1 compared to other DENV-type infections at low viraemia levels on LM days 1-3? Does that also mean that DENV-3 and -4 infections have a greater impact on platelet counts at days 7-10 than DENV-1 and -2?

      In our analyses, we allowed the relation between viremia and platelet count to differ by serotype. Figure 4 shows the highest platelet counts for DENV-1 compared to other serotypes, especially at low viremia levels. Apparently, while DENV-1 on average has higher viremia (Figure 3), the same viremia level in DENV-1 compared to other serotypes is associated with a less severe disease course and higher platelet count. This does not necessarily imply that platelet count overall, uncorrected for viremia level, differs by genotype. Indeed, our unpublished analysis (shown below) indicates a modest influence of serotype on platelet count.

      Author response image 1.

      Comment #5: Figure 5. In a recent paper (Vuong et al, Clin Infect Dis 2021), the authors show elegantly that the viraemia levels on admission correlated with severe dengue. However, these correlations were different for each of the four DENV types and whether the infection was primary or secondary. Why wasn't the analysis in Figure 5 further stratified by their probable primary or secondary dengue status?

      We appreciate your feedback and have stratified Figure 5 by serotype and immune status as suggested. Please note that due to the limited number of severe dengue in primary infections (only 1 case in DENV-1) and plasma leakage in primary DENV-4 (see Appendix 4-table 1), the estimated probability of having these outcomes is nearly zero across all viremia levels within these subgroups.

      Comment #6: Line 279. The description in this line is at odds with the data in Figure 3A, which shows that DENV-2 could be detected over a longer period than DENV-1 as the one-step RT-qPCR assay has a lower detection limit than DENV-1.

      In response to your feedback, we have revised the description to clarify that DENV-1 exhibits higher viremia levels compared to DENV-2 and DENV-3 in the revised manuscript (page 18).

      Reviewer #2 (Recommendations For The Authors):

      Introduction

      Comment #1: Line 56: the authors state that viraemia is associated with dengue disease severity and cite their previous results. They then summarize the results of this study and others. The highlights of this paper should be described in more detail. It is important that the authors state the conclusions of their own paper, including that the association was not very strong and that the viral loads were lowest with DENV2, but DENV2 was associated with more severe disease.

      Thank you for your comment. To improve the introduction’s flow, we have removed that sentence in line 56 of the manuscript and have added the weak association in the next paragraph (pages 3-4).

      Comment #2: It would be important to cite smaller studies that show a delay in clearance of the virus being associated with more severe disease outcomes.

      Thanks for your suggestion. We have added information to the introduction (page 4), highlighting a study which found a slower rate of viral clearance to be associated with more severe outcomes (Wang et al., 2008). However, other studies have shown no association (Vaughn et al., 2000; Fox et al., 2011). This lack of conclusive evidence underscores the need for further research.

      Methods

      Comment #3: The authors highlight the possible discrepancies in comparing viral kinetics of two RT-PCR methods. Although it is not ideal to combine such results, the authors have analyzed them separately, providing valuable data.

      We appreciate your comment.

      Comment #4: Which tests were used to define the immune status as primary and secondary? What were the definitions?

      We have moved the definitions of probable primary and secondary infections from the supplementary file (Appendix 2) to the main text of the revised manuscript (Methods section – Plasma viremia measurement, dengue diagnostics, and clinical endpoints – page 6): “A probable primary infection was defined by two negative/equivocal IgG results on separate samples taken at least two days apart within the first ten days of symptom onset, with at least one sample during the convalescent phase (days 6-10). A probable secondary infection was defined by at least one positive IgG result during the first ten days. Cases without time-appropriate IgG results were classified as indeterminate.”

      Results

      Comment #5: It is interesting that DENV2 showed the slowest decline, but yet associated with overall lower viral loads during early illness and more severe disease outcomes. Could delayed clearance of the virus be associated with disease severity?

      We have expanded our analysis to investigate the relationship between the rate of viremia decline and clinical outcomes Utilizing a model of viremia kinetics with the assumption of a linear log-10 viremia decrease over time, we calculated the rate of decline for each patient. Our findings demonstrate that a faster rate of viremia decline is associated with a significantly reduced risk of severe clinical outcomes. We have incorporated this new analysis into the revised manuscript, providing further details in the “Statistical Analysis” section (page 7) and presenting the results on pages 15 and in Figure 6.

      Comment #6: Were there any differences in the kinetics of viral loads in children vs adults? I.e. children, young adults and older adults (>60 or 50?). Or were there insufficient numbers for this comparison?

      To address this point, we have modified the reported results of Figure 3-D by ages of 5, 10, 15, 25, and 50 years, represented children, adolescents, young adults, and older adults. Our analysis shows that viremia kinetics are largely similar across ages.

      Comment #7: Did any patients have comorbidities such as diabetes, obesity etc... if so, were there any differences in the viral loads?

      We appreciate your interest in the potential impact of comorbidities on viral loads. However, due to data limitations, we were unable to analyze this association. Only 6 patients had documented diabetes in the pooled dataset. In study C, 39 patients had obesity, whereas body mass index data is not available for studies A and B, although reports suggest a lower prevalence of obesity compared to study C.

      Comment #8: Were there any differences in the kinetics of the overall viral loads between DF/DHF/DSS or dengue with warning signs, without warning signs and severe dengue? Especially related to the time for viral clearance?

      Thank you for your suggestion. Such analysis reverses time and the causal direction, while we are more interested in looking forward. Therefore, instead of analyzing viremia kinetics based on disease severity, we have added an analysis to investigate the relationship between the rate of decline in viremia and clinical outcomes, as shown in the response to your comment #5. Results show that a more rapid rate of viremia decline is associated with a reduced risk of more severe clinical outcomes. In addition, in this study, we selected two clinical outcomes severe dengue and plasma leakage. The definitions are based on the WHO 2009 guidelines and standard endpoint definitions for dengue trials (Tomashek et al., 2018).

    1. Author Response

      The following is the authors’ response to the original reviews.

      We thank the reviewers and the editors for their careful reading of our manuscript and for the detailed and constructive feedback on our work. Please find attached the revised version of the manuscript. We performed an extensive revision of the manuscript to address the issues raised by the referees. We provide new analyses (regarding the response consistency and the neural complexity), added supplementary figures and edits to figures and texts. Based on the reviewers’ comments, we introduced several major changes to the manuscript.

      Most notably, we

      • added a limitation statement to emphasize the speculative nature of our interpretation of the timing of word processing/associative binding

      • emphasized the limitations of the control condition

      • added analyses on the interaction between memory retrieval after 12h versus 36h

      • clarified our definition of episodic memory

      • added detailed analyses of the “Feeling of having heard” responses and the confidence ratings

      We hope that the revised manuscript addresses the reviewers' comments to their satisfaction. We believe that the revised manuscript has been significantly improved owing to the feedback provided. Below you can find a point-by-point response to each reviewer comment in blue. We are looking forward that the revision will be published in the Journal eLife.

      Reviewer #1 (Public Review):

      The authors show that concurrently presenting foreign words and their translations during sleep leads to the ability to semantically categorize the foreign words above chance. Specifically, this procedure was successful when stimuli were delivered during slow oscillation troughs as opposed to peaks, which has been the focus of many recent investigations into the learning & memory functions of sleep. Finally, further analyses showed that larger and more prototypical slow oscillation troughs led to better categorization performance, which offers hints to others on how to improve or predict the efficacy of this intervention. The strength here is the novel behavioral finding and supporting physiological analyses, whereas the biggest weakness is the interpretation of the peak vs. trough effect.

      R1.1. Major importance:

      I believe the authors could attempt to address this question: What do the authors believe is the largest implication of this studies? How far can this technique be pushed, and how can it practically augment real-world learning?

      We revised the discussion to put more emphasis on possible practical applications of this study (lines 645-656).

      In our opinion, the strength of this paper is its contribution to the basic understanding of information processing during deep sleep, rather than its insights on how to augment realworld learning. Given the currently limited data on learning during sleep, we believe it would be premature to make strong claims about potential practical applications of sleep-learning. In addition, as pointed out in the discussion section, we do not know what adverse effects sleep-learning has on other sleep-related mechanisms such as memory consolidation.

      R1.2. Lines 155-7: How do the authors argue that the words fit well within the half-waves when the sounds lasted 540 ms and didn't necessarily start right at the beginning of each half-wave? This is a major point that should be discussed, as part of the down-state sound continues into the up-state. Looking at Figure 3A, it is clear that stimulus presented in the slow oscillation trough ends at a time that is solidly into the upstate, and would not neurolinguists argue that a lot of sound processing occurs after the end of the sound? It's not a problem for their findings, which is about when is the best time to start such a stimulus, but it's a problem for the interpretation. Additionally, the authors could include some discussion on whether possibly presenting shorter sounds would help to resolve the ambiguities here.

      The word pairs’ presentations lasted on average ~540 ms. Importantly, the word pairs’ onset was timed to occur 100 ms before the maximal amplitude of the targeted peaks/troughs.

      Therefore, most of a word’s sound pattern appeared during the negative going half-wave (about 350ms of 540ms). Importantly, Brodbeck and colleagues (2022) have shown that phonemes are continuously analyzed and interpreted with delays of about 50-200 ms, peaking at 100ms delay. These results suggest that word processing started just following the negative maximum of a trough and finished during the next peak. Our interpretation (e.g. line 520+) suggests that low-level auditory processing reaches the auditory cortex before the positive going half-wave. During the positive going half-wave the higher-level semantic networks appear the extract the presented word's meaning and associate the two simultaneously presented words. We clarified the time course regarding slow-wave phases and sound presentation in the manuscript (lines 158-164). Moreover, we added the limitation that we cannot know for sure when and in which slow-wave phase words were processed (lines 645-656). Future studies might want to look at shorter lasting stimuli to narrow down the timing of the word processing steps in relation to the sleep slow waves.

      R1.3. Medium importance:

      Throughout the paper, another concern relates to the term 'closed-loop'. It appears this term has been largely misused in the literature, and I believe the more appropriate term here is 'real-time' (Bergmann, 2018, Frontiers in Psychology; Antony et al., 2022, Journal of Sleep Research). For instance, if there were some sort of algorithm that assessed whether each individual word was successfully processed by the brain during sleep and then the delivery of words was subsequently changed, that could be more accurately labelled as 'closed-loop'.

      We acknowledge that the meaning of “closed-loop” in its narrowest sense is not fulfilled here. We believe that “slow oscillation phase-targeted, brain-state-dependent stimulation” is the most appropriate term to describe the applied procedure (BSDBS, Bergmann, 2018). We changed the wording in the manuscript to brain-state-dependent stimulation algorithm. Nevertheless, we would like to point out that the algorithm we developed and used (TOPOSO) is very similar to the algorithms often termed closed-loop algorithm in memory and sleep (e.g. Esfahani et al., 2023; Garcia-Molina et al., 2018; Ngo et al., 2013, for a comparison of TOPOSO to these techniques see Wunderlin et al., 2022 and for more information about TOPOSO see Ruch et al., 2022).

      R1.4. Figure 5 and corresponding analyses: Note that the two conditions end up with different sounds with likely different auditory complexities. That is, one word vs. two words simultaneously likely differ on some low-level acoustic characteristics, which could explain the physiological differences. Either the authors should address this via auditory analyses or it should be added as a limitation.

      This is correct, the two conditions differ on auditory complexities. Accordingly, we added this issue as another limitation of the study (line 651-653). We had decided for a single word control condition to ensure that no associative learning (between pseudowords) could take place in the control condition because this was the critical learning process in the experimental condition. We would like to point out that we observed significant differences in brain responses to the presentation of word-pairs (experimental condition) vs single pseudowords (control condition) in the Trough condition, but not the Peak condition. If indeed low-level acoustic characteristics explained the EEG differences occurring between the two conditions then one would expect these differences occurring in both the trough and the peak condition because earlier studies showed that low-level acoustic processing proceeds in both phases of slow waves (Andrillon et al., 2016; Batterink et al., 2016; Daltrozzo et al., 2012).

      R1.5. Line 562-7 (and elsewhere in the paper): "episodic" learning is referenced here and many times throughout the paper. But episodic learning is not what was enhanced here. Please be mindful of this wording, as it can be confusing otherwise.

      The reported unconscious learning of novel verbal associations during sleep may not match textbook definitions of episodic memory. However, the traditional definitions of episodic memory have long been criticised (e.g., Dew & Cabeza, 2011; Hannula et al., 2023; Henke, 2010; Reder et al., 2009; Shohamy & Turk-Browne, 2013).

      We stand by our claim that sleep-learning was of episodic nature. Here we use a computational definition of episodic memory (Cohen & Eichenbaum, 1993; Henke, 2010; O’Reilly et al., 2014; O’Reilly & Rudy, 2000) and not the traditional definition of episodic memory that ties episodic memory to wakefulness and conscious awareness (Gabrieli, 1998; Moscovitch, 2008; Schacter, 1998; Squire & Dede, 2015; Tulving, 2002). We revised the manuscript to clarify that and how our definition differs from traditional definitions. Please see reviewer comment R3.1 for a more extensive answer.

      Reviewer #2 (Public Review):

      In this project, Schmidig, Ruch and Henke examined whether word pairs that were presented during slow-wave sleep would leave a detectable memory trace 12 and 36 hours later. Such an effect was found, as participants showed a bias to categorize pseudowords according to a familiar word that they were paired with during slow-wave sleep. This behavior was not accompanied by any sign of conscious understanding of why the judgment was made, and so demonstrates that long-term memory can be formed even without conscious access to the presented content. Unconscious learning occurred when pairs were presented during troughs but not during peaks of slow-wave oscillations. Differences in brain responses to the two types of presentation schemes, and between word pairs that were later correctly- vs. incorrectly-judged, suggest a potential mechanism for how such deep-sleep learning can occur.

      The results are very interesting, and they are based on solid methods and analyses. Results largely support the authors' conclusions, but I felt that there were a few points in which conclusions were not entirely convincing:

      R2.1. As a control for the critical stimuli in this study, authors used a single pseudoword simultaneously played to both ears. This control condition (CC) differs from the experimental condition (EC) in a few dimensions, among them: amount of information provided, binaural coherence and word familiarity. These differences make it hard to conclude that the higher theta and spindle power observed for EC over CC trials indicate associative binding, as claimed in the paper. Alternative explanations can be made, for instance, that they reflect word recognition, as only EC contains familiar words.

      We agree. In the revised version of the manuscript, we emphasise this as a limitation of our study (line 653-656). Moreover, we understand that the differences between stimuli of the control and the experimental condition must not rely only on the associative binding of two words. We cautioned our interpretation of the findings.

      Interestingly, EC vs CC exhibits differences following trough- but not peak targeting (see R1.4). If indeed all the EC vs CC differences were unrelated to associative binding, we would expect the same EC vs CC differences when peaks were targeted. Hence, the selective EC vs CC differences in the trough condition suggest that the brain is more responsive to sound, information, word familiarity and word semantics during troughs, where we found successful learning, compared to peaks, where no learning occurred. Troughtargeted word pairs (EC) versus foreign words (CC) enhanced the theta power 336 at 500 ms following word onset and this theta enhancement correlated significantly with interindividual retrieval performance indicating that theta probably promoted associative learning during sleep. This correlation was insignificant for spindle power.

      R2.2. The entire set of EC pairs were tested both following 12 hours and following 36 hours. Exposure to the pairs during test #1 can be expected to have an effect over memory one day later, during test #2, and so differences between the tests could be at least partially driven by the additional activation and rehearsal of the material during test #1. Therefore, it is hard to draw conclusions regarding automatic memory reorganization between 12 and 36 hours after unconscious learning. Specifically, a claim is made regarding a third wave of plasticity, but we cannot be certain that the improvement found in the 36 hour test would have happened without test #1.

      We understand that the retrieval test at 12h may have had an impact on performance on the retrieval test at 36h. Practicing retrieval of newly formed memories is known to facilitate future retrieval of the same memories (e.g. Karpicke & Roediger, 2008). Hence, practicing the retrieval of sleep-formed memories during the retrieval test at 12h may have boosted performance at 36h.

      However, recent literature suggests that retrieval practice is only beneficial when corrective feedback is provided (Belardi et al., 2021; Metcalfe, 2017). In our study, we only presented the sleep-played pseudowords at test and participants received no feedback regarding the accuracy of their responses. Thus, a proper conscious re-encoding could not take place. Nevertheless, the retrieval at 12h may have altered performance at 36h in other ways. For example, it could have tagged the reactivated sleep-formed memories for enhanced consolidation during the next night (Rabinovich Orlandi et al., 2020; Wilhelm et al., 2011).

      We included a paragraph on the potential carry-over effects from retrieval at 12h on retrieval at 36h in the discussion section (line 489-496; line 657-659). Furthermore, we removed the arguments about the “third wave of plasticity”.

      R2.3. Authors claim that perceptual and conceptual processing during sleep led to increased neural complexity in troughs. However, neural complexity was not found to differ between EC and CC, nor between remembered and forgotten pairs. It is therefore not clear to me why the increased complexity that was found in troughs should be attributed to perceptual and conceptual word processing, as CC contains meaningless vowels. Moreover, from the evidence presented in this work at least, I am not sure there is room to infer causation - that the increase in HFD is driven by the stimuli - as there is no control analysis looking at HFD during troughs that did not contain stimulation.

      With the analysis of the HFD we would like to provide an additional perspective to the oscillation-based analysis. We checked whether the boundary condition of Peak and Trough targeting changes the overall complexity or information content in the EEG. Our goal was to assess the change in neural complexity (relative to a pre-stimulus baseline) following the successful vs unsuccessful encoding of word pairs during sleep.

      We acknowledge that a causal interpretation about HFD is not warranted, and we revised the manuscript accordingly. It was unexpected that we could not find the same results in the contrast of EC vs CC or correct vs incorrect word pairs. We suggest that our signal-to noise ratio might have been too weak.

      One could argue that the phase targeting alone (without stimulation) induces peak/trough differences in complexity. We cannot completely rule out this concern. But we tried to use the EEG that was not influenced by the ongoing slow-wave: the EEG 2000-500ms before the stimulus onset and 500-2000ms after the stimulus onset. Therefore, we excluded the 1s of the targeted slow-wave, hoping that most of the phase inherent complexity should have faded out (see Figure 2). We could not further extend the time window of analysis due to the minimal stimulus onset interval of 2s. Of course we cannot exclude that the targeted Trough impacted the following HFD. We clarified this in the manuscript (line 384-425).

      Furthermore, we did find a difference of neural complexity between the pre-stimulus baseline and the post-stimulus complexity in the Peak condition but not in the Trough condition (we now added this contrast to the manuscript, line 416-419). Hence, the change in neural complexity is a reaction to the interaction of the specific slow-wave phase with the processing of the word pairs. Even though these results cannot provide unambiguous, causal links, we think they can figure as an important start for other studies to decipher neural complexity during slow wave sleep.

      Reviewer #3 (Public Review):

      The study aims at creating novel episodic memories during slow wave sleep, that can be transferred in the awake state. To do so, participants were simultaneously presented during sleep both foreign words and their arbitrary translations in their language (one word in each ear), or as a control condition only the foreign word alone, binaurally. Stimuli were presented either at the trough or the peak of the slow oscillation using a closed-loop stimulation algorithm. To test for the creation of a flexible association during sleep, participant were then presented at wake with the foreign words alone and had (1) to decide whether they had the feeling of having heard that word before, (2) to attribute this word to one out of three possible conceptual categories (to which translations word actually belong), and (3) to rate their confidence about their decision.

      R3.1. The paper is well written, the protocol ingenious and the methods are robust. However, the results do not really add conceptually to a prior publication of this group showing the possibility to associate in slow wave sleep pairs of words denoting large or small object and non words, and then asking during ensuing wakefulness participant to categorise these non words to a "large" or "small" category. In both cases, the main finding is that this type of association can be formed during slow wave sleep if presented at the trough (versus the peak) of the slow oscillation. Crucially, whether these associations truly represent episodic memory formation during sleep, as claimed by the authors, is highly disputable as there is no control condition allowing to exclude the alternative, simpler hypothesis that mere perceptual associations between two elements (foreign word and translation) have been created and stored during sleep (which is already in itself an interesting finding). In this latter case, it would be only during the awake state when the foreign word is presented that its presentation would implicitly recall the associated translation, which in turn would "ignite" the associative/semantic association process eventually leading to the observed categorisation bias (i.e., foreign words tending to be put in the same conceptual category than their associated translation). In the absence of a dis-confirmation of this alternative and more economical hypothesis, and if we follow Ocam's razor assumption, the claim that there is episodic memory formation during sleep is speculative and unsupported, which is a serious limitation irrespective of the merits of the study. The title and interpretations should be toned down in this respect

      Our study conceptually adds to and extends the findings by Züst et al. (a) by highlighting the precise time-window or brain state during which sleep-learning is possible (e.g. slow-wave trough targeting), (b) by demonstrating the feasibility of associative learning during night sleep, and (c) by uncovering the longevity of sleep-formed memories.

      We acknowledge that the reported unconscious learning of novel verbal associations during sleep may not match textbook definitions of episodic memory. However, the traditional definitions of episodic memory have long been criticised (e.g, (Dew & Cabeza, 2011; Hannula et al., 2023; Henke, 2010; Reder et al., 2009; Shohamy & Turk-Browne, 2013). We stand by our claim that sleep-learning was of episodic nature. We use a computational definition of episodic memory (Cohen & Eichenbaum, 1993; Henke, 2010; O’Reilly et al., 2014; O’Reilly & Rudy, 2000), and not the traditional definition of episodic memory that ties episodic memory to wakefulness and conscious awareness (Gabrieli, 1998; Moscovitch, 2008; Schacter, 1998; Squire & Dede, 2015; Tulving, 2002). The core computational features of episodic memory are 1) rapid learning, 2) association formation, and 3) a compositional and flexible representation of the associations in long-term memory.

      Therefore, we revised the manuscript to emphasize how our definition differs from traditional definitions (line 64).

      For the current study, we designed a retrieval task that calls on the core computational features of episodic memory by assessing flexible retrieval of sleep-formed compositional word-word associations. Reviewer 3 suggests an alternative interpretation for the learning observed here: mere perceptual associations between foreign words and translations words are stored during sleep, and semantic associations are only inferred at retrieval testing during ensuing wakefulness. First, these processing steps would require the rapid soundsound associative encoding, long-term storage, and the flexible sound retrieval, which would still require hippocampal processing and computations in the episodic memory system. Second, this mechanism seems highly laborious and inefficient. The sound pattern of a word at 12 hours after learning triggers the reactivation of an associated sound pattern of another word. This sound pattern then elicits the activation of the translation words’ semantics leading to the selection of the correct superordinate semantic category at test.

      Overall, we believe that our pairwise-associative learning paradigm triggered a rapid conceptual-associative encoding process mediated by the hippocampus that provided for flexible representations of foreign and translation words in episodic memory. This study adds to the existing literature by examining specific boundary conditions of sleep-learning and demonstrates the longevity (at least 36 hours) of sleep-learned associations.

      Other remarks:

      R3.2. Lines 43-45 : the assumption that the sleeping brain decides whether external events can be disregarded, requires awakening or should be stored for further consideration in the waking state is dubious, and the supporting references date from a time (the 60') during which hypnopedia was investigated in badly controlled sleep conditions (leaving open the doubt about the possibility that it occurred during micro awakenings)

      We revised the manuscript to add timelier and better controlled studies that bolster the 60ties-born claim (line 40-51). Recently, it has been shown that the sleeping brain preferentially processes relevant information. For example the information conveyed by unfamiliar voices (Ameen et al., 2022), emotional content (Holeckova et al., 2006; Moyne et al., 2022), our own compared to others’ names (Blume et al., 2018).

      R3.3. 1st paragraph, lines 48-53 , the authors should be more specific about what kind of new associations and at which level they can be stored during sleep according to recent reports, as a wide variety of associations (mostly elementary levels) are shown in the cited references. Limitations in information processing during sleep should also be acknowledged.

      In the lines to which R3 refers, we cite an article (Ruch & Henke, 2020) in which two of the three authors of the current manuscript elaborate in detail what kind of associations can be stored during sleep. We revised these lines to more clearly present the current understanding of the potential and the limitations of sleep-learning (line 40-51). Although information processing during sleep is generally reduced (Andrillon et al., 2016), a variety of different kinds of associations can be stored, ranging from tone-odour to word-word association (Arzi et al., 2012, 2014; Koroma et al., 2022; Züst et al., 2019).

      R3.4. The authors ran their main behavioural analyses on delayed retrieval at 36h rather than 12h with the argument that retrieval performance was numerically larger at 36 than 12h but the difference was non-significant (line 181-183), and that effects were essentially similar. Looking at Figure 2, is the trough effect really significant at 12h ? In any case, the fact that it is (numerically) higher at 36 than 12h might suggest that the association created at the first 12h retrieval (considering the alternative hypothesis proposed above) has been reinforced by subsequent sleep.

      The Trough effect at 12h is not significant, as stated on line 185 (“Planned contrasts against chance level revealed that retrieval performance significantly exceeded chance at 36 hours only (P36hours = 0.036, P12hours = 0.094).”). It seems that our wording was not clear. Therefore, we refined the description of the behavioural analysis in the manuscript (lines 188-193).

      In brief, we report an omnibus ANOVA with a significant main effect of targeting type (Trough vs Peak, main effect Peak versus Trough: F(1,28) = 5.237, p = 0.030, d = 0.865). Because Trough-targeting led to significantly better memory retention than Peak-targeting, we computed a second ANOVA, solely including participants with through-targeted word-pair encoding. The memory retention in the Trough condition is above chance (MTrough = 39.11%, SD = 10.76; FIntercept (1,14) = 5.660, p = 0.032) and does not significantly differ between the 12h and 36h retrieval (FEncoding-Test Delay (1,14) = 1.308, p = 0.272). However, the retrieval performance at 36h numerically exceeds the performance at 12h and the direct comparison against chance reveals that the 36h but not the 12h retrieval was significant (P36hours = 0.036, P12hours = 0.094). Hence, we found no evidence for above chance performance at the 12h retrieval and focused on the retrieval after 36h in the EEG analysis.

      We agree with the reviewer that the subsequent sleep seems to have improved consolidation and subsequent retrieval. We assume that the reviewer suggests that participants merely formed perceptual associations during sleep and encoded episodic-like associations during testing at 12h (as pointed out in R 3.1). However, we believe that it is unlikely that the awake encoding of semantic associations during the 12h retrieval led to improved performance after 36h. We changed the discussion regarding the interaction between retrieval at 12h and 36h (line 505-512, also see R 2.2)

      R3.5> In the discussion section lines 419-427, the argument is somehow circular in claiming episodic memory mechanisms based on functional neuroanatomical elements that are not tested here, and the supporting studies conducted during sleep were in a different setting (e.g. TMR)

      Indeed, the TMR and animal studies are a different setting compared to the present study. We re-wrote this part and only focused on the findings of Züst and colleagues (2019), who examined hippocampal activity during the awake retrieval of sleep-formed memories (lines 472-482). Additionally, we would like to emphasise that our main reasoning is that the task requirements called upon the episodic memory system.

      R3.6. Supplementary Material: in the EEG data the differentiation between correct and incorrect ulterior classifications when presented at the peak of the slow oscillation is only significant in association with 36h delayed retrieval but not at 12h, how do the authors explain this lack of effect at 12 hour ?

      We assume that the reviewer refers to the TROUGH condition (word-pairs targeted at a slow-wave trough) and not as written to the peak condition. We argue that the retention performance at 12h is not significantly above chance (M12hours = 37.4%, P12hours = 0.094).

      Hence, the distinction between “correctly” and “incorrectly” categorised word pairs was not informative for the EEG analysis during sleep. For whatever reason the 12h retrieval was not significantly above chance, the less successful memory recall and thus a less balanced trial count makes recall accuracy a worse delineator for separating EEG trials then the recall performance after 36 hours.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Minor importance:

      Abstract: The opening framing is confusing here and in the introduction. Why frame the paper in the broadest terms about awakenings and threats from the environment when this is a paper about intersections between learning & memory and sleep? I do understand that there is an interesting point to be made about the counterintuitive behavioral findings with respect to sleep generally being perceived as a time when stimuli are blocked out, but this does not seem to me to be the broadest points or the way to start the paper. The authors should consider this but of course push back if they disagree.

      We understand the reviewer’s criticism but believe that this has more to do with personal preferences than with the scientific value or validity of our work. We believe that it is our duty as researchers to present our study in a broader context because this may help readers from various fields to understand why the work is relevant. To some readers, evidence for learning during sleep may seem trivial, to others, it may seem impossible or a weird but useless conundrum. By pointing out potential evolutionary benefits of the ability to acquire new information during sleep, we help the broad readership of eLife understand the relevance of this work.

      Lines 31-32: "Neural complexity" -> "neural measures of complexity" because it isn't clear what "neural complexity" means at this point in the abstract. Though, note my other point that I believe this analysis should be removed.

      To our understanding, “neural complexity” is a frequently used term in the field and yields more than 4000 entries on google scholar. Whereas ‘neural measures of complexity’ only finds 3 hits on google scholar [September 2023]. In order to link our study with other studies on neural complexity, we would like to keep this terminology. As an example, two recent publications using “neural complexity” are Lee et al. (2020) and Frohlich et al. (2022).

      Lines 42-43: The line of work on 'sentinel' modes would be good to cite here (e.g., Blume et al., 2017, Brain & Language).

      We added the suggested citation to the manuscript (lines 52).

      Lines 84-90: While I appreciate the authors desire to dig deep and try to piece this all together, this is far too speculative in my opinion. Please see my other points on the same topic.

      In this paragraph, we point out why both peaks and troughs are worth exploring for their contributions to sensory processing and learning during sleep. Peaks and troughs are contributing mutually to sleep-learning. Our speculations should inspire further work aimed at pinning down the benefits of peaks and troughs for sleep-learning. We clarified the purpose and speculative nature of our arguments in the revised version of the manuscript.

      Line 109: "outlasting" -> "lasting over" or "lasting >"

      We changed the wording accordingly.

      Line 111: I believe 'nonsense' is not the correct term here, and 'foreign' (again) would be preferred. Some may be offended to hear their foreign word regarded as 'nonsense'. However, please let me know if I have misunderstood.

      We would like to use the linguistic term “pseudoword” (aligned with reviewer 2’s comment) and we revised the manuscript accordingly.

      Figure 1A: "Enconding" -> "Encoding"

      Thank you for pointing this out.

      Lines 201-2: Were there interactions between confidence and correctness on the semantic categorization task? Were correct responses given with more confidence than incorrect ones? This would not necessarily be a problem for the authors' account, as there can of course be implicit influences on confidence (i.e., fluency).

      As is stated in the results section, confidence ratings did not differ significantly between correct and incorrect assignments (Trough condition: F(1,14) = 2.36, p = 0.15); Peak condition: F(1,14) = 0.48, p = 0.50).

      Line 236: "Nicknazar" -> "Niknazar"

      Thank you for pointing this out.

      Line 266: "profited" -> "benefited"

      We changed the wording accordingly.

      Lines 280-4: There seems some relevance here with Malerba et al. (2018) and her other papers to categorize slow oscillations.

      Diving into the details on how to best categorise slow oscillations is beyond the scope of this manuscript. Here, we build on work from the field of microstate analyses and use two measures to describe and quantify the targeted brain states: the topography of the electric field (i.e., the correlation of the electric field with an established template or “microstate”), and the field strength (global field power, GFP). While the topography of a quasi-stable electric field reflects activity in a specific neural network, the strength (GFP) of a field most likely mirrors the degree of activation (or inactivity) in the specific network. Here, we find that consistent targeting of a specific network state yielding a strong frontal negativity benefitted learning during sleep. For a more detailed explanation of the slow-wave phase targeting see (Ruch et al., 2022).

      Lines 343-6: Was it intentional to have 0.5 s (0.2-0.7 s) surrounding the analysis around 500 ms but only 0.4 s (0.8-1.2 s) surrounding the analysis around 1 s? Could the authors use the same size interval or justify having them be different?

      We apologise for the misleading phrasing and we clarified this in the revised manuscript. We applied the same procedure for the comparison of later correctly vs incorrectly classified pseudowords as we did for the comparison between EC and CC. Hence, we analysed the entire window from 0s to 2.5s with a cluster-based permutation approach. Contrary to the EC vs CC contrast, no cluster remained significant for the comparison of the subsequent memory effect. By mistake we reported the wrong time window. In the revised manuscript, the paragraph is corrected (lines 364-369).

      Line 356-entire HFD section: it is unclear what's gained by this analysis, as it could simply be another reflection of the state of the brain at the time of word presentation. In my opinion, the authors should remove this analysis and section, as it does not add clarity to other aspects of the paper.

      (If the authors keep the section) Line 361-2 - "Moreover, high HFD values have been associated with cognitive processing (Lau et al., 2021; Parbat & Chakraborty, 2021)." This statement is vague. Could the authors elaborate?

      Please see our answer to Reviewer 2 (2.3) for a more detailed explanation. In brief, we would like to keep the analysis with the broad time window of -2 to -0.5 and from 0.5 to 2 s.

      Lines 403-4: How was it determined that these neural networks mediated both conscious/unconscious processes? Perhaps the authors meant to make a different point, but the way it reads to me is that there is evidence that some neural networks are conscious and others are not and both forms engage in similar functions.

      We revised the manuscript to be more precise and clear: “The conscious and unconscious rapid encoding and flexible retrieval of novel relational memories was found to recruit the same or similar networks including the hippocampus(Henke et al., 2003; Schneider et al., 2021). This suggests that conscious and unconscious relational memories are processed by the same memory system.” (p. 22, top).

      Lines 433-41: Performance didn't actually significantly increase from 12 to 36 hours, so this is all too speculative in my opinion.

      We removed the speculative claim that performance may have increased from the retrieval at 12 hours to the retrieval at 36 hours.

      Line 534: "assisted by enhanced" -> "coincident with". It's unclear whether theta reflects successful processing as having occurred or whether it directly affects or assists with it.

      We have adjusted the wording to be more cautious, as suggested (line 588).

      Line 572-4: Rothschild et al. (2016) is relevant here.

      Unfortunately, we do not see the relevance of this article within the context of our work.

      Line 577 paragraph: The authors may consider adding a note on the importance of ethical considerations surrounding this form of 'inception'.

      We extended this part by adding ethical considerations to the discussion section (Stickgold et al., 2021, line 657).

      Line 1366: It would be better if the authors could eventually make their data publicly available. This is obviously not required, but I encourage the authors to consider it if they have not considered it already.

      In my opinion, the discussion is too long. I really appreciate the authors trying to figure out the set of precise times in which each level of neural processing might occur and how this intersects with their slow oscillation phase results. However, I found a lot of this too speculative, especially given that the sounds may bleed into parts of other phases of the slow oscillation. I do not believe this is a problem unique to these authors, as many investigators attempting to target certain phases in the target memory reactivation literature have faced the same problem, but I do believe the authors get ahead of the data here. In particular, there seems to be one paragraph in the discussion that is multiple pages long (p. 22-24). This paragraph I believe has too much detail and should be broken up regardless, as it is difficult for the reader to follow.

      Considering the recent literature, we believe this interpretation best explains the data. As argued earlier, we believe that a speculative interpretation of the reported phenomena can provide substantial added value because it inspires future experimental work. We have improved the manuscript by clearly distinguishing between data and interpretation. We do declare the speculative nature of some offered interpretations. We hope that these speculations, which are testable hypotheses (!), will eventually be confirmed or refuted experimentally.

      Reviewer #2 (Recommendations For The Authors):

      I very much enjoyed the paper and think it describes important findings. I have a few suggestions for improvement, and minor comments that caught my eye during reading:

      (1) I was missing an analysis of CC ERP, and its comparison to EC ERP.

      We added this analysis to the manuscript (line 299-301). The comparison of CC ERP with EC ERP did not yield any significant cluster for either the peak (cluster-level Monte Carlo p=0.54) or the trough (cluster-level Monte Carlo p>0.37). We assume that the noise level was too high for the identification of differences between CC and EC ERP.

      (2) Regarding my public review comment #2, some light can be shed on between-test effects, I believe, using an item-based analysis - looking at correlations between items' classifications in test #1 and test #2. The assumption seems to be that items that were correct in test #1 remained correct in test #2 while other new correct classifications were added, owing to the additional consolidation happening between the two tests. But that is an empirical question that can be easily tested. If no consistency in item classification is found, on the other hand, or if only consistency in correct classification is found, that would be interesting in itself. This item-based analysis can help tease away real memory from random correct classification. For instance, the subset of items that are consistently classified correctly could be regarded as non-fluke at higher confidence and used as the focus of subsequent-memory analysis instead of the ones that were correct only in test #2.

      Thanks, we re-analysed the data accordingly. Participants were consistent at choosing a specific object category for an item at 12 hours and 36 hours (consistency rate = 47% same category, chance level is 1/3). Moreover, the consistency rate did not differ between the Trough and the Peak condition (MTrough = 47.2%, MPeak = 47.0%, P = 0.98). The better retrieval performance in the Trough compared to the Peak condition after 36 hours is due to: A) if participants were correct at 12h, they chose again the correct answer at 36h (Trough: 20% & Peak: 14%). B) Following an incorrect answer at 12h, participants switched to another object category at 36h (Trough: 72%, Peak: 67%). C) If participants switched the object category following an incorrect answer at 12h, they switched more often to the correct category at 36h in the trough versus the peak condition (Trough: in 56% & Peak: 53%). Hence, the data support the reviewer’s assumption: items that were correct after 12 hours remained correct after 36 hours, while other new correct classifications were generated at 36h owing to the additional consolidation happening between the two tests. We added this finding to the manuscript (line 191-200, Figure S6):

      Author response image 1.

      As suggested, we re-analysed the ERP with respect to the subsequent memory effect. This time we computed four conditions according to the reviewer’s argument about consistently correctly classified pseudowords, presented in the figure below: ERP of trials that were correctly classified at 36h (blue), ERP of trials that were incorrectly classified at 36h (light blue), ERP of trials that were correctly classified twice (brown) and ERP of trials that were not correctly classified twice (orange, all trials that are not in brown). Please note that the two blue lines are reported in the manuscript and include all trials. The brown and the orange line take the consistency into account and together include as well all trials.

      Author response image 2.

      By excluding even more trials from the group of correct retrieval responses, the noise level gets high. Therefore, the difference between the twice-correct and the not-twice-correct trials is not significant (cluster-level Monte Carlo p > 0.27). Because the ERP of twice-correct trials seems very similar to the ERP of the trials correctly classified at 36h at frontal electrodes, we assume that our ERP effect is not driven by a few extreme subjects. Similarly, not-twicecorrect trials (orange) have a stronger frontal trough than the trials incorrectly classified at 36h (light blue).

      (3) In a similar vein, a subject-based analysis would be highly interesting. First and foremost, readers would benefit from seeing the lines that connect individual dots across the two tests in figures 2B and 2C. It is reasonable to expect that only a subset of participants were successful learners in this experiment. Finding them and analyzing their results separately could be revealing.

      We added a Figure S1 to the supplementary material, providing the pairing between performance of the 12h and the 36h retrieval.

      It is an interesting idea to look at successful learners alone. We computed the ERP of the subsequent memory effect for those participants, who had an above change retrieval accuracy at 36h. The result shows a similar effect as reported for all participants (frontal cluster ~0-0.3s). The p-value is only 0.08 because only 9 of 15 participants exhibited an above chance retrieval performance at 36 hours.

      Author response image 3.

      ERP effect of correct (blue) vs incorrect (light blue) pseudoword category assignment of participants with a retrieval performance above chance at 36h (SD as shades):

      We prefer to not include this data in the manuscript, but are happy to provide it here.

      (4) I wondered why the authors informed subjects of the task in advance (that they will be presented associations when they slept)? I imagine this may boost learning as compared to completely naïve subjects. Whether this is the reason or not, I think an explanation of why this was done is warranted, and a statement whether authors believe the manipulation would work otherwise. Also, the reader is left wondering why subjects were informed only about test #1 and not about test #2 (and when were they told about test #2).

      Subjects were informed of all the tests upfront. We apologize for the inconsistency in the manuscript and revised the method part. The explanation of why participants were informed is twofold: a) Participants had to sleep with in-ear headphones. We wanted to explain to participants why these are necessary and why they should not remove them. b) We hoped that participants would be expecting unconsciously sounds played during sleep, would process these sounds efficiently and would remain deeply asleep (no arousals).

      (5) FoHH is a binary yes/no question, and so may not have been sensitive enough to demonstrate small differences in familiarity. For comparison, the Perceptual Awareness Scale (Ramsøy & Overgaard, 2004) that is typically used in studies of unconscious processing is of a 4-point scale, and this allows to capture more nuanced effects such as partial consciousness and larger response biases. Regardless, it would be informative to have the FoHH numbers obtained in this study, and not just their comparison between conditions. Also, was familiarity of EC and CC pseudowords compared? One may wonder whether hearing the pseudowords clearly vs. in one ear alongside a familiar word would make the word slightly more familiar.

      We apologize for having simplified this part too much in the manuscript. Indeed, the FoHH is comparable to the PAS. We used a 4-point scale, where participants rated their feeling of whether they have heard the pseudoword during previous sleep. In the revised manuscript, we report the complete results (line 203-223). The FoHH did not differ between any of the suggested contrasts. Thus, for both the peak and the trough condition, the FoHH did not differ between sleep-played vs new; correct EC trials vs new; correct vs incorrect EC trials; EC vs CC trials. To illustrate the results, a figure of the FoHH has been added to the supplement (Figure S4).

      (6) Similarly, it would be good to report the numbers of the confidence ratings in the paper as well.

      In the revised manuscript, we extended the description of the confidence rating results. We added the descriptive statistics (line 224-236) and included a corresponding figure in the supplement (Figure S5).

      Minor/aesthetic comments:

      We implemented all the following suggestions.

      (1) I suggest using "pseudoword" or "nonsense word" instead of "foreign word", because "foreign word" typically means a real word from a different language. It is quite confusing when starting to read the paper.

      After reconsidering, we think that pseudoword is the appropriate linguistic term and have revised the manuscript accordingly.

      (2) Lines 1000-1001: "The required sample size of N = 30 was determined based on a previous sleep-learning study". I was missing a description of what study you are referring to.

      (3) I am not sure I understood the claim nor the rationale made in lines 414-417. Is the claim that pairs did not form one integrated engram? How do we know that? And why would having one engram not enable extracting the meaning from a visual-auditory presentation of the cue? The sentence needs some rewording and/or unpacking.

      (4) Were categories counterbalanced (i.e., did each subjects' EC contain 9 animal words, 9 tool words and 9 place words)?

      (5) Asterisks indicating significant effects are missing from Figure 4 and S2.

      (6) Fig1 legend: "Participants were played with pairs" is ungrammatical.

      (7) Line 1093: no need for a comma.

      (8) Line 1336: missing opening parenthesis

      (9) Line 430: "observe" instead of "observed".

      (10) Line 466: two dots instead of one..

      Reviewer #3 (Recommendations For The Authors):

      Methods: 2 separate ANOVAs are performed (lines 160-185), but would not it make more sense to combine both in one ? If kept separated then a correction for multiple comparisons might be needed (p/2 = 0.025)

      We computed an omnibus ANOVA. In a next step, we examined the effect in the significant targeting condition by computing another ANOVA. For further explanations, see reviewer comment 3.4.

      References

      Ameen, M. S., Heib, D. P. J., Blume, C., & Schabus, M. (2022). The Brain Selectively Tunes to Unfamiliar Voices during Sleep. Journal of Neuroscience, 42(9), 1791–1803. https://doi.org/10.1523/JNEUROSCI.2524-20.2021

      Andrillon, T., Poulsen, A. T., Hansen, L. K., Léger, D., & Kouider, S. (2016). Neural Markers of Responsiveness to the Environment in Human Sleep. The Journal of Neuroscience, 36(24), Article 24. https://doi.org/10.1523/JNEUROSCI.0902-16.2016

      Arzi, A., Holtzman, Y., Samnon, P., Eshel, N., Harel, E., & Sobel, N. (2014). Olfactory Aversive Conditioning during Sleep Reduces Cigarette-Smoking Behavior. Journal of Neuroscience, 34(46), Article 46. https://doi.org/10.1523/JNEUROSCI.2291-14.2014

      Arzi, A., Shedlesky, L., Ben-Shaul, M., Nasser, K., Oksenberg, A., Hairston, I. S., & Sobel, N. (2012). Humans can learn new information during sleep. Nature Neuroscience, 15(10), Article 10. https://doi.org/10.1038/nn.3193

      Batterink, L. J., Creery, J. D., & Paller, K. A. (2016). Phase of Spontaneous Slow Oscillations during Sleep Influences Memory-Related Processing of Auditory Cues. Journal of Neuroscience, 36(4), 1401–1409. https://doi.org/10.1523/JNEUROSCI.3175-15.2016

      Belardi, A., Pedrett, S., Rothen, N., & Reber, T. P. (2021). Spacing, Feedback, and Testing Boost Vocabulary Learning in a Web Application. Frontiers in Psychology, 12. https://www.frontiersin.org/articles/10.3389/fpsyg.2021.757262

      Bergmann, T. O. (2018). Brain State-Dependent Brain Stimulation. Frontiers in Psychology, 9, 2108. https://doi.org/10.3389/fpsyg.2018.02108

      Blume, C., del Giudice, R., Wislowska, M., Heib, D. P. J., & Schabus, M. (2018). Standing sentinel during human sleep: Continued evaluation of environmental stimuli in the absence of consciousness. NeuroImage, 178, 638–648. https://doi.org/10.1016/j.neuroimage.2018.05.056

      Brodbeck, C., & Simon, J. Z. (2022). Cortical tracking of voice pitch in the presence of multiple speakers depends on selective attention. Frontiers in Neuroscience, 16. https://www.frontiersin.org/articles/10.3389/fnins.2022.828546

      Cohen, N. J., & Eichenbaum, H. (1993). Memory, Amnesia, and the Hippocampal System. A Bradford Book.

      Daltrozzo, J., Claude, L., Tillmann, B., Bastuji, H., & Perrin, F. (2012). Working memory is partially preserved during sleep. PloS One, 7(12), Article 12.

      Dew, I. T. Z., & Cabeza, R. (2011). The porous boundaries between explicit and implicit memory: Behavioral and neural evidence. Annals of the New York Academy of Sciences, 1224(1), 174–190. https://doi.org/10.1111/j.1749-6632.2010.05946.x

      Esfahani, M. J., Farboud, S., Ngo, H.-V. V., Schneider, J., Weber, F. D., Talamini, L. M., & Dresler, M. (2023). Closed-loop auditory stimulation of sleep slow oscillations: Basic principles and best practices. Neuroscience & Biobehavioral Reviews, 153, 105379. https://doi.org/10.1016/j.neubiorev.2023.105379

      Frohlich, J., Chiang, J. N., Mediano, P. A. M., Nespeca, M., Saravanapandian, V., Toker, D., Dell’Italia, J., Hipp, J. F., Jeste, S. S., Chu, C. J., Bird, L. M., & Monti, M. M. (2022). Neural complexity is a common denominator of human consciousness across diverse regimes of cortical dynamics. Communications Biology, 5(1), Article 1. https://doi.org/10.1038/s42003-022-04331-7

      Gabrieli, J. D. E. (1998). Cognitive neuroscience of human memory. Annual Review of Psychology, 87–115.

      Garcia-Molina, G., Tsoneva, T., Jasko, J., Steele, B., Aquino, A., Baher, K., Pastoor, S., Pfundtner, S., Ostrowski, L., Miller, B., Papas, N., Riedner, B., Tononi, G., & White, D. P. (2018). Closed-loop system to enhance slow-wave activity. Journal of Neural Engineering, 15(6), 066018. https://doi.org/10.1088/1741-2552/aae18f

      Hannula, D. E., Minor, G. N., & Slabbekoorn, D. (2023). Conscious awareness and memory systems in the brain. WIREs Cognitive Science, 14(5), e1648. https://doi.org/10.1002/wcs.1648

      Henke, K. (2010). A model for memory systems based on processing modes rather than consciousness. Nature Reviews Neuroscience, 11(7), Article 7. https://doi.org/10.1038/nrn2850

      Henke, K., Mondadori, C. R. A., Treyer, V., Nitsch, R. M., Buck, A., & Hock, C. (2003). Nonconscious formation and reactivation of semantic associations by way of the medial temporal lobe. Neuropsychologia, 41(8), Article 8. https://doi.org/10.1016/S0028-3932(03)00035-6

      Holeckova, I., Fischer, C., Giard, M.-H., Delpuech, C., & Morlet, D. (2006). Brain responses to a subject’s own name uttered by a familiar voice. Brain Research, 1082(1), 142–152. https://doi.org/10.1016/j.brainres.2006.01.089

      Karpicke, J. D., & Roediger, H. L. (2008). The Critical Importance of Retrieval for Learning. Science, 319(5865), 966–968. https://doi.org/10.1126/science.1152408

      Koroma, M., Elbaz, M., Léger, D., & Kouider, S. (2022). Learning New Vocabulary Implicitly During Sleep Transfers With Cross-Modal Generalization Into Wakefulness. Frontiers in Neuroscience, 16, 801666. https://doi.org/10.3389/fnins.2022.801666

      Lee, Y., Lee, J., Hwang, S. J., Yang, E., & Choi, S. (2020). Neural Complexity Measures. Advances in Neural Information Processing Systems, 33, 9713–9724. https://proceedings.neurips.cc/paper/2020/hash/6e17a5fd135fcaf4b49f2860c2474c7 c-Abstract.html

      Metcalfe, J. (2017). Learning from Errors. Annual Review of Psychology, 68(1), 465–489. https://doi.org/10.1146/annurev-psych-010416-044022

      Moscovitch, M. (2008). The hippocampus as a “stupid,” domain-specific module: Implications for theories of recent and remote memory, and of imagination. Canadian Journal of Experimental Psychology/Revue Canadienne de Psychologie Expérimentale, 62, 62–79. https://doi.org/10.1037/1196-1961.62.1.62

      Moyne, M., Legendre, G., Arnal, L., Kumar, S., Sterpenich, V., Seeck, M., Grandjean, D., Schwartz, S., Vuilleumier, P., & Domínguez-Borràs, J. (2022). Brain reactivity to emotion persists in NREM sleep and is associated with individual dream recall. Cerebral Cortex Communications, 3(1), tgac003. https://doi.org/10.1093/texcom/tgac003

      Ngo, H.-V. V., Martinetz, T., Born, J., & Mölle, M. (2013). Auditory Closed-Loop Stimulation of the Sleep Slow Oscillation Enhances Memory. Neuron, 78(3), Article 3. https://doi.org/10.1016/j.neuron.2013.03.006

      O’Reilly, R. C., Bhattacharyya, R., Howard, M. D., & Ketz, N. (2014). Complementary Learning Systems. Cognitive Science, 38(6), 1229–1248. https://doi.org/10.1111/j.1551-6709.2011.01214.x

      O’Reilly, R. C., & Rudy, J. W. (2000). Computational principles of learning in the neocortex and hippocampus. Hippocampus, 10(4), 389–397. https://doi.org/10.1002/1098-1063(2000)10:4<389::AID-HIPO5>3.0.CO;2-P

      Rabinovich Orlandi, I., Fullio, C. L., Schroeder, M. N., Giurfa, M., Ballarini, F., & Moncada, D. (2020). Behavioral tagging underlies memory reconsolidation. Proceedings of the National Academy of Sciences, 117(30), 18029–18036. https://doi.org/10.1073/pnas.2009517117

      Reder, L. M., Park, H., & Kieffaber, P. D. (2009). Memory systems do not divide on consciousness: Reinterpreting memory in terms of activation and binding. Psychological Bulletin, 135(1), Article 1. https://doi.org/10.1037/a0013974

      Ruch, S., & Henke, K. (2020). Learning During Sleep: A Dream Comes True? Trends in Cognitive Sciences, 24(3), 170–172. https://doi.org/10.1016/j.tics.2019.12.007

      Ruch, S., Schmidig, F. J., Knüsel, L., & Henke, K. (2022). Closed-loop modulation of local slow oscillations in human NREM sleep. NeuroImage, 264, 119682. https://doi.org/10.1016/j.neuroimage.2022.119682

      Schacter, D. L. (1998). Memory and Awareness. Science, 280(5360), 59–60. https://doi.org/10.1126/science.280.5360.59

      Schneider, E., Züst, M. A., Wuethrich, S., Schmidig, F., Klöppel, S., Wiest, R., Ruch, S., & Henke, K. (2021). Larger capacity for unconscious versus conscious episodic memory. Current Biology, 31(16), 3551-3563.e9. https://doi.org/10.1016/j.cub.2021.06.012

      Shohamy, D., & Turk-Browne, N. B. (2013). Mechanisms for widespread hippocampal involvement in cognition. Journal of Experimental Psychology: General, 142(4), 1159–1170. https://doi.org/10.1037/a0034461

      Squire, L. R., & Dede, A. J. O. (2015). Conscious and Unconscious Memory Systems. Cold Spring Harbor Perspectives in Biology, 7(3), a021667. https://doi.org/10.1101/cshperspect.a021667

      Stickgold, R., Zadra, A., & Haar, A. J. H. (2021). Advertising in Dreams is Coming: Now What? Dream Engineering. https://dxe.pubpub.org/pub/dreamadvertising/release/1

      Tulving, E. (2002). Episodic Memory: From Mind to Brain. Annual Review of Psychology, 53(1), 1–25. https://doi.org/10.1146/annurev.psych.53.100901.135114

      Wilhelm, I., Diekelmann, S., Molzow, I., Ayoub, A., Mölle, M., & Born, J. (2011). Sleep Selectively Enhances Memory Expected to Be of Future Relevance. Journal of Neuroscience, 31(5), 1563–1569. https://doi.org/10.1523/JNEUROSCI.3575-10.2011

      Wunderlin, M., Koenig, T., Zeller, C., Nissen, C., & Züst, M. A. (2022). Automatized online prediction of slow-wave peaks during non-rapid eye movement sleep in young and old individuals: Why we should not always rely on amplitude thresholds. Journal of Sleep Research, 31(6), e13584. https://doi.org/10.1111/jsr.13584

      Züst, M. A., Ruch, S., Wiest, R., & Henke, K. (2019). Implicit Vocabulary Learning during Sleep Is Bound to Slow-Wave Peaks. Current Biology, 29(4), 541-553.e7. https://doi.org/10.1016/j.cub.2018.12.038

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1

      As written in my public review I consider the science of this work to be high quality. I have some suggestions for the write-up though. As a general comment, I think that too much has been put into the appendices. In particular, the main text could contain more details about the model.

      We are pleased that this Reviewer feels that our work to be of “high quality”. We value the reviewer’s insightful suggestions and comments. Following this Reviewer’s suggestion we have moved certain sections to the main text.

      In what follows, we provide responses to each of the reviewer’s inquiry, and indicate the appropriate changes in the revised version.

      P2 -

      ϕ is introduce as packing fraction - on p3 it’s called cell density. Also it is not clear whether it is an area fraction or a cell number density. Please define properly and I would suggest sticking to one notion.

      ϕ is the cell packing fraction. In two dimensions (as is the case in our simulations) it is the area fraction. However, in order to stick to one general notation (independent of dimension) we use “packing fraction” to represent how densely the cells are packed. We changed it the revised manuscript to ensure uniformity.

      P3 -

      “which should and should slow down the overall dynamics” Typo?

      Corrected it in the revised manuscript.

      “One would intuitively expect that the ϕfree should decrease with increasing cell density” Please, define ϕfree

      ϕfree is defined in Eqn. 4. We ought to have defined it in the introduction.

      “When ϕ exceeds ϕS, the free area ϕfree saturates because the soft cells interpenetrate each other,” I suggest clearly distinguishing between biological cells and the agents (disks) used in the simulation. Please, also clarify What interpenetration of agents corresponds to in tissues?

      We have rewritten the sentence as, ”The simulations show that when..” Soft disks used in the simulations seem to be not an unrealistic model for biological cells. The small deformations noted in our model is not that different from the cells in the tissues. For visual reference, please see Author response image 1. In the left panel of the figure, a 2D snapshot of the experimental zebrafish tissue, displays the deformation of cells labeled as 1 and 2. Likewise, the right panel illustrates the extent to which such deformations are replicated in the simulation by allowing two cells to overlap (the white area in the right panel of Author response image 1 represents the interpenetration). In the revised manuscript, we have made the necessary change from “soft cells” to “soft disks.”

      Author response image 1.

      Snapshots of zebrafish tissue (left panel) (Ref. [14] main text) and model two dimensional tissue (right). In the right panel the white area represents the overlap and the black vertical line represents the intersection.

      “The facilitation mechanism, invoked in glassy systems [22] allows large cells to move with low mobility.” What is the facilitation mechanism?

      Facilitation, which is an intuitive idea, that refers to a mechanism by which cells in a in highly jammed environment can only move if the neighboring cells get out of the way. In our case (as shown in the text (Fig.3 (A) and Fig. 13 (A) & (B)) the smaller cells move faster almost independent of ϕ. When a small cell moves, it creates a void which could facilitate neighboring cells (including big ones) to move.

      “η (or relaxation time)” I suggest explaining the link between η and the relaxation time.

      First, in making this point on aging we only showed that the relaxation time is independent of the waiting time. In the revised manuscript we deleted η.

      Although not germane to this study, in the literature on glass transition, it is not uncommon to use relaxation time τα (as a proxy of viscosity η) to describe the dynamics. The relation between τα and η is given by

      where G∞ is the “infinite frequency” shear modulus, which holds in unjammed or in liquids. This relation suggests that τα is proportional to η, which is almost never satisfied in glass forming systems.

      P5 - “In addition, the elastic forces characterizing cell-cell interactions are soft, which implies that the cells can penetrate with rij − (Ri + Rj) < 0 when they are jammed.” Is this about the model or the biological tissue? Presumably the former, because real cells do not penetrate each other, right? What are rij, Ri and Rj?

      This is about the model. The cells are sufficiently soft that they can be deformed, which allows for modest interpenetration. Real cells exhibit similar behavior (see Fig. 1). In inset of Fig. 4 (b) rij is the center to center distance between cells with radii Ri and Rj. It is better to use the word overlap instead of penetrate, which is what we have done in the revised version.

      “we simulated a highly polydisperse system (PDs) in which the cell sizes vary by a factor of ∼ 8” Is it important to have a factor 8 - the zebra fish tissue presents a factor 5 − 6?

      This is an important question, which is difficult to answer using analytic theory. It does require simulations unfortunately. We do not know a priori the polysipersity value needed to observe saturation in η at high value of ϕ. However, we have shown that the a system with one type of cell (monodisperse) crystallizes. Furthermore, mixtures of two cell types do not show any saturation in η over the parameter range that we explored. A systematic simulation study is needed to explore a range of parameter values to determine the minimum PD, which would match the experimental findings.

      We performed 3D simulations to figure out if much less PD would yield saturation in η. Preliminary simulations in three dimensions with a lower value of PD (11.5% with a size variations by a factor of ≈ 2 ) exhibits saturation in the relaxation time. For comparison, the value of PD in the current work is ≈ 24% with a size variation by a factor of 8.

      P6 -

      “which is related to the Doolittle equation [26] for fluidity ( )” what is the Doolittle equation? Is it important here? Also: “VFT equation for cells”? Is it the same as given on p.2 - so nothing special for cells - or a different one?

      Historically, the Doolittle equation was proposed to describe the change in η in terms of free volume in the context polymer systems over 60 years ago. The physics in the polymers is very different from the soft models for cells considered here. Nevertheless, the equations has meaning in the context as well. The Doolittle (other names associated with similar equations are Ferry, Flory... ) equation is given by

      , where A and B are constants, V is the total volume and Vhc is the hardcore volume. Essentially, is the relative free volume. It can be shown that one can arrive at the VFT equation starting from the Doolittle equation.

      The VFT equation for cells is same as given in page 2, which we restate for completeness. Here, we introduce the apparent activation energy.

      “The stress-stress tensor” Why not simply stress tensor?

      We have corrected it.

      “shows qualitatively the same behavior as the estimate of viscosity (using dimensional arguments) made in experiments.” Where is this shown?

      The dependence of viscosity as a function ϕ is shown in Figure 1 (c).

      P7 -

      Fig 2A caption “dashed line” Maybe full line?

      This should be full line. It is fixed in in the revised manuscript.

      P8 -

      “a puzzling finding that is also reflected” Why is it puzzling?

      In figure 2 (C), it shows that the increase in the duration in the plateau of Fs(q,t) ceases when ϕ exceeds ≈ 0.90. This to us is puzzling (always a matter of perspective) because we expected that the duration of Fs(q,t) plateau to increase as a function of ϕ based on the VFT behavior for ϕ ≤ ϕS. As a result, we imagined that the relaxation time τα would continue to increase beyond ϕS. However, the simulations show that the relaxation time is essentially a constant for ϕ > 0.90, which implies that the soft disk system (our model for the tissue) is an unusual with behavior that has no counter part in the material world.

      “If the VFT relation continues” –“If the VFT relation continued”

      We have fixed it.

      First paragraph does not seem to be coherent

      What is RS (or Rs)?

      RS is the radius of the small cell. In the revised manuscript we have made this clear.

      P10 -

      Please, define the waiting time.

      The waiting time refers to the period between sample preparation and data collection either in experiments or in simulations. In an ergodic system, the properties should not depend on the waiting time provided provided it is large. In other words, after the system reaches thermal equilibrium, the waiting time tω should not have an impact on the properties of the system.

      “fully jammed” Please, define.

      The term “fully jammed” refers to a state in which the constituent particles in a system do not move. For example, it a hard sphere system at a packing fraction of approximately 0.84 is fully jammed, which implies there is wiggle room for a particle move without violating the excluded volume restriction. At this specific packing fraction, the hard sphere system undergoes a jamming transition, resulting in the particles becoming completely immobile. The nonconfluent tissue modeled here is not fully jammed.

      P11 -

      Fig.4 it is hard to see that the width of P(hij) increases with ϕ.

      Please see Author response image 2 with a less number of curves for a better visualization. We have replaced this figure in the revised version.

      Author response image 2.

      Probability of overlap (hij) between two cells, P(hij), for various ϕ values.

      “Thus, even if the cells are highly jammed at ϕ ≈ ϕS, free area is available because of an increase in the overlap between cells.” This conclusion seems premature at this point.

      The Referee is correct. This is shown in Fig. 5. We amended the ends of the sentence to reflect this observation.

      P12 -

      “as is the case when the extent of compression increases” extent of compression = density?

      This is correct. Extent of compression corresponds to the packing fraction or the density.

      “This effect is expected to occur with high probability at ϕS and beyond,” Why? What is special about ϕS.

      To achieve high packing fractions beyond a certain value of ϕ soft cells have, which would occur at a certain value ϕS. In the system studied here, ϕ ≈ 0.90 = ϕS. Note that ϕS could be altered by changing the system parameters.

      P15 -

      “local equilibrium” In a thermodynamic sense? There is also cell migration, so thermodynamic equilibrium does not seem to be appropriate.

      This is an important point. The observation that equilibrium concepts hold in what is manifestly a non-equilibrium system is a surprise. It is referred in a thermodynamic sense. We agree with the reviewer because of cell division (in Ref. [14] main text), cell death, thermodynamic equilibrium does not seems to be appropriate. This is exactly the point we raise in the introduction. However, considering the timescale of cell division and death it appears that there may be a local steady state, which we we call a “local equilibrium”. As a consequence phase transition ideas and Green-Kubo relations are applicable. Indeed, a surprise in the conclusion in Ref. [14] is that in the zebrafish morphogenesis equilibrium description seems adequate.

      “number of near neighbor cells that is in contact with the ith cell. The jth cell is the nearest neighbor of the ith cell, if hij > 0” A neighbour cell or the nearest neihbor?

      A neighbour cell is accurate.

      P16 -

      “In our model there is no dynamics with only systematic forces because the temperature is zero.” What is a systematic force? I do not understand the sentence.

      Systematic force between two cells is defined in Eqn. 5 in the main text. Because temperature is not a relevant variable in our model, we want to emphasize that in the absence of self propulsion, the cells would not move at all.

      Reviewer #2

      Major comments:

      A/ Role of size polydispersity

      In the text, and also in the methods (Appendix A), the authors mention that they need large polydispersity of particle sizes to explain the viscous plateau, as the dynamics of small vs large cells are ”dramatically different” (Appendix G). They simulate a system where cell sizes vary by a factor 8, mentioning this is typical in tissues, but I found this quite surprising - this would be heterogeneities in cell volume of 500, many orders of magnitude above what has been measured in tissues. As far as I’m aware, divisions are quite symmetric and synchronous in early vertebrate embryogenesis, so volume variations are expected to be very small (similarly in epithelial tissues, where jamming has been looked at extensively, I’m not aware of examples with ratio of 8 between cell diameters). One question I had is that when the authors look at ”small polydispersity”, there are 50 − 50 mixtures. Would small polydispersity with continuous distributions change this picture? Could they take their current simulations but smoothly change the ratio of polydispersity from 8 to 0 to see exactly how much they need to explain viscosity plateauing, and at which point is the transition?

      We thank the reviewer for raising this important question, which was also a concern for Reviewer #1. The value of polydispersity (PD) required to observe such behavior is not known a priori even within the simple model used. We selected a PD value, with a size variation of a factor of 8, guided in part by the experiment (projection onto 2D) shown in Figure 1(B) and Figure 6(D). We also showed that the monodisperse system crystallizes, and the binary system do not show signs of saturation within the explored range of parameter space and ϕ. This suggests that a certain degree of size dispersity is necessary to obtain saturation in η.

      As discussed in Appendix B, the binary system is characterized by the variables , where RB and RS represent the radii of the big and small cells, respectively, and the packing fraction ϕ. By more fully exploring the parameter space encompassing λ and ϕ than we did, it maybe possible, as the Referee suggests, that a system with two different cell sizes would yield the experimentally observed dependence of η on ϕ.

      As part of an answer to the Reviewer #1 on a the same issue, we mentioned results of preliminary simulations in three dimensions with reduced levels of polydispersity, and discovered that at lower levels of polydispersity (variation in size by a factor of ≈ 2 and polydispersity value 11.50%), the relaxation time does saturate beyond a certain packing fraction (see Fig. 3). We have not established if η, the key quantity of interest, would exhibit a similar behavior in 3D.

      Author response image 3.

      (A) τα as a function of ϕ for 11% polydispersity with size variation by a factor of ∼ 2 in the three dimensional system. (B) Same as (A) except polydispersity value is 24% and a size variation by a factor of ∼ 8.

      B/ Role of fluctuations/self-propulsion in this system, and relationship to recent findings

      “A priori it is unclear why equilibrium concepts should hold in zebrafish morphogenesis, which one would expect is controlled by non-equilibrium processes such as self-propulsion, growth and cell division. ”

      This is raised as a key paradox, but is not very clear to me in the context raised by the authors. In particular, they use self-propulsion as a source of activity and explain the evolution of viscosity but a facilitation process involving re-arrangements/motility. But I don’t think self-propulsion has been argued to play a role in zebrafish blastoderm - Ref 14 argues that this is effectively a zerotemperature phenomenon and that cell motility/rearrangements do not show any correlation with viscosity. So this part of the model assumption was not clear to me in relationship with the proposed experimental system. Active noise has been proposed to play key roles in other systems, including motility-driven and tension fluctuation-driven unjamming (among many others Bi et al, PRX, 2016, Mitchel et al, Nat Comm, 2020, Pinheiro et al, Nat Phys, 2022 as well as Kim & Campas, Nat Physics, 2021) - maybe this is somewhere where the author model could fit? In Kim & Campas, Nat Phys, 2021 in particular, the authors develop simulations of non-confluent tissues with noise, that seems to bear some resemblance to the model developed here, so it would be important to discuss the similarities and distinctions (usually I think polydispersity is not considered indeed). In general, the authors look here at a particle based model, but cells have adhesions with well-defined contact angles, so there is a question of the cross-over between their findings and the large body of recent literature on active foams/vertex models (which are not really discussed there).

      We appreciate the lengthy comment here, and there is a lot to unpack. We also thank the referee for the references, some of which we did not know about earlier.

      The primary objective of our study is to determine the simplest minimal model that would explain the experimentally observed dependence of viscosity in zebrafish blastoderm tissue as ϕ is increased beyond a certain packing fraction during morphogenesis. In Reference 14, the authors analyzed the data using the framework of rigidity percolation theory and presented evidence of a genuine equilibrium phase transition. Consequently, one would that expect zebrafish blastoderm tissue to be in equilibrium, which is surprising from many perspectives. However, since the tissue is a growing system involving numerous cell divisions and cell death, it is not immediately evident whether the assumption of equilibrium is valid. Indeed, the same problem arises when considering the glass transition where rapid cooling drives the system out of equilibrium. Nevertheless, heat capacity and η are often analyzed using the notion of equilibrium. Hence, considering this issue within the context of our research appears to be reasonable.

      To the best of our knowledge, the authors in Ref. 14 did not provide an explanation for the η behavior. The focus was, which was excellent and is the basis on which we initiated this study, was on the use of rigidity percolation theory to explain the results. Indeed, they performed an experiment by mildly reducing myosin II activity, which apparently affects cell motility. The quantitative effect was not reported.

      We did not impose any requirement of cell rearrangements etc in the model. There is essentially one variable, free area available, that explains the η dependence on ϕ. It is possible that one can come up with other zero temperature models that could also explain the data. To the best of our knowledge, it has not been proposed.

      It would be interesting to set our model in the context of other models that the referee points out. This would be an interesting research topic to explore. The only comment we would like to make is that it is unclear how vertex model for confluent tissues could explain the viscosity data.

      C/ Calculation of the effective shear viscosity

      The authors calculate viscosity from a Green-Kubo relation, although it would be good to clarify at which time scale (and maybe even shear amplitude) they expect this to be valid. These kinds of model would be expected to show plastic rearrangements for large deformations for instance, could the authors simulate realistic rheological deformations (e.g. Kim & Campas, 2021 applying external shear on the simulations) to see how much this matches both their expectation and the data?

      Once it is established that there is local equilibrium (as implied by the use of phase transition ideas to analyse the experimental data in Ref. 14), it is natural to use the Green-Kubo relation to calculate transport properties. Hence, for our purposes, it is valid for all time scales and amplitude. The Reviewer also wonders if the model could be used to simulate response to shear in order to probe rheological properties. There is no conceptual issue here and indeed this is an excellent suggestion that we intend to pursue in the future.

      D/ Role of cell adhesion

      The authors consider soft elastic disks of different sizes but unless I missed it, there is no adhesion being considered. This is expected to play a key role in jamming and multicellular mechanics, so I think the authors should either look at what this changes in their simulations, or at least discuss why they are neglecting it. One reason I’m asking is that it’s not totally clear to me that the ”free space” picture, coming from the fact that cells can interpenetrate in their model would hold in a model of deformable cells adhering to each other with constant volume (leading to more equilibration of deformations it would seem?).

      The referee raises another question regarding the lack of adhesion in the simulations. As pointed out before, we were trying to create a minimal model to account for the experimental observations for η upon changing the packing fraction. Thus, we a coarse-grained model where we considered poly-disperse cells with elastic interactions which recapitulates the experimental observations. The referee is correct that adhesion plays a role in jammed systems, and examination of how it would affect is an aspect that would be interesting to consider in the future. We hasten to add that even systems without attractive adhesion-type interaction become jammed. In principle, in many-body systems, the parameter space is large and one needs to carefully determine which parameter is important for the problem at hand. Therefore, in the first pass we did not find the need to consider the role of adhesion.

      Minor comments:

      The writing could be condensed in some places, with some details being moved to SI (for instance, section E on ageing is very short and seem more suited for supplements, or at least not as an independent section, note that the figure numbering also jumps to Fig. 9 there, although it’s Fig. 3 just before and Fig. 9 just after - re-ordering into main and supporting figures would be clearer.

      We thank the Reviewer for this recommendation. The ageing section, although is short, it does provide a line of evidence that equilibrium approaches could be valid. We have modestly expanded the section by moving Appendix D to the main text, a general suggestion made by Referee 1. We have tried to be consistent in the numbering of figures in the revision.

      Reviewer #3

      I am very much in favor of the manuscript in its present form - I only suggest commenting (in the manuscript) on the issue described below.

      Motivated by the fact that the experimental system consists of living, motile cells the authors use an active particle model (eq. 6) with stochastic selfpropulsion as the only source for noise (zero-temperature). It would be useful to elaborate briefly how important this stochastic self-propulsion is for the emergent rheological properties of the system (as summarized above): would these properties also be present in the “passive” version of the same model at “non-vanishing” temperature, and if not, why? Or analogously in a “passive” version which is “shaken”, reminiscent of shaken granular matter? To clarify these issues would relate this study to (or discriminate it from) passive, but complex, liquids or granular matter.

      We appreciate the reviewer’s positive feedback on our work. The reviewer has raised an important question concerning our model in which self-propulsion serves as the source of noise. Without self-propulsion, the system would come to a stationary state after reaching mechanical equilibrium. As mentioned in Eqn. (6) (in the main text), we can define a characteristic time . It is possible that scaling the time t by τ would not alter the results.

      The second question raised by the reviewer is also important. A passive version of the model would be to consider Eq. 6 in our article, and instead of using activity use the standard stochastic force. The resulting force would be at a finite temperature,. The coefficient of noise (a diffusion term) would be related to γi through the Fluctuation dissipation theorem(FDT)). Such a system of equations cannot ne mapped to Eq. 6 in which µ and γi are independently varied. It is unlikely that such a model, incorporating a “non-vanishing” temperature, would not result in the observed dependence of η on ϕ for the following reason. The passive model represents a polydisperse system, which would form a glass with η increasing with volume fraction, following the VFT law, as has been demonstrated in the glass transition literature for harmonic glasses. The other proposal whether the shaken version version would explain the experiments is also interesting. These are worth pursuing in future studies.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Zanetti et al use biophysical and cellular assays to investigate the interaction of the birnavirus VP3 protein with the early endosome lipid PI3P. The major novel finding is that association of the VP3 protein with an anionic lipid (PI3P) appears to be important for viral replication, as evidenced through a cellular assay on FFUs.

      Strengths:

      Support previously published claims that VP3 associates with early endosome membrane, potentially through binding to PI3P. The finding that mutating a single residue (R200) critically affects early endosome binding and that the same mutation also inhibits viral replication suggests a very important role for this binding in the viral life cycle.

      Weaknesses:

      The manuscript is relatively narrowly focused: the specifics of the bi-molecular interaction between the VP3 of an unusual avian virus and a host cell lipid (PIP3). Further, the affinity of this interaction is low and its specificity relative to other PIPs is not tested, leading to questions about whether VP3-PI3P binding is relevant.

      Regarding the manuscript’s focus, we challenge the notion that studying a single bi-molecular interaction makes the scope of the paper overly narrow. This interaction—between VP3 and PI3P—plays a critical role in the replication of the birnavirus, which is the central theme of our work. Moreover, identifying and understanding such distinct interactions is a fundamental aspect of molecular virology, as they shed light on the precise mechanisms that viruses exploit to hijack the host cell machinery. Consequently, far from being narrowly focused, we believe our work contributes to the broader understanding of host-pathogen interactions.

      As for the low affinity of the VP3-PI3P interaction, we argue that this is not a limitation but rather a biologically relevant feature. As discussed in the manuscript, the moderate strength of this interaction is likely critical for regulating the turnover rate of VP3/endosomal PI3P complexes, which in turn could optimize viral replication efficiency. A stronger affinity might trap VP3 on the endosomal membrane, whereas weaker interactions might reduce its ability to efficiently target PI3P. Thus, the observed affinity may reflect a fine-tuned balance that supports the viral life cycle.

      With regard to specificity, we emphasize that in the context of the paper, we refer to biological specificity, which is not necessarily the same as chemical specificity. The binding of PI3P to early endosomes is “biologically” preconditioned by the distribution of PI3P within the cell. PI3P is predominantly localized in endosomal membranes, which “biologically precludes” interference from other PIPs due to their distinct cellular distributions. Moreover, while early endosomes also contain other anionic lipids, our work demonstrates that among these, PI3P plays a distinctive role in VP3 binding. This highlights its functional relevance in the context of early endosome dynamics.

      Reviewer #3 (Public review):

      Summary:

      Infectious bursal disease virus (IBDV) is a birnavirus and an important avian pathogen. Interestingly, IBDV appears to be a unique dsRNA virus that uses early endosomes for RNA replication that is more common for +ssRNA viruses such as for example SARS-CoV-2. This work builds on previous studies showing that IBDV VP3 interacts with PIP3 during virus replication. The authors provide further biophysical evidence for the interaction and map the interacting domain on VP3.

      Strengths:

      Detailed characterization of the interaction between VP3 and PIP3 identified R200D mutation as critical for the interaction. Cryo-EM data show that VP3 leads to membrane deformation.

      We thank the reviewer for the feedback.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Zanetti et al. use biophysical and cellular assays to investigate the interaction of the birnavirus VP3 protein with the early endosome lipid PI3P. The major novel finding is that the association of the VP3 protein with an anionic lipid (PI3P) appears to be important for viral replication, as evidenced through a cellular assay on FFUs.

      Strengths:

      Supports previously published claims that VP3 may associate with early endosomes and bind to PI3P-containing membranes. The claim that mutating a single residue (R<sub>200</sub>) critically affects early endosome binding and that the same mutation also inhibits viral replication suggests a very important role for this binding in the viral life cycle.

      Weaknesses:

      The manuscript is relatively narrowly focused: one bimolecular interaction between a host cell lipid and one protein of an unusual avian virus (VP3-PI3P). Aspects of this interaction have been described previously. Additional data would strengthen claims about the specificity and some technical issues should be addressed. Many of the core claims would benefit from additional experimental support to improve consistency.

      Indeed, our group has previously described aspects of the VP3-PI3P interaction, as indicated in lines 100-105 from the manuscript. In this manuscript, however, we present biochemical and biophysical details that have not been reported before about how VP3 connects with early endosomes, showing that it interacts directly with the PI3P. Additionally, we have now identified a critical residue in VP3—the R<sub>200</sub>—for binding to PI3P and its key role in the viral life cycle. Furthermore, the molecular dynamics simulations helped us come up with a mechanism for VP3 to connect with PI3P in early endosomes. This constitutes a big step forward in our understanding of how these "non-canonical" viruses replicate.

      We have now incorporated new experimental and simulation data; and have carefully revised the manuscript in accordance with the reviewers’ recommendations. We are confident that these improvements have further strengthened the manuscript.

      Reviewer #2 (Public Review):

      Summary:

      Birnavirus replication factories form alongside early endosomes (EEs) in the host cell cytoplasm. Previous work from the Delgui lab has shown that the VP3 protein of the birnavirus strain infectious bursal disease virus (IBDV) interacts with phosphatidylinositol-3-phosphate (PI3P) within the EE membrane (Gimenez et al., 2018, 2020). Here, Zanetti et al. extend this previous work by biochemically mapping the specific determinants within IBDV VP3 that are required for PI3P binding in vitro, and they employ in silico simulations to propose a biophysical model for VP3-PI3P interactions.

      Strengths:

      The manuscript is generally well-written, and much of the data is rigorous and solid. The results provide deep knowledge into how birnaviruses might nucleate factories in association with EEs. The combination of approaches (biochemical, imaging, and computational) employed to investigate VP3-PI3P interactions is deemed a strength.

      Weaknesses:

      (1) Concerns about the sources, sizes, and amounts of recombinant proteins used for co-flotation: Figures 1A, 1B, 1G, and 4A show the results of co-flotation experiments in which recombinant proteins (control His-FYVE v. either full length or mutant His VP3) were either found to be associated with membranes (top) or non-associated (bottom). However, in some experiments, the total amounts of protein in the top + bottom fractions do not appear to be consistent in control v. experimental conditions. For instance, the Figure 4A western blot of His-2xFYVE following co-flotation with PI3P+ membranes shows almost no detectable protein in either top or bottom fractions.

      Liposome-based methods, such as the co-flotation assay, are well-established and widely regarded as the preferred approach for studying protein-phosphoinositide interactions. However, this approach is rather qualitative, as density gradient separation reveals whether the protein is located in the top fractions (bound to liposomes) or the bottom fractions (unbound). Our quantifications aim to demonstrate differences in the bound fraction between liposome populations with and without PI3P. Given the setting of the co-flotation assays, each protein-liposome system [2xFYVE-PI3P(-), 2xFYVE-PI3P(+), VP3-PI3P(-), or VP3-PI3P(+)] is assessed separately, and even if the experimental conditions are homogeneous, it is not surprising to observe differences in the protein level between different experiments. Indeed, the revised version of the manuscript includes membranes with more similar band intensities, as depicted in the new versions of Figures 1 and 4.

      Reading the paper, it was difficult to understand which source of protein was used for each experiment (i.e., E. coli or baculovirus-expressed), and this information is contradicted in several places (see lines 358-359 v. 383-384). Also, both the control protein and the His-VP3-FL proteins show up as several bands in the western blots, but they don't appear to be consistent with the sizes of the proteins stated on lines 383-384. For example, line 383 states that His-VP3-FL is ~43 kDa, but the blots show triplet bands that are all below the 35 kDa marker (Figures 1B and 1G). Mass spectrometry information is shown in the supplemental data (describing the different bands for His-VP3-FL) but this is not mentioned in the actual manuscript, causing confusion. Finally, the results appear to differ throughout the paper (see Figures 1B v. 1G and 1A v. 4A).

      Thank you for pointing out these potentially confusing points in the previous version of the manuscript. Indeed, we were able to produce recombinant VP3 from the two sources: Baculovirus and Escherichia coli. Initially, we opted for the baculovirus system, based on evidence from previous studies showing that it was suitable for ectopic expression of VP3. Subsequently, we successfully produced VP3 using Escherichia coli. On the other side, the fusion proteins His-2xFYVE and GST-2xFYVE were only produced in the prokaryotic system, also following previous reported evidence. We confirmed that VP3, produced in either system, exhibited similar behavior in our co-flotation and bio-layer interferometry (BLI) assays. However, the results of co-flotation and BLI assays shown in Figs. 1 and 4 were performed using the His-VP3 FL, His-VP3 FL R<sub>200</sub>D and His-VP3 FL DCt fusion proteins produced from the corresponding baculoviruses. We have clarified this in the revised version of our manuscript. Please, see lines 430-432.

      Additionally, we have made clear that the His-VP3 FL protein purification yielded four distinct bands, and we confirmed their VP3 identity through mass spectrometry in the revised version of the manuscript. Please, see lines 123-124.

      Finally, we replaced membranes for Figs. 4A and 1G (left panel) with those with more similar band intensities. Please, see the new version of Figures 1 and 4.

      (2) Possible "other" effects of the R<sub>200</sub>D mutation on the VP3 protein. The authors performed mutagenesis to identify which residues within patch 2 on VP3 are important for association with PI3P. They found that a VP3 mutant with an engineered R<sub>200</sub>D change (i) did not associate with PI3P membranes in co-floatation assays, and (ii) did not co-localize with EE markers in transfected cells. Moreover, this mutation resulted in the loss of IBDV viability in reverse genetics studies. The authors interpret these results to indicate that this residue is important for "mediating VP3-PI3P interaction" (line 211) and that this interaction is essential for viral replication. However, it seems possible that this mutation abrogated other aspects of VP3 function (e.g., dimerization or other protein/RNA interactions) aside from or in addition to PI3P binding. Such possibilities are not mentioned by the authors.

      The arginine amino acid at position 200 of VP3 is not located in any of the protein regions associated with its other known functions: VP3 has a dimerization domain located in the second helical domain, where different amino acids across the three helices form a total of 81 interprotomeric close contacts; however, R<sub>200</sub> is not involved in these contacts (Structure. 2008 Jan;16(1):29-37, doi:10.1016/j.str.2007.10.023); VP3 has an oligomerization domain mapped within the 42 C-terminal residues of the polypeptide, i.e., the segment of the protein composed by the residues at positions 216-257 (J Virol. 2003 Jun;77(11):6438–6449, doi: 10.1128/jvi.77.11.6438-6449.2003); VP3’s ability to bind RNA is facilitated by a region of positively-charged amino acids, identified as P1, which includes K<sub>99</sub>, R<sub>102</sub>, K<sub>105</sub>, and K<sub>106</sub> (PLoS One. 2012;7(9):e45957, doi: 10.1371/journal.pone.0045957). Furthermore, our findings indicate that the R<sub>200</sub>D mutant retains a folding pattern similar to the wild-type protein, as shown in Figure 4B. All these lead us to conclude that the loss of replication capacity of R<sub>200</sub>D viruses results from impaired, or even loss of, VP3-PI3P interaction.

      We agree with the reviewer that this is an important point and have accordingly addressed it in the Discussion section of the revised manuscript. Please, see lines 333-346.

      (3) Interpretations from computational simulations. The authors performed computational simulations on the VP3 structure to infer how the protein might interact with membranes. Such computational approaches are powerful hypothesis-generating tools. However, additional biochemical evidence beyond what is presented would be required to support the authors' claims that they "unveiled a two-stage modular mechanism" for VP3-PI3P interactions (see lines 55-59). Moreover, given the biochemical data presented for R<sub>200</sub>D VP3, it was surprising that the authors did not perform computational simulations on this mutant. The inclusion of such an experiment would help tie together the in vitro and in silico data and strengthen the manuscript.

      We acknowledge that the wording used in the previous version of the manuscript may have overstated the "unveiling" of the two-stage binding mechanism of VP3. Our intention was to propose a potential mechanism, that is consistent both with the biophysical experiments and the molecular simulations. In the revised version of the manuscript, we have tempered these claims and framed them more appropriately.

      Regarding the simulations for the R<sub>200</sub>D VP3 mutant, these simulations were indeed performed and included in the original manuscript as part of Figure S14 in the Supplementary Information. However, we realize that this was not sufficiently emphasized in the main text, an oversight on our part. We have now revised the manuscript to highlight these results more clearly.

      Additionally, to further strengthen the connection between experimental and simulation trends, we have now included a new figure in the Supplementary Information (Figure S15). This figure depicts the binding energy of VP3 ΔNt and two of its mutants, VP3 ΔNt R<sub>200</sub>D and VP3 ΔNt P2 Mut, as a function of salt concentration. The results show that as the number of positively charged residues in VP3 is systematically reduced, the binding of the protein to the membrane becomes weaker. The effect is more pronounced at lower salt concentrations, which highlights the weight of electrostatic forces on the adsorption of VP3 on negatively charged membranes. Please, see Supplementary Information (Figure S15).

      Reviewer #3 (Public Review):

      Summary:

      Infectious bursal disease virus (IBDV) is a birnavirus and an important avian pathogen. Interestingly, IBDV appears to be a unique dsRNA virus that uses early endosomes for RNA replication that is more common for +ssRNA viruses such as for example SARS-CoV-2.

      This work builds on previous studies showing that IBDV VP3 interacts with PIP3 during virus replication. The authors provide further biophysical evidence for the interaction and map the interacting domain on VP3.

      Strengths:

      Detailed characterization of the interaction between VP3 and PIP3 identified R<sub>200</sub>D mutation as critical for the interaction. Cryo-EM data show that VP3 leads to membrane deformation.

      Weaknesses:

      The work does not directly show that the identified R<sub>200</sub> residues are directly involved in VP3-early endosome recruitment during infection. The majority of work is done with transfected VP3 protein (or in vitro) and not in virus-infected cells. Additional controls such as the use of PIP3 antagonizing drugs in infected cells together with a colocalization study of VP3 with early endosomes would strengthen the study.

      In addition, it would be advisable to include a control for cryo-EM using liposomes that do not contain PIP3 but are incubated with HIS-VP3-FL. This would allow ruling out any unspecific binding that might not be detected on WB.

      The authors also do not propose how their findings could be translated into drug development that could be applied to protect poultry during an outbreak. The title of the manuscript is broad and would improve with rewording so that it captures what the authors achieved.

      In previous works from our group, we demonstrated the crucial role of the VP3 P2 region in targeting the early endosomal membranes and for viral replication, including the use of PI3K inhibitors to deplete PI3P, showing that both the control RFP-2xFYVE and VP3 lost their ability to associate with the early endosomal membranes and reduces the production of an infective viral progeny (J Virol. 2018 May 14;92(11):e01964-17, doi: 10.1128/jvi.01964-17; J Virol. 2021 Feb 24;95(6):e02313-20, doi: 10.1128/jvi.02313-20). In the present work, to further characterize the role of R<sub>200</sub> in binding to early endosomes and for viral replication, we show that: i) the transfected VP3 R<sub>200</sub>D protein loses the ability to bind to early endosomes in immunofluorescence assays (Figure 2E and Figure 3); ii) the recombinant His-VP3 FL R<sub>200</sub>D protein loses the ability to bind to liposomes PI3P(+) in co-flotation assays (Figure 4A); and, iii) the mutant virus R<sub>200</sub>D loses replication capacity (Figure 4C).

      Regarding the cryo-electron microscopy observation, we verified that there is no binding of gold particles to liposomes PI3P(-) when they are incubated solely with the gold-particle reagent, or when they are pre-incubated with the gold-particle reagent with either His-2xFYVE or His-VP3 FL. We have incorporated a new panel in Figure 1C showing a representative image of these results. Please, see lines 143-144 in the revised version of our manuscript and our revised version of Figure 1C.

      We have replaced the title of the manuscript by a more specific one. Thus, our current is " On the Role of VP3-PI3P Interaction in Birnavirus Endosomal Membrane Targeting".

      Regarding the question of how our findings could be translated into drug development, indeed, VP3-PI3P binding constitutes a good potential target for drugs that counteract infectious bursal disease. However, we did not mention this idea in the manuscript, first because it is somewhat speculative and second because infected farms do not implement any specific treatment. The control is based on vaccination.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Critical issues to address:

      (1) The citations in the important paragraph on lines 101-5 are not identifiable. These references are described as showing that VP3 is associated with EEs via P2 and PI3P, which is basically what this paper also shows. The significant advance here is unclear.

      We apologize for this mistake. These citations are identifiable in the revised version of the manuscript (lines 100-105). As mentioned before, in this manuscript we present biochemical and biophysical details that have not been reported before about how VP3 connects with early endosomes, showing that it interacts directly with the PI3P. Additionally, we have now identified a critical residue in VP3 P2—the R<sub>200</sub>—for binding to PI3P and its key role in the viral life cycle. Furthermore, the molecular dynamics simulations helped us come up with a mechanism for VP3 to connect with PI3P in early endosomes. This constitutes a big step forward in our understanding of how these "non-canonical" viruses replicate.

      (2) Even if all the claims were to be clearly supported through major revamping, authors should make the significance of knowing that this protein binds to early endosomes through PI3P more clear?

      Thank you for the recommendation, which aligns with a similar suggestion from Reviewer #2. In response, we have revised the significance paragraph to emphasize the mechanistic aspects of our findings. Please refer to lines 62–67 in the revised manuscript.

      (3) Flotation assay shows binding, but this is not quantitative. An estimate of a Kd would be useful. BLI experiments suggest that half of the binding disappears at 0.5 mM, implying a very low binding affinity.

      We agree with the reviewer that our biophysical and molecular simulation results suggest a specific but weak interaction of VP3 with PI3P bearing membranes. Indeed, our previous version of the manuscript already contained a paragraph in this regard. Please, see lines 323-332 in the revised version of the manuscript.

      From a biological point of view, a low binding affinity of VP3 for the endosomes may constitute an advantage for the virus, in the sense that its traffic through the endosomes may be short lived during its infectious cycle. Indeed, VP3 has been demonstrated to be a "multifunctional" protein involved in several processes of the viral cycle (detailed in lines 84-90), and in our laboratory we have shown that the Golgi complex and the endoplasmic reticulum are organelles where further viral maturation occurs. Taking all of this into account, a high binding affinity of VP3 for endosomes could result in the protein becoming trapped on the endosomal membrane, potentially hindering the progression of the viral infection within the host cell.

      (4) There are some major internal inconsistencies in the data: Figure 1B quantifies VP3-FL T/B ratio ~4 (which appears inconsistent with the image shown, as the T lanes are much lighter than the B) whereas apparently the same experiment in Figure 1G shows it to be ~0.6. With the error bars shown, these results would appear dramatically different from each other, despite supposedly measuring the same thing. The same issue with the FYVE domain between Figures 1A and 4A.

      We appreciate the reviewer’s comment, as it made us aware of an error in Figure 1B. There, the mean value for the VP3-FL Ts/B ratio is 3.0786 for liposomes PI3P(+) and 0.4553 for liposomes PI3P(-) (Please, see the new bar graph on Figure 1B). This may have occurred because, due to the significance of these experiments, we performed multiple rounds of quantification in search of the most suitable procedure for our observations, leading to a mix-up of data sets. Anyway, it’s possible that these corrected values still seem inconsistent given that T lanes are much lighter than the B for VP3-FL in the image shown. Flotation assays are quite labor-intensive and, at least in our experience, yield fairly variable results in terms of quantification. To illustrate this point, the following image shows the three experiments conducted for Figure 1B, where it is clear that, despite producing visually distinct images, all three yielded the same qualitative observation. For Figure 1B, we chose to present the results from experiment #2. However, all three experiments contributed to a Ts/B ratio of 3.0786 for His-VP3 FL, which may account for the apparent inconsistency when focusing solely on the image in Figure 1B.

      Author response image 1.

      We acknowledge that, at first glance, some inconsistencies may appear in the results, and we have thoroughly discussed the best approach for quantification. However, we believe the observations are robust in terms of reproducibility and reliable, as the VP3-PI3P interaction was consistently validated by comparison with liposomes lacking PI3P, where no binding was observed.

      (5) Comparison of PA (or PI) to PI3P at the same molar concentration is inappropriate because PI3P has at least double charge. The more interesting question about specificity would be whether PI45P2 (or even better PI35P2) binds or not. Without this comparison, no claim to specificity can be made.

      For us, "specificity" refers to the requirement of a phosphoinositide in the endosomal membrane for VP3 binding. Phosphoinositides have a conspicuous distribution among cellular compartments, and knowing that VP3 associates with early endosomes, our specificity assays aimed to demonstrate that PI3P is strictly required for the binding of VP3. To validate this, we used PI (lacking the phosphate group) and PA (lacking the inositol group) despite their similar charges. In spite of the potential chemical interactions between VP3 and various phosphoinositides, our experimental results suggest that the virus specifically targets endosomal membranes by binding to PI3P, a phosphoinositide present only in early endosomes.

      That said, we agree with the reviewer’s point and consider adequate to smooth our specificity claim in the manuscript as follows: “We observed that His-VP3 FL bound to liposomes PI3P(+), but not to liposomes PA or PI, reinforcing the notion that a phosphoinositide is required since neither a single negative charge nor an inositol ring are sufficient to promote VP3 binding to liposomes (SI Appendix, Fig. S2)” (Lines 136-139).

      (6) In the EM images, many of the gold beads are inside the vesicles. How do they cross the membranes?

      They do not cross the membrane. Our EM images are two-dimensional projections, meaning that the gold particles located on top or beneath the plane appear to be inside the liposome.

      (7) Images in Figure 2D are very low quality and do not show the claimed difference between any of the mutants. All red signal looks basically cytosolic in all images. It is not clear what criteria were used for the quantification in Figure 2E. The same issue is in Figure 2E, where no red WT puncta are observable at all. Consistently, there is minimal colocalization in the quantification in Figure S3, which appears to show no significant differences between any of the mutants, in direct contradiction to the claim in the manuscript.

      We apologize for the poor quality of panels in Figures 2D and 2E. Unfortunately, this was due to the PDF conversion of the original files. Please, check the high-quality version of Figure 2. As suggested by reviewers #2 and #3, we have incorporated zoomed panels, which help the reader to better see the differences in distribution.

      As mentioned in the legend to Figure 2, the quantification in Figure 2D was performed by calculating the percentage of cells with punctuated fluorescent red signal (showing VP3 distribution) for each protein. The data were then normalized to the P2 WT protein, which is the VP3 wild type.

      Figure S3 certainly shows a tendency which positively correlates with the results shown in Figure 3, where we used FYVE to detect PI3P on endosomes and observed significantly less co-localization when VP3 bears its P2 region all reversed or lacks the R<sub>200</sub>

      (8) The only significant differences in colocalization are in Figure 3B, whose images look rather dramatically different from the rest of the manuscript, leading to some concern about repeatability. Also, it is unclear how colocalization is quantified, but this number typically cannot be above 1. Finally, it is unclear what is being colocalized here: with three fluorescent components, there are 3 possible binary colocalizations and an additional ternary colocalization.

      We thank the reviewer for pointing out those aspects related to Figure 3. The experiments performed for Figure 3B were conducted by a collaborator abroad handling the purified GST-2xFYVE, which recognizes endogenous PI3P, while the rest of the cell biology experiments were conducted in our laboratory in Argentina. This is why they are aesthetically different. We have made an effort in homogenizing the way they look for the revised version of the manuscript. Please, see the new version of Figure 3.

      For quantification of the co-localization of VP3 and EGFP-2xFYVE (Figure 3A), the Manders M2 coefficient was calculated out of approximately 30 cells per construct and experiment. The M2 coefficient, which reflects co-localization of signals, is defined as the ratio of the total intensities of magenta image pixels for which the intensity in the blue channel is above zero to the total intensity in the magenta channel. JACoP plugin was utilized to determine M2. For VP3 puncta co-distributing with EEA1 and GST-FYVE (Figure 3B), the number of puncta co-distributing for the three signals was manually determined out of approximately 40 cells per construct and experiment per 200 µm². We understand that Manders or Pearson coefficients, typically ranging between 0 and 1, is the most commonly used method to quantify co-localizing immunofluorescent signals; however, this “manual” method has been used and validated in previous published manuscripts [Figures 3 and 7 from (Morel et al., 2013); Figure 7 in (Khaldoun et al., 2014); and Figure 4 in (Boukhalfa et al., 2021)].

      (9) SegA/B plasmids are not introduced, and it is not clear what these are or how this assay is meant to work. Where are the foci forming units in the images of Figure 4C? How does this inform on replication? Again, this assay is not quantitative, which is essential here: does the R<sub>200</sub> mutant completely kill activity (whatever that is here)? Or reduce it somewhat?

      We apologize for the missing information. Segments A and B are basically the components of the IBDV reverse genetics system. For their construction, we used a modification of the system described by Qi and coworkers (Qi et al., 2007), in which the full length sequences of the IBDV RNA segments A and B, flanked by a hammerhead ribozyme at the 5’-end and the hepatitis delta ribozyme at the 3’-end, were expressed under the control of an RNA polymerase II promoter within the plasmids pCAGEN.Hmz.SegA.Hdz (SegA) and pCAGEN.Hmz.SegB.Hdz (SegB). For this specific experiment we generated a third plasmid, pCAGEN.Hmz.SegA.R<sub>200</sub>D.Hdz (SegA.R<sub>200</sub>D), harboring a mutant version of segment A cDNA containing the R<sub>200</sub>D substitution. Then, QM7 cells were transfected with the plasmids SegA, SegB or Seg.R<sub>200</sub>D alone (as controls) or with a mixture of plasmids SegA+SegB (wild type situation) or SegA.R<sub>200</sub>D+SegB (mutant situation). At 8 h post transfection (p.t.), when the new viruses have been able to assemble starting from the two segments of RNA, the cells were recovered and re-plated onto fresh non-transfected cells for revealing the presence (or not) of infective viruses. At 72 h post-plating, the generation of foci forming units (FFUs) was revealed by Coomassie staining. As expected, single-transfections of SegA, SegB or Seg.R<sub>200</sub>D did not produce FFUs and, as shown in Figure 4C, the transfection of SegA+SegB produced detectable FFUs (the three circles in the upper panel) while no FFUs (the three circles in the lower panel) were detected after the transfection of SegA.R<sub>200</sub>D+SegB (Figure 4C). This system is quantitative, since the FFUs detected 72 h post-plating are quantifiable by simply counting the FFUs. However, since no FFUs were detected after the transfection of SegA.R<sub>200</sub>D+SegB, evidenced by a complete monolayer of cells stained blue, we did not find any sense in quantifying. In turn, this drastic observation indicates that viruses bearing the VP3 R<sub>200</sub>D mutation lose their replication ability (is “dead”), demonstrating its crucial role in the infectious cycle.

      We agree with the reviewer that a better explanation was needed in the manuscript, so we have incorporated a paragraph in the results section of our revised version of the manuscript (lines 209-219).

      (10) Why pH 8 for simulation?

      The Molecular Theory calculations were performed at pH 8 for consistency with the experimental conditions used in our biophysical assays. These biophysical experiments were also performed at pH 8, following the conditions established in the original study where VP3 was first purified for crystallization (DOI: 10.1016/j.str.2007.10.023).

      (11) There is minimal evidence for the sequential binding model described in the abstract. The simulations do not resolve this model, nor is truly specific PI3P binding shown.

      In response to your concerns, we would like to emphasize that our simulations provide robust evidence supporting the two more important aspects of the sequential binding model: 1) Membrane Approach: In all simulations, VP3 consistently approaches the membrane via its positively charged C-terminal (Ct) region. 2) PI3P Recruitment: Once the protein is positioned flat on the membrane surface, PI3P is unequivocally recruited to the positively charged P2 region. The enrichment of PI3P in the proximity to the protein is clearly observed and has been quantified via radial distribution functions, as detailed in the manuscript and supplementary material.

      While we understand that opinions may vary on the sufficiency of the data to fully validate the model, we believe the results offer meaningful insights into the proposed binding mechanism. That said, we acknowledge that the specificity of VP3 binding may not be restricted solely to PI3P but could extend to phosphoinositides in general. To address this, we performed the new set of co-flotation experiments which are discussed in detail in our response to point 5.

      Reviewer #2 (Recommendations For The Authors):

      (1) Line 1: Consider changing the title to better reflect the mostly biochemical and computational data presented in the paper: "Mechanism of Birnavirus VP3 Interactions with PI3P-Containing Membranes". There are no data to show hijacking by a virus presented.

      We appreciate this recommendation, which was also expressed by reviewer #3. Additionally, we thank for the suggested title. We have replaced the title of the manuscript by a more specific one. Thus, our current is

      "On the Role of VP3-PI3P Interaction in Birnavirus Endosomal Membrane Targeting".

      (2) Lines 53-54 and throughout: Consider rephrasing "demonstrate" to "validate" to give credit to Gimenez et al., 2018, 2022 for discovery.

      Thanks for the suggestion. We have followed it accordingly. Please see line 52 from our revised version of the manuscript.

      (3) Line 56-59 and throughout: Consider tempering and rephrasing these conclusions that are based mostly on computational data. For example, change "unveil" to "suggest" or another term.

      We have now modified the wording throughout the manuscript.

      (4) The abstract could also emphasize that this study sought to map the resides within VP3 that are important for P13P interaction.

      Thanks for the suggestion. We have followed it accordingly. Please, see lines 53-55 from our revised version of the manuscript.

      (5) Lines 63-69: This Significance paragraph seems tangential. The findings in this paper aren't at all related to the evolutionary link between birnaviruses and positive-strand RNA viruses. The significance of the work for me lies in the deep biochemical/biophysical insights into how a viral protein interacts with membranes to nucleate its replication factory.

      We have re-written the significance paragraph highlighting the mechanistic aspect of our findings. Please, see lines 62-67 in our revised version of the manuscript.

      (6) Line 74: Please define "IDBV" abbreviation.

      We apologize for the missing information. We have defined the IBDV abbreviation in our revised version of the manuscript (please, see line 73).

      (7) Line 88: Please define "pVP2" abbreviation.

      We apologize for the missing information. We have defined the pVP2 abbreviation in our revised version of the manuscript (please, see line 87).

      (8) Lines 101-105: Please change references (8, 9, 10) to be consistent with the rest of the manuscript (names, year).

      We apologize for this mistake. These citations are identifiable and consistent in the revised version of the manuscript (lines 100-105).

      (9) Line 125: For a broad audience, consider explaining that recombinant His-2xFYVE domain is known to exhibit PI3P-binding specificity and was used as a positive control.

      Thanks for the recommendation. We have incorporated a brief explanation supporting the use of His-2xFYVE as a positive control in our revised version of the manuscript. Please, see lines 127-129.

      (10) Lines 167-171: The quantitative data in Figure S3 shows that there was a non-significant co-localization coefficient of the R<sub>200</sub>D mutant. For transparency, this should be stated in the Results section when referenced.

      We agree with this recommendation. We have clearly mentioned it in the revised version of the manuscript. Please, see lines 177-179. Also, we have referred this fact when introducing the assays performed using the purified GST-2xFYVE, shown in Figure 3. Please, see lines 182-184.

      (11) Lines 156 and 173: These Results section titles have nearly identical wording. Consider rephrasing to make it distinct.

      We agree with the reviewer’s observation. In fact, we sought to do it on purpose as for them to be a “wordplay”, but we understand that could result in a awkwarded redundancy. So, in the revised version of the manuscript, both titles are:

      Role of VP3 P2 in the association of VP3 with the EE membrane (line 163).

      VP3 P2 mediates VP3-PI3P association to EE membranes (line 182).

      (12) Line 194: Is it alternatively possible that the R<sub>200</sub>D mutant lost its capacity to dimerize, and that in turn impacted PI3P interaction?

      Thanks for the relevant question. VP3 was crystallized and its structure reported in (Casañas et al., 2008) (DOI: 10.1016/j.str.2007.10.023). In that report, the authors showed that the two VP3 subunits associate in a symmetrical manner by using the crystallographic two-fold axes. Each subunit contributes with its 30% of the total surface to form the dimer, with 81 interprotomeric close contacts, including polar bonds and van der Waals contacts. The authors identified the group of residues involved in these interactions, among which the R<sub>200</sub> is not included. Addittionally, the authors determined that the interface of the VP3 dimer in crystals is biologically meaningful (not due to the crystal packing).

      To confirm that the lack of binding was not due to misfolding of the mutant, we compared the circular dichroism spectra of mutant and wild type proteins, without detecting significant differences (shown in Figure 4B). These observations do not exclude the possibility mentioned by the reviewer, but constitute solid evidences, we believe, to validate our observations.

      (13) Lines 231-243: Consider changing verbs to past tense (i.e., change "is" to "was") for the purposes of consistency and tempering.

      Thanks for the recommendation, we have proceeded as suggested. Please, see lines 249-262 in our revised version of the manuscript.

      (14) Lines 306-308: Is there any information about whether it is free VP3 (v. VP3 complexed in RNP) that binds to membrane? I am just trying to wrap my head around how these factories form during infection.

      Thanks for pointing this out. We first observed that in infected cell, all the components of the RNPs [VP3, VP1 (the viral polymerase) and the dsRNA] were associated to the endosomes. Since by this moment it had been already elucidated that VP3 "wrapped" de dsRNA within the RNPs (Luque et al., 2009) (DOI: 10.1016/j.jmb.2008.11.029), we sought that VP3 was most probably leading this association. We answered yes after studying its distribution, also endosome-associated, when ectopically expressed. These results were published in (Delgui et al., 2013) (DOI: 10.1128/jvi.03152-12).

      Thus, in our subsequent studies, we have worked with both, the infection-derived or the ectopically expressed VP3, to advance in elucidating the mechanism by which VP3 hijacks the endosomal membranes and its relevancy for viral replication, reported in this current manuscript.

      (15) Lines 320-334: This last paragraph discussing evolutionary links between birnaviruses and positive-strand RNA viruses seems tangential and distracting. Consider reducing or removing.

      Thanks for highlighting this aspect of our work. Maybe difficult to follow, but in the context of other evidences reported for the Birnaviridae family of viruses, we strongly believe that there is an evolutionary aspect in having observed that these dsRNA viruses replicate associated to membranous organelles, a hallmark of +RNA viruses. However, we agree with the reviewer that this might not be the main point of our manuscript, so we reduced this paragraph accordingly. Please, see lines 358-367 in our revised version of the manuscript.

      (16) Lines 322-324: Change "RdRd" to "RdRp" if keeping paragraph.

      Thanks. We have corrected this mistake in lines 360 and 361.

      (17) Figures 1A, 1B, and throughout: Again, please check and explain protein sizes and amounts. This would improve the clarity of the manuscript.

      All our flotation assays were performed using 1 mM concentration of purified protein in a final volume of 100 mL (mentioned in M&M section). The complete fusion protein His-2xFYVE (shown in Figs. 1A and 4A left panel) is 954 base pairs-long and contains 317 residues (~35 kDa). The complete fusion protein His-VP3 FL (shown in Figs. 1B and 1G left panel) is 861 base pairs-long and contains 286 residues (~32 kDa). The complete fusion protein His-VP3 DCt (shown in Fig. 1G, right panel) is 753 bp-long and contains 250 residues (~28 kDa). The complete fusion protein His-VP3 FL R<sub>200</sub>D (shown in Fig. 4A right panel) is 861 bp-long and contains 286 residues (~32 kDa). This latter information was incorporated in our revised version of the manuscript. Please, see lines 381-382, 396-397 and 399-400 from the M&M section, and lines in the corresponding figure legends.

      (18) Figures 1B and 1G show different results for PI3P(+) membranes. I see protein associated with the top fraction in 1B, but I don't see any such result in 1G.

      As already mentioned, liposome-based methods, such as the co-flotation assay, are well-established and widely regarded as the preferred approach for studying protein-phosphoinositide interactions. However, this approach is rather qualitative, as density gradient separation reveals whether the protein is located in the top fractions (bound to liposomes) or the bottom fractions (unbound). Our quantifications aim to demonstrate differences in the bound fraction between liposome populations with and without PI3P. Given the setting of the co-flotation assays, each protein-liposome system [2xFYVE-PI3P(-), 2xFYVE-PI3P(+), VP3-PI3P(-), or VP3-PI3P(+)] is assessed separately, and even if the conditions are homogeneous, it’s not surprising to observe differences in the protein level between each one. Indeed, the revised version of the manuscript include a membrane for Figure 1G, were His-VP3 FL associated with the top fraction is more clear. Please, see the new version of Figure 1G.

      (19) Figure 1C: Please include cryo-EM images of the liposome PI3P(-) variables to assess the visual differences of the liposomal membranes under these conditions.

      Thanks for the recommendation. it has been verified that there is no binding of gold particles to liposomes PI3P(-) when they are incubated solely with the gold-particle reagent, or when they are pre-incubated with the gold-particle reagent with either His-2xFYVE or His-VP3 FL. We have incorporated a new panel in Figure 1C showing a representative image of these results. Please, see lines 143-144 in the revised version of our manuscript and our revised version of Figure 1C.

      (20) Figures 2D, 2E, and 3A: The puncta are not obvious in these images. Consider adding Zoomed panels.

      We apologize for this aspect of Figures 2 and 3, also highlighted by reviewer #1. We believe that this was due to the low quality resulting from the PDF conversion of the original files. For Figure 3A, we have homogenized its aspect with those from 3B. Regarding Figure 2, we have incorporated zoomed panels, as suggested. Please, see the revised versions of both Figures.

      (21) Figure 4A: There is almost no protein in the control PI3P(+) blot. Why? Also, the quantification shows no significant membrane association for this control. This result is different from Figure 1A and very confusing (and concerning).

      We apologize for the confusion. We replaced membranes for Figure 4A (left panel) with more similar band intensities to that shown in Figure 1A. Please, visit our new version of Figure 4. The quantification shows no significant difference in the association to liposomes PI3P(+) compared to liposomes PI3P(+); it’s true and this is due to, once more, the intrinsically lack of homogeneity of co-flotation assays. However, this one shown in Figure 4A is a redundant control (has been shown in Figure 1A) and we believe that the new membrane is qualitative eloquent.

      Reviewer #3 (Recommendations For The Authors):

      (1) Overall, the title is general and does not summarize the study. I recommend making the title more specific. The current title is better suited for a review as opposed to a research article. This study provides further biophysical details on the interaction. This should be reflected in the title.

      We appreciate this recommendation, which was also expressed by reviewer #2. We have chosen a new title for the manuscript: “On the Role of VP3-PI3P Interaction in Birnavirus Endosomal Membrane Targeting”.

      (2) References 8,9,10 are important but they were not correctly cited in the work, this should be corrected.

      We apologize for this mistake. These citations are identifiable in our revised version of the manuscript. See lines 100-105.

      (3) Flotation experiments and cryo-EM convincingly show that VP3 binds to membranes in a PIP3-dependent manner. However, it would be advisable to include a control for cryo-EM using liposomes that do not contain PIP3 but are incubated with HIS-VP3-FL. This would allow us to rule out any unspecific binding that might not be detected on WB.

      Thanks for the advice, also given by reviewer #2. We confirmed that no gold particles were bound on liposomes PI3P(-) even when incubated with the Ni-NTA reagent alone or pre-incubated with His-2xFYVE of His-VP3 FL. We have incorporated a new panel to Figure 1C showing a representative image of these results. Please, see lines 143-144 in the revised version of the manuscript and see the revised version of Figure 1C.

      (4) It is not clear what is the difference between WB in B and WB in G. Figure 1G seems to show the same experiment as shown in B, is this a repetition? In both cases, plots next to WBs show quantification with bars, do they represent STD or SEM? Legend A mentions significance p>0.01 (**) but the plot shows ***. This should be corrected.

      The Western blot membrane in Figure 1B shows the result of co-flotation assay using His-VP3 FL protein, while the Western blot membrane in Figure 1G (left panel) shows a co-flotation assay using His-VP3 FL protein as a positive control. In another words, in 1B the His-VP3 FL protein is the question while in 1G (left panel) it’s the co-flotation positive control for His-VP3 DCt. The bar plots next to Western blots show quantification, the mean and the STD. Thanks for highlighting this inconsistency. We have now corrected it on the revised version of the manuscript.

      (5) It would be useful to indicate positively charged residues and P2 on the AF2 predicted structure in Fig 1.

      These are indicated in panels A and B of Figure 2.

      (6) Figure 1 legend: Change cryo-fixated liposomes to cryo-fixation or better to "liposomes were vitrified". There is a missing "o" in the cry-fixation in the methods section.

      Thanks for the recommendation. We have modified Figure 1. legend to "liposomes were vitrified" (line 758), and fixed the word cryo-fixation in the methods section (line 512).

      (7) Figure 2B. It is not clear how the punctated phenotype was unbiasedly characterized (Figure 2D). I see no difference in the representative images. Magnified images should be shown. This should be measured as colocalization (Pearson's and Mander's coefficient) with an early endosomal marker Rab5. Perhaps this figure could be consolidated with Figure 3.

      Unfortunately, the lack of clarity in Figure 2D was due to the PDF conversion of the original files. Please, observe the high-quality original image above in response to reviewer #1, where we have additionally included zoomed panels, as also suggested by the other reviewers. For quantification of the co-localization of VP3 and either EGFP-Rab5 orEGFP-2xFYVE, the Manders M2 coefficient was calculated out of approximately 30 cells per construct and experiment and were shown in Figure S3 and Figure 3A, respectively, in our previous version of the manuscript.

      (8) PIP3 antagonist drugs should be used to further substantiate the results. If PIP3 specifically recruits VP3, this interaction should be abolished in the presence of PIP3 drug and VP3 should show a diffused signal.

      We certainly agree with this point. These experiments were performed and the results were reported in (Gimenez et al., 2020). Briefly, in that work, we blocked the synthesis of PI3P in QM7 cells in a stable cell line overexpressing VP3, QM7-VP3, with either the pan-PI3Kinase (PI3K) inhibitor LY294002, or the specific class III PI3K Vps34 inhibitor Vps34-IN1. In Figure 4, we showed that 98% of the cells treated with these inhibitors had the biosensor GFP-2FYVE dissociated from EEs, evidencing the depletion of PI3P in EEs (Figure 4A). In QM7-VP3 cells, we showed that the depletion of PI3P by either inhibitor caused the dissociation of VP3 from EEs and the disaggregation of VP3 puncta toward a cytosolic distribution (Figure 4B). Moreover, since this observation was crucial for our hipothesis, these results were further confirmed with an alternative strategy to deplete PI3P in EEs. We employed a system to inducibly hydrolyze endosomal PI3P through rapamycin-induced recruitment of the PI3P-myotubularin 1 (MTM1) to endosomes in cells expressing MTM1 fused to the FK506 binding protein (FKBP) and the rapamycin-binding domain fused to Rab5, using the fluorescent proteins mCherry-FKBP-MTM1 and iRFP-FRB-Rab5, as described in (Hammond et al., 2014). These results, shown in Figures 5, 6 and 7 in the same manuscript, further reinforced the notion that PI3P mediates and is necessary for the association of VP3 protein with EEs.

      (9) The authors should show the localization of VP3 in IBDV-infected cells and treat cells with PI3P antagonists. The fact that R<sub>200</sub> is not rescued does not necessarily mean that this is because of the failed interaction with PI3P. As the authors wrote in the discussion: VP3 bears multiple essential roles during the viral life cycle (line 305).

      Indeed, after having confirmed that the VP3 lost its localization associated to the endosomes after the treatment of the cells with PI3P antagonists, we demonstrated that depletion of PI3P significantly reduced the production of IBDV progeny. For this aim, we used two approaches, the inhibitor Vps34-IN1 and an siRNA against VPs34. In both cases, we observed a significantly reduced production of IBDV progeny (Figures 9 and 10). Specifically related to the reviewer’s question, the localization of VP3 in IBDV-infected cells and treated with PI3P antagonists was shown and quantified in Figure 9a.

      (10) Could you provide adsorption-free energy profiles and MD simulations also for the R<sub>200</sub> mutant?

      Following the reviewer’s suggestion, we have added a new figure to the supplementary information (Figure S15). Instead of presenting a full free-energy profile for each protein, we focused on the adsorption free energy (i.e., the minimum of the adsorption free-energy profile) for VP3 ΔNt and its mutants, VP3 ΔNt R<sub>200</sub>D and VP3 ΔNt P2 Mut, as a function of salt concentration. The aim was to compare the adsorption free energy of the three proteins and evaluate the effect of electrostatic forces on it, which become increasingly screened at higher salt concentrations. As shown in the referenced figure, reducing the number of positively charged residues from VP3 ΔNt to VP3 ΔNt P2 Mut systematically weakens the protein’s binding to the membrane. This effect is particularly pronounced at lower salt concentrations, underscoring the importance of electrostatic interactions in the adsorption of the negatively charged VP3 onto the anionic membrane.

      (11) Liposome deformations in the presence of VP3 are interesting (Figure 6G), were these also observed in Figure 1C?

      Good question. The liposome deformations in the presence of VP3 shown in Figure 6G were a robust observation since, as mentioned, it was detectable in 36% of the liposomes PI3P(+), while they were completely absent in PI3P(-) liposomes. However, and unfortunately, the same deformations were not detectable in experiments performed using gold particles shown in Figure 1C. In this regard, we think that it might be possible that the procedure of gold particles incubation itself, or even the presence of the gold particles in the images, would somehow “mask” the deformations effect.

      Bibliography

      Boukhalfa A, Roccio F, Dupont N, Codogno P, Morel E. 2021. The autophagy protein ATG16L1 cooperates with IFT20 and INPP5E to regulate the turnover of phosphoinositides at the primary cilium. Cell Rep 35:109045. doi:10.1016/j.celrep.2021.109045

      Casañas A, Navarro A, Ferrer-Orta C, González D, Rodríguez JF, Verdaguer N. 2008. Structural Insights into the Multifunctional Protein VP3 of Birnaviruses. Structure 16:29–37. doi:10.1016/j.str.2007.10.023

      Delgui LR, Rodriguez JF, Colombo MI. 2013. The Endosomal Pathway and the Golgi Complex Are Involved in the Infectious Bursal Disease Virus Life Cycle. J Virol 87:8993–9007. doi:10.1128/JVI.03152-12

      Gimenez MC, Issa M, Sheth J, Colombo MI, Terebiznik MR, Delgui LR. 2020. Phosphatidylinositol 3-Phosphate Mediates the Establishment of Infectious Bursal Disease Virus Replication Complexes in Association with Early Endosomes. J Virol 95:e02313-20. doi:10.1128/jvi.02313-20

      Hammond GRV, Machner MP, Balla T. 2014. A novel probe for phosphatidylinositol 4-phosphate reveals multiple pools beyond the Golgi. J Cell Biol 205:113–126. doi:10.1083/jcb.201312072

      Khaldoun SA, Emond-Boisjoly MA, Chateau D, Carrière V, Lacasa M, Rousset M, Demignot S, Morel E. 2014. Autophagosomes contribute to intracellular lipid distribution in enterocytes. Mol Biol Cell 25:118. doi:10.1091/mbc.E13-06-0324

      Luque D, Saugar I, Rejas MT, Carrascosa JL, Rodríguez JF, Castón JR. 2009. Infectious Bursal Disease Virus: Ribonucleoprotein Complexes of a Double-Stranded RNA Virus. J Mol Biol 386:891–901. doi:10.1016/j.jmb.2008.11.029

      Morel E, Chamoun Z, Lasiecka ZM, Chan RB, Williamson RL, Vetanovetz C, Dall’Armi C, Simoes S, Point Du Jour KS, McCabe BD, Small SA, Di Paolo G. 2013. Phosphatidylinositol-3-phosphate regulates sorting and processing of amyloid precursor protein through the endosomal system. Nature Communications 2013 4:1 4:1–13. doi:10.1038/ncomms3250

      Qi X, Gao Y, Gao H, Deng X, Bu Z, Wang Xiaoyan, Fu C, Wang Xiaomei. 2007. An improved method for infectious bursal disease virus rescue using RNA polymerase II system. J Virol Methods 142:81–88. doi:10.1016/j.jviromet.2007.01.021

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The authors aimed to confirm the association between the human leukocyte antigen (HLA)-II region and tuberculosis (TB) susceptibility within admixed African populations. Building upon previous findings from the International Tuberculosis Host Genetics Consortium (ITHGC), this study sought to address the limitations of small sample size and the inclusion of admixed samples by employing the Local Ancestry Allelic Adjusted (LAAA) model, as well as identify TB susceptibility loci in an admixed South African cohort. 

      Strengths: 

      The major strengths of this study include the use of six TB case-control datasets collected over 30 years from diverse South African populations and ADMIXTURE for global ancestry inference. The former represents comprehensive dataset used in this study and the later ensures accurate determination of ancestral contributions. In addition, the identified association in the HLA-DPB1 gene shows near-genomewide significance, enhancing the credibility of the findings. 

      Weaknesses: 

      The major weakness of this study includes insufficient significant discoveries and reliance on crossvalidation. This study only identified one variant significantly associated with TB status, located in an intergenic region with an unclear link to TB susceptibility. Despite identifying multiple lead SNPs, no other variants reached the genome-wide significance threshold, limiting the overall impact of the findings. The absence of an independent validation cohort, with the study relying solely on crossvalidation, is also a major limitation. This approach restricts the ability to independently confirm the findings and evaluate their robustness across different population samples. 

      Appraisal: 

      The authors successfully achieved their aims of confirming the association between the HLA-II region and TB susceptibility in admixed African populations. However, the limited number of significant discoveries, reliance on cross-validation, and insufficient discussion of model performance and SNP significance weaken the overall strength of the findings. Despite these limitations, the results support the conclusion that considering local ancestry is crucial in genetic studies of admixed populations. 

      Impact:  

      The innovative use of the LAAA model and the comprehensive dataset in this study make substantial contributions to the field of genetic epidemiology. 

      Reviewer #2 (Public review): 

      Summary: 

      This manuscript is about using different analytical approaches to allow ancestry adjustments to GWAS analyses amongst admixed populations. This work is a follow-on from the recently published ITHGC multi-population GWAS (https://doi.org/10.7554/eLife.84394), with a focus on the admixed South African populations. Ancestry adjustment models detected a peak of SNPs in the class II HLA DPB1, distinct from the class II HLA DQA1 loci significant in the ITHGC analysis. 

      Strengths: 

      Excellent demonstration of GWAS analytical pipelines in highly admixed populations. Further confirmation of the importance of the HLA class II locus in genetic susceptibility to TB. 

      Weaknesses: 

      Limited novelty compared to the group's previous existing publications and the body of work linking HLA class II alleles with TB susceptibility in South Africa or other African populations. This work includes only ~100 new cases and controls from what has already been published. High-resolution HLA typing has detected significant signals in both the DQA1 and DPB1 regions identified by the larger ITHGC and in this GWAS analysis respectively (Chihab L et al. HLA. 2023 Feb; 101(2): 124-137). Despite the availability of strong methods for imputing HLA from GWAS data (Karnes J et Plos One 2017), the authors did not confirm with HLA typing the importance of their SNP peak in the class II region. This would have supported the importance of this ancestry adjustment versus prior ITHGC analysis. 

      The populations consider active TB and healthy controls (from high-burden presumed exposed communities) and do not provide QFT or other data to identify latent TB infection. 

      Important methodological points for clarification and for readers to be aware of when reading this paper: 

      (1) One of the reasons cited for the lack of African ancestry-specific associations or suggestive peaks in the ITHGC study was the small African sample size. The current association test includes a larger African cohort and yields a near-genome-wide significant threshold in the HLA-DPB1 gene originating from the KhoeSan ancestry. The investigation is needed as to whether the increase in power is due to increased African samples and not necessarily the use of the LAAA model as stated on lines 295 and 296? 

      Thank you for your comment. The Manhattan plot in Figure 3 includes the results for all four models: the traditional GWAS model (GAO), the admixture mapping model (LAO), the ancestry plus allelic (APA) model and the LAAA model. In this figure, it is evident that only the LAAA model identified the association peak on chromosome 6, which lends support the argument that the increase in power is due to the use of the LAAA model and not solely due to the increase in sample size. 

      (2) In line 256, the number of SNPs included in the LAAA analysis was 784,557 autosomal markers; the number of SNPs after quality control of the imputed dataset was 7,510,051 SNPs (line 142). It is not clear how or why ~90% of the SNPs were removed. This needs clarification. 

      Thank you for your recommendation. In our manuscript (line 194), we mention that “…variants with minor allele frequency (MAF) < 1% were removed to improve the stability of the association tests.” A large proportion of imputed variants fell below this MAF threshold, and were subsequently excluded from this analysis. Below, we show the number of imputed variants across MAF bins for one of our datasets [RSA(A)] to substantiate this claim:  

      Author response image 1.

      (3) The authors have used the significance threshold estimated by the STEAM p-value < 2.5x10<sup>-6</sup> in the LAAA analysis. Grinde et al. (2019 implemented their significance threshold estimation approach tailored to admixture mapping (local ancestry (LA) model), where there is a reduction in testing burden. The authors should justify why this threshold would apply to the LAAA model (a joint genotype and ancestry approach). 

      Thank you for your recommendation. We describe in the methods (line 189 onwards) that the LAAA model is an extension of the APA model. Since the APA model itself simultaneously performs the null global ancestry only model and the local ancestry model (utilised in admixture mapping), we thus considered the use of a threshold tailored to admixture mapping appropriate for the LAAA model.  

      (4) Batch effect screening and correction (line 174) is a quality control check. This section is discussed after global and local ancestry inferences in the methods. Was this QC step conducted after the inferencing? If so, the authors should justify how the removed SNPs due to the batch effect did not affect the global and local ancestry inferences or should order the methods section correctly to avoid confusion. 

      Thank you for your comments. The batch effect correction method utilised a pseudo-case-control comparison which included global ancestry proportions. Thus, batch effect correction was conducted after ancestry inference. We excluded 36 627 SNPs that were believed to have been affected by the batch effect. We have amended line 186 to include the exact number of SNPs excluded due to batch effect. 

      The ancestry inference by RFMix utilised the entire merged dataset of 7 510 051 SNPs. Thus, the SNPs removed due to the batch effect make up a very small proportion of the SNPs used to conduct global and local ancestry inferences (less than 0.5%). As a result, we do not believe that the removed SNPs would have significantly affected the global and local ancestry inferences. However, we did conduct global ancestry inference with RFMix on each separate dataset as a sanity check. In the tables below, we show the average global ancestry proportions inferred for each separate dataset, the average global ancestry proportions across all datasets and the average global ancestry proportions inferred using the merged dataset. The SAC and Xhosa cohorts are shown in two separate tables due to the different number of contributing ancestral populations to each cohort. The differences between the combined average global ancestry proportions across the separate cohorts does not differ significantly to the global ancestry proportions inferred using the merged dataset. 

      Author response table 1.

      Comparison of global ancestry proportions across the separate SAC datasets and the merged cohort.

      Author response table 2.

      Comparison of global ancestry proportions in the Xhosa dataset and the merged cohort. 

      Reviewer #1 (Recommendations for the authors): 

      Suggestions for Improved or Additional Experiments, Data, or Analyses:   

      (1) It might be beneficial to consider splitting the data into separate discovery and validation cohorts rather than relying solely on cross-validation. This approach could provide a stronger basis for independently confirming the findings. 

      Thank you for your suggestion. However, we are hesitant to divide our already modest dataset (n=1544) into separate discovery and validation cohorts, as this would reduce the statistical power to detect significant associations.

      (2) Clearly stating the process of cross-validation in the methods section and reporting relevant validation statistics, such as accuracy, sensitivity, specificity, and area under the curve (AUC), would provide a more comprehensive assessment of the model's performance.  

      Thank you for your recommendation. We would like to highlight this article, “GWAS in the southern African context” (1), which evaluated the performance of the LAAA model compared to other models in three- and five-way admixed populations. Given the thorough evaluation of the model’s performance in that study, we did not find it necessary to reassess its performance in this manuscript.   

      (3) Analysing racial cohorts separately to see if you can replicate previous results and find significant markers in combined non-African populations that are not evident in African-only samples might be useful. 

      Thank you for your suggestion. We would like to respectfully note that race is a social construct, and its use as a proxy for genetic ancestry can be problematic (2). In our study, we rather rely on genetic ancestry inferred using ancestry inference software to provide a more accurate representation of our cohort's genetic diversity. Additionally, our cohort consists mostly of a highly admixed population group, with some individuals exhibiting ancestral contributions from up to five different global populations. Therefore, it is not possible to categorize our samples into distinct “Africanonly” or “non-African” groups.

      (4) It might be worthwhile to consider using polygenic risk scores (PRS) to combine multiple genetic influences. This approach could help in identifying cumulative genetic effects that are not apparent when examining individual SNPs.  

      Thank you for your recommendation. While constructing a polygenic risk score (PRS) is beyond the scope of the current study, but an ongoing interest in our group, we recognize its potential value and will consider incorporating this approach in future research endeavours or a separate publication. A recent publication by Majara et al showed that that PRS accuracy is low for all traits and varies across ancestrally and ethnically diverse South African groups (3).

      Recommendations for Improving the Writing and Presentation: 

      Including a more thorough discussion of the methodological limitations, such as the challenges of studying admixed populations and the potential limitations of the LAAA model, would provide a more balanced perspective. 

      Thank you for your suggestion. To provide a more balanced perspective, we included the limitations of our study in the discussion, from line 429 to like 451.

      Minor Corrections to the Text and Figures: 

      Including all relevant statistics would improve clarity. For example, providing confidence intervals for the odds ratios and discussing any observed trends or outliers would be beneficial. 

      Thank you for your recommendation. We have added 95% confidence intervals to all odds ratios reported in Table 3. However, beyond the association peak identified in the HL-II region associated with the phenotype, we do not observe any other trends or outliers in or LAAA analysis.  

      Reviewer #2 (Recommendations for the authors): 

      Points for improvement: 

      (1) Related to the different datasets and inclusions in previous publications, it would also be good to better understand the different numbers of cases and controls included across the previous and current analyses, or discussion thereof. For instance, the RSA(M) dataset includes 555/440 cases/controls for this analysis and only 410/405 cases/controls in the ITHGC analysis. Other discrepancies are noted across the other published datasets compared to those included in this analysis, and these always need to be detailed in a supplement or similar to better understand if this could have introduced bias or was in fact correct based on the additional ancestry-related restriction applied.  

      Thank you for your comments. Table 1 of our manuscript lists number of individuals in the RSA(M) dataset, including related individuals. As described in line 131, related individuals were subsequently excluded during quality control: “Individual datasets were screened for relatedness using KING software (Manichaikul et al., 2010) and individuals up to second degree relatedness were removed.” The ITHGC only reported the number of unrelated individuals included their analyses, which would account for the discrepancies in the reported number of cases and controls.  

      (2) The imbalance between cases and controls in this analysis is quite striking, and it is unusual to have the imbalance favour cases over controls. This contrasts with the ITHGC, where there are substantially more controls. There is no comment on how this could potentially impact this analysis. 

      Thank you for your comment. We have included a note on our case-control imbalance in the discussion:

      “While many studies discuss methods for addressing case-control imbalances with more controls than cases (which can inflate type 1 error rates (Zhou et al. 2018; Dai et al. 2021; Öztornaci et al. 2023), few address the implications of a large case-to-control ratio like ours (952 cases to 592 controls). To assess the impact of this imbalance, we used the Michigan genetic association study (GAS) power calculator (Skol et al. 2006). Under an additive disease model with an estimated prevalence of 0.15, a disease allele frequency of 0.3, a genotype relative risk of 1.5, and a default significance level of 7 × 10<sup>-6</sup>, we achieved an expected power of approximately 75%. With a balanced sample size of 950 cases and 950 controls, power would exceed 90%, but it would drop significantly with a smaller balanced cohort of 590 cases and 590 controls. Given these results, we proceeded with our analysis to maximize statistical power despite the case-control imbalance.” 

      Author response image 2.

      Minor comments 

      (1) Referencing around key points of TB epidemiology and disease states seems out of date, given recent epidemiology reviews and seminal nature or lancet review articles. Please update.  

      Thank you for your suggestion. We have included the following recent publications in the introductory paragraph: 

      Zaidi, S. M. A., Coussens, A. K., Seddon, J. A., Kredo, T., Warner, D., Houben, R. M. G. J., & Esmail, H. (2023). Beyond latent and active tuberculosis: a scoping review of conceptual frameworks. EClinicalMedicine, 66, 102332. https://doi.org/10.1016/j.eclinm.2023.102332

      Menzies, N. A., Swartwood, N., Testa, C., Malyuta, Y., Hill, A. N., Marks, S. M., Cohen, T., & Salomon, J. A. (2021). Time Since Infection and Risks of Future Disease for Individuals with Mycobacterium tuberculosis Infection in the United States. Epidemiology, 32(1), 70–78. https://doi.org/10.1097/EDE.0000000000001271  

      Cudahy, P. G. T., Wilson, D., & Cohen, T. (2020). Risk factors for recurrent tuberculosis after successful treatment in a high burden setting: a cohort study. BMC Infectious Diseases, 20(1), 789. https://doi.org/10.1186/s12879-020-05515-4  

      Escombe, A. R., Ticona, E., Chávez-Pérez, V., Espinoza, M., & Moore, D. A. J. (2019). Improving natural ventilation in hospital waiting and consulting rooms to reduce nosocomial tuberculosis transmission risk in a low resource setting. BMC Infectious Diseases, 19(1), 88. https://doi.org/10.1186/s12879-019-3717-9  

      Laghari, M., Sulaiman, S. A. S., Khan, A. H., Talpur, B. A., Bhatti, Z., & Memon, N. (2019). Contact screening and risk factors for TB among the household contact of children with active TB: a way to find source case and new TB cases. BMC Public Health, 19(1), 1274. https://doi.org/10.1186/s12889-0197597-0  

      Matose, M., Poluta, M., & Douglas, T. S. (2019). Natural ventilation as a means of airborne tuberculosis infection control in minibus taxis. South African Journal of Science, 115(9/10). https://doi.org/10.17159/sajs.2019/5737

      Smith, M. H., Myrick, J. W., Oyageshio, O., Uren, C., Saayman, J., Boolay, S., van der Westhuizen, L., Werely, C., Möller, M., Henn, B. M., & Reynolds, A. W. (2023). Epidemiological correlates of overweight and obesity in the Northern Cape Province, South Africa. PeerJ, 11, e14723. https://doi.org/10.7717/peerj.14723  

      (2) Lines 46 to 48 appear to have two contradictory statements next to each other. The first says there are numerous GWAS investigating TB susceptibility; the second says there are sparse. Please clarify.

      Thank you for bringing this to our attention. We have amended the lines as follows: 

      “Numerous genome-wide association studies (GWASs) investigating TB susceptibility have been conducted across different population groups. However, findings from these studies often do not replicate across population groups (Möller & Kinnear, 2020; Möller et al., 2018; Uren et al., 2017).”

      (3) Add ref in line 69 for two SAC populations.

      Thank you for your recommendation. We have included the citation for the ITHGC meta-analysis paper here: 

      “The authors described possible reasons for the lack of associations, including the smaller sample size compared to the other ancestry-specific meta-analyses, increased genetic diversity within African individuals and population stratification produced by two admixed cohorts from the South African Coloured (SAC) population (Schurz et al. 2024).”

      (4) Write out abbreviations the first time they appear (Line 121).

      Thank you for your recommendation. We have corrected the sentence as follows: 

      “Monomorphic sites were removed. Individuals were screened for deviations in Hardy-Weinberg Equilibrium (HWE) for each SNP and sites deviating from the HWE threshold of 10-5 were removed.”

      (5) It would be good in the supplement to see if there is a SNP peak in chromosome 20 with a hit that reached significance in the Bantu-speaking African ancestry.

      Thank you for your recommendation. We have included a regional plot for the lead variant identified on chromosome 20 originating from Bantu-speaking African ancestry in the supplementary material (Supplementary Figure 3).

      (6) It would be good to mention the p-values of rs28383206 from the ITHGC paper in this cohort for KhoeSan and Bantu-speaking African ancestries. 

      Thank you for your suggestion. We have included the following paragraph from line 352:

      “The lead variant identified in the ITHGC meta-analysis, rs28383206, was not present in our genotype or imputed datasets. The ITHGC imputed genotypes using the 1000 Genomes (1000G) reference panel (4). Variant rs28383206 has an alternate allele frequency of 11.26% in the African population subgroup within the 1000G dataset (https://www.ncbi.nlm.nih.gov/snp/rs28383206). However, rs28383206 is absent from our in-house whole-genome sequencing (WGS) datasets, which include Bantu-speaking African and KhoeSan individuals. This absence suggests that rs28383206 might not have been imputed in our datasets using the AGR reference panel, potentially due to its low alternate allele frequency in southern African populations. Our merged dataset contained two variants located within 800 base pairs of r_s28383206: rs482205_ (6:32576009) and rs482162 (6:32576019). However, these variants were not significantly associated with TB status in our cohort (Supplementary Table 1).” Supplementary Table 1 can be found in the supplementary material:

      (7) It would improve the readability of the ancestry proportions listed on lines 236 and 237 if these population groups were linked with the corresponding specific population used in Figure 1, as has been done in Table 2.

      Thank you for your suggestion. We have amended Figure 1 to include the corresponding population labels mentioned in Table 2.  

      (8) In line 209, it is not clear why the number of alleles of a specific ancestry at a locus is referred to as a covariate in admixture mapping when the corresponding marginal effect is the parameter of interest. 

      Thank you for bringing this to our attention. We have amended the description as follows: 

      “(2) Local ancestry (LA) model:

      This model is used in admixture mapping to identify ancestry-specific variants associated with a specific phenotype. The LA model evaluates the number of alleles of a specific ancestry at a locus and includes the corresponding marginal effect as a covariate in association analyses.”

      (9) Table 3 would benefit from a column on whether the SNP was genotyped or imputed. 

      Thank you for your suggestion. We have included a column indicating whether the SNP was genotyped or imputed, as well as an additional column with the INFO score for imputed genotypes. 

      (10) The authors should remove the print and download icons in Figure 1 on lines 240 and 241.

      Thank you for your suggestion. We have amended the figure as requested.  

      (11) In the quality control, the authors use a more relaxed threshold for missingness in individuals (90%) and genotypes (5%) and have strayed away from the conventional 97%-98%. An explanation of the choice of these thresholds will be helpful to the reader.

      Thank you for your suggestion. We aimed to use similar genotype and individual missingness thresholds outline by the ITHGC meta-analysis (which utilised a threshold of 10% for both genotype and individual missingness) and the previous LAAA analysis paper performed by Swart et al. in 2021. We have amended line 116 for more clarity: 

      “Individuals with genotype call rates less than 90% and SNPs with more than 5% missingness were removed as described previously (5).”

      References  

      (1) Swart Y, van Eeden G, Uren C, van der Spuy G, Tromp G, Moller M. GWAS in the southern African context. Cold Spring Harbor Laboratory. 2022;

      (2) Byeon YJJ, Islamaj R, Yeganova L, Wilbur WJ, Lu Z, Brody LC, et al. Evolving use of ancestry, ethnicity, and race in genetics research-A survey spanning seven decades. Am J Hum Genet. 2021 Dec 2;108(12):2215–23.

      (3) Majara L, Kalungi A, Koen N, Tsuo K, Wang Y, Gupta R, et al. Low and differential polygenic score generalizability among African populations due largely to genetic diversity. HGG Adv. 2023 Apr 13;4(2):100184.

      (4) Schurz H, Naranbhai V, Yates TA, Gilchrist JJ, Parks T, Dodd PJ, et al. Multi-ancestry metaanalysis of host genetic susceptibility to tuberculosis identifies shared genetic architecture. eLife. 2024 Jan 15;13.

      (5) Swart Y, Uren C, van Helden PD, Hoal EG, Möller M. Local ancestry adjusted allelic association analysis robustly captures tuberculosis susceptibility loci. Front Genet. 2021 Oct 15;12:716558.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their time and the thoughtful reviews on our manuscript. The reviewers brought good points regarding the sample size, and the low exposure in the South Asian cohort owing to their unique cultural and social practices. We recognize these as limitations of the paper and discussed these in the revised version. In the revised manuscript, we have taken the key suggestions by reviewers to 1) better illustrate the analytical flow and statistical methods, in particular, to show which datasets had been used in discovery, validation, and testing of the score – as a main figure in the manuscript and in the graphical abstract; 2) demonstrate there is no possibility of overfitting in our approach using statistical metrics of performance; 3) emphasize the goal was not for discovery (e.g. our own EWAS was not used for deriving the score), but to compare with existing EWASs and contrast the results from the white European and SA populations; 4) and supplement the analysis with previously derived maternal smoking, smoking and air pollution methylation score and to explore additional health outcomes in relation to lung health in newborns. Finally, we would also like to take this opportunity to re-iterate that it was not our objective to derive the most powerful methylation score of smoking nor to demonstrate the causal role of maternal smoking on birth weight via DNAm. We have restructure the manuscript as well as the discussion to clarify this. Please find below a point-by-point response to the comments below.

      Reviewer #1:

      The manuscript could benefit from a more detailed description of methods, especially those used to derive MRS for maternal smoking, which appears to involve overfitting. In particular, the addition of a flow chart would be very helpful to guide the reader through the data and analyses. The FDR correction in the EWAS corresponds to a fairly liberal p-value threshold. 

      We thank the reviewer for these good suggestions. In the revised manuscript, we have provided a flow chart as the new Figure 1, more detailed description of the method (added a subsection “Statistical analysis” under Materials and Methods) as well as metrics including measures of fit indices such as AUC and adjusted R2 for each validation and testing dataset to illustrate there is no danger of overfitting (in new Supplementary Table 5).

      The choice of use FDR was indeed arbitrary as there has been no consensus on what significance threshold, if any, should be used in the context of EWAS. Here we simply followed the convention in previous studies to contrast the top associated signals for their effects between different populations and with reported effect sizes. Throughout the manuscript, we have removed the notion of significant associations and used the phrase “top associated signals” or “top associations” when discussion EWAS results for individual CpGs.

      Reviewer #2:

      (1) The number of mothers who self-reported any smoking was very low, much lower than in the general population and practically non-existent in the South Asian population. As a result, all analyses appeared to have been underpowered. It is possibly for this reason that the authors chose to generate their DNA methylation model using previously published summary statistics. The resulting score is not of great value in itself due to the low-powered dataset used to estimate covariance between CpG sites. In fact, a score was generated for a much larger, better-powered dataset several years ago (Reese, EHP, 2017, PMID 27323799). 

      We thank the reviewer for pointing out the low exposure in the South Asian population, which we believe is complementary to the literature on maternal smoking that almost exclusively focused on white Europeans. However, the score was validating in the white European cohort (CHILD; current smoking 3.1%), which was reasonably similar to the trend that maternal cigarettes smoking is on the decline from 2016 to 2021, from 7.2% to 4.6% (Martin, Osterman, & Driscoll, 2023). This is also consistent with the fact that CHILD participants were recruited from major metropolitans of Canada with relatively high SES and education as compared to FAMILY.

      We do agree with the reviewers that a higher prevalence of maternal smoking in the validating sample could potential improve the power of the score. Our original analytical pipeline focused on CHILD as the validation dataset; FAMILY (see the new Figure 1) was used as the testing data. We alternatively provided an analytical scheme using FAMILY as the validation dataset, as it had a higher proportion of current smokers, however, this is limited by the number of CpGs available (128 in FAMILY vs. 2,619 in CHILD out of the 2,620 CpGs from (Joubert et al., 2016)). The results of all possible combinations of validation vs. testing and restriction of targeted array vs. HM450 are summarized in the new new Supplementary Table 5 and Supplementary Figure 5.

      To clarify, our choice to construct DNAm score using published summary statistics was not an ad-hoc decision due to the observed low power from CHILD EWAS. We agree with the reviewer that our study was indeed underpowered and was not originally intended for EWAS discovery. Thus, we specifically proposed to adopt a multivariate strategy from the literature of polygenic risk scores. This approach enabled us to leverage well-powered association signals without individual-level access to data with a sample size of n > 5,000 (Joubert et al., 2016). In comparison, the Reese maternal smoking score (Reese et al., 2017) had a discovery sample size of only n = 1,057. Our score was not out-performed, in fact, the AUC in both FAMILY (external validating dataset; n=411) and CHILD (external testing dataset; n=352) and was larger than that based on the Reese score as tabulated below (part of the new Supplementary Table 5).

      Author response table 1.

      Further, regarding the comment on the covariance matrix. Indeed, lassosum via elastic-net and summary data requires a reference covariance matrix that is consistent between the discovery data and external validation data. In fact, for moderately sized correlation/covariance values (r2 > 0.1), a sample size of >100 is sufficiently powered to detect it being different from 0 and thus used for estimation. Similar to the linkage disequilibrium of genotype data, the CpGs also exhibit a block-wise correlation structure and thus the theoretical framework of lassosum extends naturally to MRS.

      In the revised manuscript, we included the Reese score, as well as a few additional scores to compare their predictiveness of smoking phenotypes in white European cohorts. We note that the applicability was limited in the FAMILY cohort that was profiled using a targeted array and only 7 out of 28 of the CpGs in the Reese score were available. As a result, though the Reese score had similar performance than our derived score in CHILD (0.94 vs. 0.95), its performance in FAMILY was compromised (0.72 vs. 0.89).

      (2) The conclusion that "even minimal smoking exposure in South Asian mothers who were not active smokers showed a DNAm signature of small body size and low birthweight in newborns" is not warranted because no analyses were performed to show that the association between DNA methylation and birth size/weight was driven by maternal smoking. 

      We thank the reviewer for this subtle point – it was not our intention to suggest there was a causal relationship between DNA methylation and birth size that was mediated by maternal smoking. We meant to suggest that the maternal smoking methylation score was consistently associated with negative outcomes in newborns of both white European and South Asian mothers despite no maternal smoking was present in South Asian mothers. It is possible that maternal smoking MRS was capturing a lot more than just smoking and second-hand smoking, such as other environmental exposures that also lead to oxidative stress. These together are associated with reduced birth size/weight.

      In the revised manuscript, we have modified the conclusion above to:

      “Notably, these results indicate a consistent association between the DNAm signature of maternal smoking and a small body size and low birthweight in newborns, in both white European mothers who exhibited some amount of smoking and in South Asian mothers who themselves were not active smokers.”

      (3) Although it was likely that some mothers were exposed to second-hand smoke and/or pollution, data on this was either non-existent or not included in this study. Including this would have allowed a more novel investigation of the effects of smoke exposure on the pregnancies of non-smoking mothers.

      We agree with this comment – second-hand smoking was captured by self-reported weekly smoking exposure by the mothers. We reported the association with smoking exposure and found that it was not consistently associated with our methylation scores across the cohorts (cohort specific association p-values of 5.4×10-5, 3.4×10-5, and 0.58, for CHILD, FAMILY, and START; original Table 3), possibly due to the low exposure in South Asian population (max weekly exposure was 42 hrs in contrast to 168 hrs in FAMILY and 98 hrs in CHILD). Meanwhile, air pollution data are currently not available. Here we additionally performed the association between maternal smoking and air pollution methylation score, using key CpGs from the largest air pollution EWAS to-date (Gondalia et al., 2021). However, there was no association between the air pollution score and any maternal smoking phenotypes (ps > 0.4).

      (4) One of the European cohorts and half of the South Asian cohort had DNA methylation measured on only 2500 CpG sites. This set of sites included only 125 sites previously linked to prenatal smoking. The resulting model of prenatal smoking was small (only 11 CpG sites). It is possible that a large model may have been more powerful.

      That is correct – also see our response to R2 comment #1. In our previous analysis, we validated two scores (one based on CpGs on the < 3,000 CpGs array and the other one for the full HM450K). The score with more CpGs indeed had slightly better performance. We included this as one of the limitations of the paper. Nevertheless, it does not impact the conclusion that the scores (based on a larger or smaller model) are transferrable to diverse populations and can be used to comparatively study the DNAm influence of maternal smoking in newborns.

      The following was added in the discussion:

      “First, the customized array with a limited number of CpGs (<3,000) was designed in 2016 and many large EWASs on smoking and maternal smoking conducted more recently had not been included.”

      (5) The health outcomes investigated are potentially interesting but there are other possibly more important outcomes of interest such as birth complications, asthma, and intellectual impairment which are known to be associated with prenatal smoking.

      We thank the reviewer for bring up this point. One of the key health outcomes in the CHILD study was asthma, and data at later time points are available. However, we do not have similar outcomes collected in the other two studies (FAMILY and START), which focused on cardiometabolic health in young children. Thus, we did not initially include outcomes that were not available across all cohorts as the intention was to contrast the effects between populations.

      We recognize that this is an important question and decided to provide the association results for asthma and allergy at available time points in CHILD, FAMILY, and START. We also included mode of delivery via emergency C-section as an additional proxy outcome of birth complications. However, none of these were marginally (p < 0.05) associated with the DNAm smoking score. These are now included in the updated Supplementary Table 8.

      Reviewer #1 (Recommendations For The Authors):

      (1) The number of samples in the South Asian birth cohort given in the abstract (n = 887) does not match the sample size of the START cohort from the results section (results, page 7, line 139, n = 880). It is also different from the final analytical dataset size from the methods section (page 17, line 386, n = 890). Please clarify. 

      We thank the reviewer for pointing this out. In the abstract, it was the final sample sized used for EWAS (no missingness in smoking history). The 880 in result was a typo for 890, which contains three individuals with missing smoking data. These have been updated with the correct sample size for START cohort that had full epigenome-wide methylation data (n = 504, and 503 with non-missing smoking history).

      (2) Page 3, line 54: "consistent signal from the GFI1 gene (ps < 5×10-5)". Is ps a typo? If not then it might be clearer to state how many sites this included. 

      No, these summarized the six CpG sites in the GFI1 gene as outlined in Table 2. We have clarified in the abstract to show the number of CpG sites included.

      (3) Please report effect sizes together with information about the statistical significance (p values). 

      We have updated the manuscript with (standardized) effect sizes whenever possible along with p-values.

      (4) Page 4, line 80. This paragraph could be improved by adding a sentence explaining DNA methylation. 

      We thank the reviewer for this suggestion. A sentence was included to introduce DNAm at the beginning of the second paragraph:

      “DNA methylation is one of the most commonly studied epigenetic mechanisms by which cells regulate gene expression, and is increasingly recognized for its potential as a biomarker (13).”

      (5) Page 4, line 84. Sentence difficult to understand, please rephrase: "Our recent systematic review of 17 cord blood epigenome-wide association studies (EWAS) demonstrated that out of the 290 CpG sites reported, 19 sites were identified in more than one study; all of them associated with maternal smoking". 

      We have revised to clarify the review was on cord blood EWAS with five outcomes: maternal diabetes, pre-pregnancy body mass index, diet during pregnancy, smoking, and gestational age.

      “Our recent systematic review of 17 cord blood epigenome-wide association studies (EWAS) found that out of the 290 CpG sites reported to be associated with at least one of the following: maternal diabetes, pre-pregnancy body mass index (BMI), diet during pregnancy, smoking, and gestational age, 19 sites were identified in more than one study and all of them associated with maternal smoking.”

      (6) Page 5, line 93. The second part of the sentence is not necessary: "The majority of cohort studies have focused on participants of European ancestry, but few were designed to assess the influence of maternal exposures on DNA methylation changes in non-Europeans". 

      We have revised accordingly to:

      “Only a handful of cohort studies were designed to assess the influence of maternal exposures on DNA methylation changes in non-Europeans.”

      (7) Page 5, line 95. "It has been suggested that ancestral background could influence both systematic patterns of methylation (27), such as cell composition and smoking behaviours (28)". The sentence is slightly unclear. Could it be rephrased to say that cell composition differences may be present by ancestry, which can lead to differential DNAm patterns? 

      We have revised accordingly to:

      “It has been suggested that systematic patterns of methylation (Elliott et al., 2022), such as cell composition, could differ between individuals of different ancestral backgrounds, which could in turn confound the association between differential DNAm and smoking behaviours (Choquet et al., 2021).”

      (8) Page 5, line 108. How does reducing the number of predictors lead to more interpretable effect sizes? 

      This was meant as a general comment in the context of variable selection, whereby the fewer predictors there are, the effect size of each predictor becomes more interpretable. However, we recognize this comment might be irrelevant to the specific approaches we adopted. We have revised it to motivate methylation score as a powerful instrument for analysis:

      “Reducing the number of predictors and measurement noise in the data can lead to better statistical power and a more parsimonious instrument for subsequent analyses.”

      (9) Page 5, line 112. Health consequences seem a bit strong, given that the analysis describes correlations/associations. 

      We have revised it to “association with”:

      “In this paper, we investigated the epigenetic signature of maternal smoking on cord blood DNA methylation in newborns, as well as its influence on newborn and later life outcomes in one South Asian which refers to people who originate from the Indian subcontinent, and two predominantly European-origin birth cohorts.”

      Results

      (10) It would be very helpful to have a flow diagram to detail all of your analyses.

      We thank the reviewer for this suggestion. In the revised manuscript, we have provided a flow chart as the new Figure 1, updated the summary of analysis in . Table 3, and added a new Supplementary Table 5 for the DNAm score derivation, as well as more detailed description of the statistical analysis in the Materials and Methods under the subsection “Statistical analysis”.

      (11) Page 7, line 138. Please add a reference to the CHILD study. 

      We have added a reference of the CHILD study.

      (12) Tables in results and in supplemental data a) contain a mixture of fields describing the newborn and its mother (this is not true for Supplementary Table 2), b) lack column descriptions, c) lack descriptions of abbreviations and formatting used in tables, d) use different font types, e) lack descriptions of statistical tests that were used to obtain p-values, f) use inconsistent rounding. Please correct and add the missing information.

      We have consolidated the notation and nomenclature in all Tables and text. All numerical results are now rounded to 2 decimal places. The tests used were included in the Table headers as well as described in the Materials and Methods:

      “For continuous phenotypes, an analysis of variance (ANOVA) using the F-statistics or a two-sample t-test was used to compare the mean difference across the three cohorts or two groups, respectively. For categorical phenotypes, a chi-square test of independence was used to compare the difference in frequencies of observed categories. Note that three of the categories under smoking history in the START cohort had expected cell counts less than 5, and was thus excluded from the comparison, the reported p-value was for CHILD and FAMILY.”

      (13) Table 1. Sample sizes given in column descriptions do not add up to 1,650 (legend text).

      We thank the reviewer for pointing this out. The updated sample size is 1,267, based on the 352 CHILD samples, 411 FAMILY samples, and 352 START samples. Notice that we did not remove those without full smoking history data as Table 1 was intended for the epigenetic subsamples.

      (14) Page 7, line 156. Supplementary Tables are incorrectly numbered. In the text, Supplementary Table 4 comes after Supplementary Table 2.

      We thank the reviewer for catching this and have corrected the ordering of the Supplementary Tables and Figures. 

      (15) Page 7, line 158. "cell compositions" - do you mean estimated white cell proportions? 

      We have revised it to “estimated cord blood cell proportions” in the text throughout.

      (16) Smoking EWAS - do you see any overlap/directional consistency with the top findings from adult EWASs of smoking such as AHRR? 

      We annotated the top EWAS signals from the literature in the meta-analysis (new Figure 2; Supplementary Figures 1 and 3), but was only able to confirm associations in the GFI1 gene. The AHRR signals were also annotated, but below the FDR correction threshold as seen in new Figure 2 at the start of chromosome 5. We further added a new Supplementary Figure 3 to show the directional consistency with top findings (2,620 CpGs reported and 128 CpGs overlapped with our meta-analysis) from Joubert et al., 2016. The Pearson’s correlation coefficient with meta-analyzed effect for maternal smoking was 0.72 and for smoking exposure was 0.60.

      We added the following to Results:

      “Further, we observed consistency in the direction of association for the 128 CpGs that overlapped between our meta-analysis and the 2,620 CpGs with evidence of association for maternal smoking (19) (Supplementary Figure 3). Specifically, the Pearson’s correlation coefficient for maternal smoking and weekly smoking exposure was 0.72 and 0.60, respectively.”

      (17) Page 8, line 169. "also coincided with the GFI1 gene" this is a bit imprecise. Please report the correlation with the CpG from the maternal smoking analysis. 

      The CpG was inside the GFI1 gene, we have included the Pearson’s correlation with the top hit in the text below:

      “There were no CpGs associated with the ever-smoker status at an FDR of 0.05, though the top signal (cg09935388) was also mapped to the GFI1 gene (Pearson’s r2 correlation with cg12876356 = 0.75 and 0.68 in CHILD and FAMILY, respectively; Supplementary Figure 1).”

      (18) Page 8, line 171. Typo "ccg": "ccg01798813". 

      It has been corrected to “cpg01798813”.

      (19) Page 8, line 176. Please be clear about the phenotype used in these analyses. 

      The EWAS of weekly smoking exposure in START was removed in this version of the manuscript, in reflection of the results and the reviewer’s comments, as a result of this phenotyping being skewed and possibly leading to only spurious results (also see response to comment #20).

      We have clarified the phenotypes for these results under “Epigenetic Association of Maternal Smoking in White Europeans” below:

      “The maternal smoking and smoking exposure EWASs in CHILD did not yield any CpGs after FDR correction (Supplementary Figure 3).”

      (20) What was the genomic inflation for the EWASs? 474 loci in the South Asian EWAS seems like a lot of findings. Perhaps a more robust method (e.g., OSCA MOMENT) might help to control the false positive rate. 

      The genomic inflation factor was moderately across the cohorts for smoking exposure: 1.02 in CHILD, 0.94 in FAMILY, and 1.00 in START. However, there was more inflation in the tail of the distribution in START than the European cohorts. The empirical type I error rates at 0.01, 0.001, 0.00001, were high in START (x1.7, x5.7, and x165 times at each respective threshold), in contrast to CHILD (x1.06, x1.05, and x0.6) or FAMILY (x1.6, x1.9, and 0). The smoking exposure EWAS based on START was thus removed as these are likely false positives and there was very low smoking exposure to start with (11 reported weekly exposure between 2–42 hrs/week out of 462 with non-missing data). We have added the QQ-plots as well as the genomic inflation factor for the reported meta-analysis in the new Supplementary Figure 2. The following was added to the Results:

      “There was no noticeable inflation of empirical type I error in the association p-values from the meta-analysis, with the median of the observed association test statistic roughly equal to the expected median (Supplementary Figure 2).”

      (21) What is the targeted array? I don't think it has been introduced prior to this point. 

      We introduced it in the Materials and Methods under subsection “Methylation data processing and quality controls”. Considering this comment and previous comments on the ordering of Tables and Figures, we have decided to place Materials and Methods after Introduction and before Results.

      (22) The MRS section is described poorly in the results section. It is not clear where the 11 or 114 CpGs come from.

      We now include an analytical summary of all scores (derived or external from literature) in the new Supplementary Table 5. Further, we updated the description of scores in Materials and Methods under the subsection “Using DNA Methylation to Construct Predictive Models for Maternal Smoking” to clarify the source and types of MRSs derived:

      “To evaluate whether the targeted GMEL-EPIC array design has comparable performance as the epigenome-wide array to evaluate the epigenetic signature of maternal smoking, a total of three MRSs were constructed, two using the 128 CpGs available in all cohorts – across the HM450K and targeted GMEL-EPIC arrays – and with either CHILD (n = 347 with non-missing smoking history) or FAMILY (n = 397) as the validation cohort, and another using 2,107 CpGs that were only available in CHILD and START samples with CHILD as the validation cohort. Henceforth, we referred to these derived maternal smoking scores as the FAMILY targeted MRS, CHILD targeted MRS, and the HM450K MRS, respectively.”

      (23) Page 9, line 187. "There was no statistically significant difference between the two scores in all samples (p = 1.00) or among non-smokers (p = 0.24).". How was the significance assessed? Please describe the models (outcome, covariates, model type) used for comparing the two models. It would also be good to report the correlation between the scores.

      We have added a subsection “Statistical analysis” under Materials and Methods that described the tests. The correlation between scores is now summarized as a heatmap across all cohorts in the new Supplementary Figure 6.

      “For each cohort, we contrasted the three versions of the derived scores using an analysis of variance analysis (ANOVA) along with pairwise comparisons using a two-sample t-test to examine how much information might be lost due to the exclusion of more than 10-fold CpGs at the validation stage. We also examined the correlation structure between all derived and external MRSs using a heatmap summarizing their pairwise Pearson’s correlation coefficient.”

      (24) Please include the number of samples in the training/validation and in the test set in the methods and in the results.

      We thank the reviewer for this suggestion. In the revised manuscript, we have provided a flow chart as the new Figure 1 and more detailed description of the method in the Materials and Methods. Please also see response to comment #22. The training sample size is based on Joubert et al., (2016), which is 5,647. For our main analyses, the validation sample with non-missing phenotypes remained the CHILD cohort (n=347), while the FAMILY (n=397) and START (n=503) samples were the independent testing data. We alternatively provided another scenario, in which the FAMILY sample was the validation cohort, while CHILD and START were the testing cohorts. The exact sample size and performance metrics for each scenario and score are clearly summarized in the new Supplementary Table 5.

      (25) Table 3. Please clarify the type of information contained in the four last columns (p-value?).

      Yes – these are the individual cohort p-values. We have taken the suggestion from comment #12 to fully describe all columns and fields.

      (26) Page 10, line 215: "The meta-analysis revealed no heterogeneity in the direction nor the effect size of associations between populations". Please quote/refer to the results. 

      In the revision, the heterogeneity p-values were quoted and the relevant tables (Supplementary Table 8) were added to this sentence.

      (27) Figure 2 has issues with x labels. Due to the low number of ever smokers in START, the boxplot may not be the best visualisation method. It would also benefit from listing n's per group.

      We appreciate this comment to improve the figure presentation. We increased the font size for the X-labels. The sample size for each group in START was also labeled in the new Figure 3 (previously Figure 2).

      Discussion

      (28) Studying the association between maternal smoking and cord blood DNAm is interesting from a biological perspective as it allows for assessing the immediate and long-term effects of maternal smoking on newborn health. However, in terms of calculating the MRS, what are the benefits of using cord blood over the mother's blood? We know that blood-based DNAm smoking score is a powerful predictor of long-term smoking status. 

      The reviewer raises an interesting point – abundant literature supports that DNAm changes are tissue-specific. While mother’s blood DNAm smoking score reflect the long-term exposure to smoking in mothers, the cord blood DNAm captures the consequence of such long-term exposure for newborn health. One of the key results of our study is showing that established DNAm signatures of maternal smoking, which is known to mediate birth size and weight in white Europeans (these references were cited in the original manuscript), carries the same effect of reducing birth weight and size in the South Asian population. This is a critical finding from a DoHaD and public health perspective, as DNAm signatures of maternal smoking, irrespective of the smoking status of the mother, can influence the health trajectory of the newborns.

      We have expanded our discussion based on this suggestion to highlight the unique features of studying maternal smoking via different tissues and their implications. The following was added to the discussion:

      “There are several advantages of using a cord blood based biomarker from the DoHaD perspective. Firstly, cord blood provides a direct reflection of the in utero environment and fetal exposure to maternal smoking. Additionally, since cord blood is collected at birth, it eliminates potential confounding factors such as postnatal exposures that may affect maternal blood samples. Furthermore, studying cord blood DNAm allows for the assessment of epigenetic changes specifically relevant to the newborn, offering valuable information on the potential long-term health implications.”

      (29) Page 13, line 285: "Fourth" without "third".

      It has been revised accordingly.

      Methods 

      (30) The methods section does not contain all the details required to replicate the analysis. Whenever statistical analysis is conducted, this section should clearly describe the type of the analysis (linear regression, t-test, etc.) and name the dependent and independent variables. Sample sizes should also be given. 

      We added further details of test used and sample size for each analysis. We have also included a new “Statistical analysis” subsection under Materials and Methods.

      (31) Please describe MRS testing in the methods.

      We tested MRS with respect to binary and continuous smoking phenotypes using a logistic and linear regression, respectively. The predictive value was assessed using area under the roc curve for the binary outcome and an adjusted R2 for the continuous outcome. These were added to the new “Statistical analysis” subsection under Materials and Methods. See response to comments #22-24, and #30.

      (32) Please describe the methods used to compare the two versions of MRS for maternal

      smoking.

      It was a two-sample t-test, which was described in the Figure legends. We have now added this to the new “Statistical analysis” subsection under Materials and Methods.

      (33) Please describe testing the associations between MRS and Offspring Anthropometrics in more detail.

      We added further details on the regression model and the test for association in the methods. We have now added this to the new “Statistical analysis” subsection under Materials and Methods.

      (34) Meta analysing the 450k and GMEL arrays is going to substantially reduce the number of CpGs under investigation.

      We agree with the reviewer that this is not optimal for signal discovery. However, this is the only way we could synthesize evidence across the cohorts as FAMILY samples were only processed using the customized array. We added the following as a limitation of the study in the discussion.

      “First, the customized array with a limited number of CpGs (<3,000) was designed in 2016 and many large EWASs on smoking and maternal smoking conducted more recently had not been included.”

      (35) Page 16, line 364: GDM abbreviation was used in the results section (line 145), yet it is introduced in line 364. 

      Thank you for catching this, we have removed the duplicate.

      (36) Page 17, line 381: Given the stated importance of ancestry, why not restrict the sample to genetically confirmed groups?

      The reviewer has a valid point that ancestry, either perceived or genetic, can introduce additional heterogeneity due to potential differences in genetics, cultural and social practices, and lifestyles. Genetic data are indeed available for a subset of the individuals. In the original version of the manuscript, we used a stringent ancestry calling method by mapping all individuals with the 1000 Genomes samples from continental populations. The final definition was based on a combination of self-reported and genetically confirmed ancestry. However, if we restricted only to genetically confirmed groups, the sample size would be reduced to 312 (vs. 411), 268 (vs. 352), and 488 (vs. 504) in FAMILY, CHILD, and START, respectively.

      We compared the mean difference in the beta-values of the top associated CpGs and the derived MRS between those genetically confirmed vs. self-reported ancestral groups, and observed no material difference. These results are now included in the Supplementary Materials as part of the sensitivity analysis. Thus, given these considerations, we decided to use this complementary approach to retain the maximum number of samples while ensuring some aspect of ancestral homogeneity.

      “To maximize sample size in FAMILY and CHILD, we retained either self-identified or genetically confirmed Europeans based on available genetic data (Supplementary Table 1).”

      (37) Page 18, line 397: sensitivity analysis not sensitive analysis.

      Thank you for catching this, we have revised accordingly.

      (38) Page 18, line 409: smoking was rank transformed however, it would be good to see regression diagnostics for the lead loci in the EWAS to check that assumptions were met. 

      We thank the reviewer for this suggestion. Smoking exposure is indeed skewed and in fact very much zero-inflated across the cohorts. The raw phenotype violated several model assumptions in terms of variance heteroskedasticity, outlying values (influential points), and linearity. The diagnostics suggested improved deviation from model assumption, yet some aspects of the violation remained at a lesser degree. We included a comparison of results before and after transformation and model diagnostics for the lead CpG using CHILD and FAMILY data in the Supplementary Materials. The following was added to the results:

      “As a sensitivity analysis, we repeated the analysis for the continuous smoking exposure under rank transformation vs. raw phenotype for the associated CpG in GFI1 and examined the regression diagnostics (Supplementary Material), and found that the model under rank-transformation deviated less from assumptions.”

      (39) Page 19, line 418: FDR seems quite a lenient threshold, especially when genome-wide significance thresholds exist. I would be inclined to view the EWAS findings as null.

      The choice of use FDR to was indeed arbitrary as there has been no consensus on what significance threshold, if any, should be used in the context of EWAS. The significance threshold for GWAS (Pe’er et al., 2008) probably does not apply directly to EWAS as the number of effective tests will likely differ between genome-wide genetic variants and CpGs. The Bonferroni corrected p-value threshold in this context would be 0.05/200,050=2.5´10-7, which is still less stringent than the GWAS significance threshold. We originally decided to follow the convention of previous studies and use FDR to filter out a subset of plausible associations to contrast the top association signals for their effects between different populations and with reported effect sizes.

      We have revised the manuscript throughout by removing the notion of significant associations, and instead used the phrase “top associated signals” or “top associations” when discussion EWAS results for individual CpGs. The following was added to Materials and Methods to clarify the choice of our threshold:

      “For each EWAS or meta-analysis, the false discovery rate (FDR) adjustment was used to control multiple testing and we considered CpGs that passed an FDR-adjusted p-value < 0.05 to be relevant for maternal smoking.”

      (40) I do not understand Supplementary Figure 6 - how have the data been standardised? Why not plot the CpGs on the beta-value scale?

      The standardized values were plotted as the reported p-values for the mean and variance equality tests (i.e. ANOVA F-test, Levene’s test, Anderson-Darling test) were based on these transformed values to reduce inflation due to non-normality. We have since removed this comparison and kept only the comparison of the overall score as the number of CpGs in the HM450k score (143 CpGs) for comparison is too high to be visually interpretable.

      (41) It is my understanding, that the MRS for maternal smoking was constructed using external weights projected and regularised using elastic net (effectively trained) in CHILD cohort. The results section discusses associations between maternal smoking history and outcomes in CHILD, FAMILY, and START. Training and testing the score in the same sample (cohort) may result in overfitting and therefore should not be implemented.

      The original MRS was constructed using external weights from an independent discovery sample (Joubert et al., 2016; n > 5,000) and the LASSO validation was done in CHILD (n = 352), external testing was in FAMILY and START. This was the lassosum framework whereby we leverage larger sample size from external studies to select more plausible CpGs as candidates to include in the model. Thus, training, validation, and testing were not done in the same samples. We have included a Figure 1 to illustrate the updated analytical flow and a graphical abstract to summarize the methods.

      (42) Is it a concern that the findings don't seem to replicate Joubert's results, which came from a much larger study?

      Replication is usually done in samples much larger than the discovery samples, thus it is not a concern that we were unable to confirm all signals from Joubert et al., (2016). However, 6/7 of the top associations (FDR adjusted p-value < 0.05) in the meta-analysis were declared as significant in Joubert et al. (2016). In addition, the fact that using Joubert’s summary statistics, we were able to derive MRSs that were strongly associated with both smoking history and weekly exposure suggests shared signals. Also see response to  R1 comment #16 for a comparison of effect consistency.

      (43) Please check that all analysis scripts have been uploaded to Github and that the EWAS results are publicly available.

      We thank the reviewer for this suggestion. All updated scripts and EWAS results are available on Github. We are working to have the results also submitted to EWAS catalog.

      Reviewer #2 (Recommendations For The Authors):

      The impact of this study is reduced due to previous findings:

      (1) Previous studies have already shown that DNA methylation may mediate the effect of maternal smoking on birth size/weight (see e.g.https://doi.org/10.1098/rstb.2018.0120https://doi.org/10.1093/ije/dyv048).

      We thank the reviewer for this point and would like to take the opportunity to clarify that it was not our objective to examine whether there was a causal relationship, between DNA methylation and birth size that was mediated by maternal smoking. One of the key messages of our study is to evaluate whether epigenetic associations – at individual CpGs and aggregated as a score – are consistent between white European and South Asian populations. One way to examine this is through using established DNAm signatures of maternal smoking, which is known to mediate birth size and weight in white Europeans (these references were cited in the original manuscript), and confirm whether they also carry the same effect on birth outcomes in the South Asian population.

      Indeed, our results support that maternal smoking methylation score was consistently associated with negative outcomes in newborns of both white European and South Asian mothers despite no maternal smoking was present in South Asian mothers. These collective point to the possibility that the maternal smoking MRS was capturing a lot more than just smoking and second-hand smoking, but potentially other environmental exposures that also lead to oxidative stress. These together are associated with health consequences, including reduced birth size/weight. One of the candidates for such exposure is air pollution as some of the maternal smoking CpGs were previously linked to air pollution. However, we were unable to assess this hypothesis directly without the air pollution data, and the air pollution methylation score was not associated with smoking history (Supplementary Figure 5) nor smoking exposure (p > 0.4 in CHILD, FAMILY and START).

      The following was added to Materials and Methods under the subsection Using DNA Methylation to Construct Predictive Models for Maternal Smoking:

      “To benchmark and compare with existing maternal smoking MRSs, we calculated the Reese score using 28 CpGs (48,49),  Richmond score using 568 CpGs (49), Rauschert score using 204 CpGs (50), Joubert score using all 2,620 CpGs with evidence of association for maternal smoking (19), and finally a three-CpG score for air pollution (51). The details of these scores and score weight can be found in Supplementary Table 4.”

      The following was added to Results

      “Both produced methylation scores that were significantly associated with maternal smoking history (ANOVA F-test p-values =1.0×10-6 and 2.4×10-14 in CHILD and  6.9×10-16 and <2.2×10-16 in FAMILY), and the best among alternative scores for CHILD and FAMILY (Supplementary Table 5). With the exception of the air pollution MRS, all remaining scores were marginally associated with smoking history in both CHILD and FAMILY (Supplementary Figure 5).”

      (2) Due to the small study size and low levels of prenatal smoke exposure, the model derived here is of little value and is, in fact, superseded by a previously published model (PMID: 27323799). At the very least, the model should be evaluated here. A novel aspect of this study is the inclusion of a South Asian cohort. Unfortunately, smoke exposure is practically non-existent, so it is unclear how it can be used. The more interesting finding in this study is the possibility that environmental factors such as second-hand smoke or pollution may have similar effects on pregnancies as maternal smoking. Are these available? If so, they could be evaluated for associations with DNA methylation. This would be novel. 

      In the revised manuscript, we included the Reese score (Reese et al., 2017) and a few other maternal smoking scores for comparison. In the CHILD cohort, the performance was comparable to our derived score (AUC of 0.95 vs. 0.94 for Reese score), but its applicability was limited since the FAMILY dataset was profiled using a targeted array and only 7 out of 28 of the CpGs in the Reese score were available (AUC of 0.89 vs. 0.72 for Reese). As compared to the remaining scores from literature (see the new Supplementary Table 5 for complete results), Reese’s score has generally favorable performance.

      We did examine second-hand smoking in the original manuscript, showing a significant association with weekly maternal smoking exposure (original Table 3 and Supplementary Table 8). However, air pollution data is not available for assessment.

      (3) The other novel aspect is the evaluation of associations with outcomes later in life. Height and weight are interesting but impact could be gained by including other relevant outcomes such as birth complications, asthma, and intellectual impairment which are known to be associated with prenatal smoking. 

      We thank the reviewer for bring up this point. One of the key health outcomes in the CHILD study was asthma, and data at later time points are available. However, we do not have similar outcomes collected in the other two studies (FAMILY and START), which focused on cardiometabolic health in young children. Thus, we did not initially include outcomes that were not available across all cohorts as the intention was to contrast the effects between populations.

      We recognize that this is an important question and decided to provide the association results for mother reported asthma and allergy, but based on different definitions as these outcomes cannot be harmonized across the cohorts. We also included mode of delivery via emergency C-section as an additional proxy outcome of birth complication.

      The following was added to Materials and Methods:

      “Mode of delivery (emergency c-section vs. other) was collected at the time of delivery.”

      “Additional phenotypes included smoking exposures (hours per week) at home, potential allergy based on mother reporting any of: eczema, hay fever, wheeze, asthma, food allergy (egg, cow milk, soy, other) for her child in FAMILY and START, and asthma based on mother’s opinion in CHILD (“In your opinion, does the child have any of the following? Asthma”).”

      The following was added to Results:

      “The maternal smoking MRS was consistently associated with increasing weekly smoking exposure in children reported by mothers at the 1-year (0.51±0.15, FDR adjusted p= 0.0052) , 3-year (0.53±0.16, FDR adjusted p= 0.0052), and 5-year (0.40±0.15, FDR adjusted p= 0.021) visits with similar effects.”

      “We did not find any association with self-reported allergy or asthma in children at later visits (Supplementary Table 8). Further, there was no evidence of association between the MRS and any maternal outcomes (Supplementary Table 8).”

      REFERENCES:

      Gondalia, R., Baldassari, A., Holliday, K. M., Justice, A. E., Stewart, J. D., Liao, D., . . . Whitsel, E. A. (2021). Epigenetically mediated electrocardiographic manifestations of sub-chronic exposures to ambient particulate matter air pollution in the Women's Health Initiative and Atherosclerosis Risk in Communities Study. Environ Res, 198, 111211. doi:10.1016/j.envres.2021.111211

      Joubert, B. R., Felix, J. F., Yousefi, P., Bakulski, K. M., Just, A. C., Breton, C., . . . London, S. J. (2016). DNA Methylation in Newborns and Maternal Smoking in Pregnancy: Genome-wide Consortium Meta-analysis. Am J Hum Genet, 98(4), 680-696. doi:10.1016/j.ajhg.2016.02.019

      Martin, J. A., Osterman, M. J. K., & Driscoll, A. K. (2023). Declines in Cigarette Smoking During Pregnancy in the United States, 2016-2021. NCHS Data Brief(458), 1-8. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/36723453

      Reese, S. E., Zhao, S., Wu, M. C., Joubert, B. R., Parr, C. L., Haberg, S. E., . . . London, S. J. (2017). DNA Methylation Score as a Biomarker in Newborns for Sustained Maternal Smoking during Pregnancy. Environ Health Perspect, 125(4), 760-766. doi:10.1289/EHP333

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The individual roles of both cosolvents and intrinsically disordered proteins (IDPs) in desiccation have been well established, but few studies have tried to elucidate how these two factors may contribute synergistically. The authors quantify the synergy for the model and true IDPs involved with desiccation and find that only the true IDPs have strong desiccation tolerance and synergy with cosolvents. Using these as model systems, they quantify the local (secondary structure vis-a-vi CD spectroscopy) and global dimensions (vis-a-vi the Rg of SAXS experiments) and find no obvious changes with the co-solvents. Instead, they focus on the gelation of one of the IDPs and, using theory and experiments, suggest that the co-solvents may enable desiccation tolerance, an interesting hypothesis to guide future in vivo desiccation studies. A few minor points that remain unclear to this reviewer are noted.

      Strengths:

      This paper is quite extensive and has significant strengths worth highlighting. Notably, the number and type of methods employed to study IDPs are quite unusual, employing CD spectroscopy, SAXS measurements, and DSC. The use of the TFE is an exciting integration of the physical chemistry of cosolvents into the desiccation field is a nice approach and a clever way of addressing the gap of the lack of conformational changes depending on the cosolvents. Furthermore, I think this is a major point and strength of the paper; the underlying synergy of cosolvents and IDPs may lie in the thermodynamics of the dehydration process.

      Figure S6A is very useful. I encourage readers who are confused about the DSC analysis, interpretation, and calculation to refer to it.

      Weaknesses:

      Overall, the paper is sound and employs strong experimental design and analysis. However, I wish to point out a few minor weaknesses.

      Perhaps the largest, in terms of reader comprehension, focuses on the transition between the model peptides and real IDPs in Figures 1 and 2. Notably, little is discussed with respect to the structure of the IDPs and what is known. Notably, I was confused to find out when looking at Table 1 that many of the IDPs are predicted to be largely unordered, which seemed to contrast with some of the CD spectroscopy data. I wonder if the disorder plots are misleading for readers. Can the authors comment more on this confusion? What are these IDPs structurally?

      We apologize for the confusion caused here and thank the reviewer for this astute observation. Our CD spectroscopy data suggests all LEA proteins are almost entirely disordered under aqueous conditions, with a single major minimum at 200 nm, although most have a small inflection around 220 nm, indicating a small proportion of helicity (Fig. 3A). The notable exception here is CAHS D, which – in line with our work and the work of many others – possesses a substantial degree of transient helicity in the linker region (residues 100-200), giving rise to a more pronounced minimum at 220 nm. These conclusions are consistent with our SAXS data (Fig. 4), which predict a radius of gyration far larger than a globular folded protein of the same number of residues should have (15-20 Å). The structural predictions (both Metapredict and AlphaFold2), however, imply several of the proteins to be ordered; AvLEA1C and HeLEA68614 are both predicted to have large folded regions based on metapredict disorder scores. We believe this is an erroneous prediction driven by these regions' propensity to acquire helicity in the context of desiccation (Fig 3B) and/or when interacting with clients. As such, our computational analysis is at odds with the experimental data because these proteins are all poised to undergo a coil-to-helix transition, an effect our parallel work has proposed is important for their function (see Biswas et al. Prot. Sci. 2024). The ability of AlphaFold2 to predict bound-state or transient helices has been previously documented (Alderson et al PNAS 2023)

      To address this discrepancy, the caption for Table 1 reads: “We note that the reason many of these profiles contain large folded regions is because the amphipathic LEA and CAHS proteins are predicted to form helices, which metapredict infers and incorrectly highlights these regions as ‘folded’ when really they are disordered in isolation”. We have also added additional context and information to the caption for Fig. S9 “We note that the structural predictions from AlphaFold2 contain largely ordered structures. We believe this is due to the propensity of these proteins to form helices in the context of drying or when interacting with a client. This has been shown in cases where an IDR contains residual helicity or is folded upon binding [70].”

      Related to the above thoughts, the alpha fold structures for the LEA proteins are predicted (unconfidently) as being alpha-helical in contrast to the CD data. Does this complicate the TFE studies and eliminate the correlation for the LEA proteins?

      AlphaFold2 predicted helicity in disordered regions is commonly observed, and thought to indicate a possible “bound” helical state (Alderson et al. PNAS 2023). As shown by the CD data, in aqueous conditions no secondary structure exists. It is only in the desiccated state - the path to which requires proteins to reach excessively high concentrations - that this secondary structure appears. Underlying our TFE model is that the AlphaFold2 predicted secondary structure is indicative of the state the proteins are in at high abundance, which occurs as cells ramp up protectant expression and as water is removed from the system. Under these assumptions, the CD data is in agreement with the AlphaFold2 predictions, and our analysis holds. This is explained in the methods under “Transfer Free Energy (TFE) Calculations” - but we have now also added an additional sentence to this effect in the main text: “Using a similar AlphaFold2-based approach for LEA proteins and for BSA, we can make correlations between the Gtr of the disorder-to-order transition and synergy (Fig. S8F-K). Interestingly, AlphaFold2 predictions of our LEA proteins were broadly helical, which is in contrast to our experimental characterization of these proteins in aqueous solutions. However, this is not unusual for AlphaFold2 predictions and could possibly represent a “bound” conformation for the proteins [70].”

      Additionally, the notation that the LEA and BSA proteins do not correlate is unclear to this reviewer, aren't many of the correlations significant, having both a large R^2 and significant p-value?

      We thank the reviewer for pointing this out. While BSA and some LEA proteins have values that correlate with synergy, there’s more to consider in assessing the relevance of these correlations. For example, we cannot claim that the value is physiologically relevant without observing an actual structural change in the protein. Furthermore, several of these proteins (BSA and AvLEA1C) were found to be not significantly synergistic in the LDH assay, and any correlation should, therefore, also be considered non-significant. We have added a sentence to the results to clarify this: “For a subset of these proteins, we see a statistically significant correlation between G and synergy. However, this data is purely computational. For CAHS D, we saw our predictions recapitulated in changes in the protein structure, and for the LEA proteins we do not. Thus, we conclude that cosolutes do not induce synergy in our LEA proteins through a change in folding.”

      The calculation of synergy seems too simplistic or even problematic to me. While I am not familiar with the standards in the desiccation field, I think the approach as presented may be problematic due to the potential for higher initial values of protection to have lower synergies (two 50%s for example, could not yield higher than 100%).

      We acknowledge the reviewer’s concern about our synergy calculation. We would like to highlight the use of sub-optimal protective concentrations in our synergy assays similar to studies previously reported in the desiccation field (Nguyen et al. 2022; Kim et al. 2018).

      As the reviewer pointed out, we agree that there is a theoretical 100% threshold in our experiments which if we hit, we cannot distinguish between individual additive vs synergistic effects. To avoid the situation of reaching the near maximal protection levels (~100%), we intentionally select a sub-optimal concentration of the protectants that are below the maximum efficacy level for individual protectants to use in our assays. This limits the potential for initial higher values of the protectants so that their combined effect is not maximized, and there is always the potential for synergy. We would also like to point out that we never actually hit that 100% threshold in any of our synergy experiments, which warrants that any observed increase in protection is attributed to a true synergistic effect between the protectants.

      Instead, I would think one would need to really think of it as an apparent equilibrium constant between functional and non-functional LDH (Kapp = [Func]/[Not Func] and frac = Kapp/(1+Kapp) or Kapp = frac/(1-frac) ) Then after getting the apparent equilibrium constants for the IDP and cosolvent (KappIDP and KappCS), the expected additive effect would be frac = (KappIDP+KappCS)/(1+KappIDP+KappCS).

      Consequently, the extent of synergy could be instead calculated as KappBOTH-KappIDP-KappCS. Maybe this reviewer is misunderstanding. It is recommended that the authors clarify why the synergy calculation in the manuscript is reasonable.

      We thank the reviewer for this suggestion. In the desiccation field, the synergy calculations that we used is the standard method that people use, so that’s what we present in our main manuscript. However, we have now quantified synergy through two new approaches: one, as suggested by the reviewer, using the equilibrium constant (Kapp) as a metric, and the other using the Bliss Independent model, which is a common approach for calculating synergy in drug combination studies. We see minimal differences in terms of the synergy scores using these different methods. We have included the results for these additional methods in supplemental figure S3.

      Related to the above, the authors should discuss the utility of using molar concentration instead of volume fraction or mass concentration. Notably, when trehalose is used in concentration, the volume fraction of trehalose is much smaller compared to the IDPs used in Figure 2 or some in Figure 1. Would switching to a different weighted unit impact the results of the study, or is it robust to such (potentially) arbitrary units?

      We thank the reviewer for this comment. Indeed, in studies of cosolute effect, concentration units can alter the conclusions of the study (Auton and Bolen 2004). In our case, the relevant figures where we use a concentration scale (1B and 2B) are not germane to the main conclusions: The only use of these PD50 values is to determine a sub-optimal concentration at which ~30% of the LDH is protected. While it is true that the number for the concentration of e.g., trehalose will be dramatically different if we were to use mass fraction units, the rest of the work and all our conclusions would be exactly the same.

      Additionally, our use of a molar ratio when discussing synergy is a direct result of the way we think about such synergy: Since the concentration of both protein and cosolute can change by orders of magnitude during drying, it is the copy numbers of both proteins and cosolute that are conserved in this process, and it is this unit that we think is important to the protective effect (rather than the partial molar volume, for example, which would be changing as the system dries).

      Reviewer #2 (Public Review):

      Summary:

      The paper aims to investigate the synergies between desiccation chaperones and small molecule cosolutes, and describe its mechanistic basis. The paper reports that IDP chaperones have stronger synergies with the cosolutes they coexist with, and in one case suggests that this is related to oligomerization propensity of the IDP.

      Strengths:

      The study uses a lot of orthogonal methods and the experiments are technically well done. They are addressing a new question that has not really been addressed previously.

      Weaknesses:

      The conclusions are based on a few examples and only partial correlations. While the data support mechanistic conclusions about the individual proteins studied, it is not clear that the conclusions can be generalized to the extent proposed by the authors due to small effect sizes, small numbers of proteins, and only partial correlations.

      Thank you for bringing this up. We agree that we should not generalize our results to other systems based on the evidence we have for the proteins used in our study. We have altered our discussion to highlight that this may apply to other IDPs, and that future experiments must be done to support this: “Additionally, we want to point out that our results cannot necessarily be generalized to all desiccation-related IDPs. More experiments will be needed to assess the relevance of cosolute effects to functional synergy and IDP folding in the context of desiccation and beyond. This remains an important future direction for the field.”

      The authors pose relevant questions and try to answer them through a systematic series of experiments that are all technically well-conducted. The data points are generally interpreted appropriately in isolation, however, I am a little concerned about a tendency to over-generalize their findings. Many of the experiments give negative or non-conclusive results (not a problem in itself), which means that the overall storyline is often based on single examples.

      We agree with the reviewer’s point. As mentioned earlier, we have modified our manuscript to reflect that our findings are based on the six proteins that we studied, and we can only speculate about other desiccation-related IDPs based on our results.

      For example, the central conclusion that IDPs interact synergistically with their endogenous co-solute (Figure 2E) is largely driven by one outlier from Arabidopsis. The rest are relatively close to the diagonal, and one could equally well suggest that the cosolutes affect the IDPs equally (which is also the conclusion in 1F).

      We appreciate the reviewer’s concern regarding our conclusion in Figures 2E and 1F. We would like to highlight that our conclusions that IDPs interact synergistically with their endogenous cosolute are based on statistical analysis. Our data shows that full-length proteins that were synergistic with both cosolutes are always significantly more synergistic with the endogenous cosolute (Fig. 2E, Fig. S2C-E). For example, the nematode protein is synergistic with both trehalose and sucrose, but is significantly more synergistic with trehalose, the endogenous nematode cosolute, than with sucrose (Fig S2D).

      This is not the case in 1F. In Fig. 1F, it is to note that not only are the points close to the diagonal, but most points are close to zero along both axes indicating no synergy. In fact, many points have negative synergy (antagonistic effect).

      We do recognize that our conclusions are based on the study of a specific set of six IDPs, and we do not want to overreach in our conclusions. To acknowledge this, we have now added text to emphasize that our conclusion is based on the six proteins that we tested, and we speculate it might apply to other systems: “Our data shows that these six IDPs synergize best with their endogenous cosolute to promote desiccation tolerance and we speculate that this may apply to other desiccation-related IDPs”.

      Similarly, the mechanistic explanations tend to be based on single examples. This is somewhat unavoidable as biophysical studies cannot be done on thousands of proteins, but the text should be toned down to reflect the strength of the conclusions.

      We acknowledge the reviewer’s concern. We have modified our manuscript accordingly to reflect that the mechanistic insights we gained are for the six proteins we tested empirically. These changes can be found throughout the manuscript. None of our experiments rule out the possibility that other LEA proteins or CAHS proteins may show different structural transitions, or that other IDPs may take on structural changes in response to the cosolutes.

      The central hypothesis revolves around the interplay between cosolutes and IDP chaperones comparing chaperones from species with different complements of cosolutes. In Table 1, it is mentioned that Arabidopsis uses both trehalose and sucrose as a cosolute, yet experiments are only done with either of these cosolutes and Arabidopsis is counted in the sucrose column. While it makes sense to compare them separately from a biophysical point of view, the ability to test the co-evolution of these systems is somewhat diminished by this. At least it should be discussed clearly.

      We appreciate the reviewer’s comment. As is mentioned in Table 1, Arabidopsis uses both trehalose and sucrose as cosolute. As such, we would predict that the Arabidopsis proteins would respond positively to both cosolutes. We would like to point out that Arabidopsis is counted in both trehalose and sucrose columns.

      We would also like to emphasize that multiple osmolytes exist in all organisms as a desiccation response and a simple IDP-cosolute system is far from a true recapitulation of a desiccating system. We have touched on this in the discussion and explicitly addressed the presence of both cosolutes in Arabidopsis and the need for further experiments to test for synergistic interactions using both or multiple mediators to illustrate synergy in multiple cosolute systems: “It is important to note that desiccation-tolerant organisms employ multiple cosolutes to counteract the effects of desiccation. The use of a single cosolute-IDP system in our in vitro experiments does not accurately mirror the diverse cosolute changes in desiccating systems. For instance, Arabidopsis seeds enrich both trehalose and sucrose, among other cosolutes. This demands the necessity of future experiments that incorporate both or multiple cosolutes and assess their synergistic effects, thus elucidating the intricate synergy in multi-cosolute systems.”

      It would be helpful if the authors could spell out the theoretical basis of how they quantify synergy. I understand what they are doing - and maybe there are no better ways to do it - but it seems like an approach with limitations. The authors identify one in that the calculation only works far from 100%, but to me, it seems there would be an equally strict requirement to be significantly above 0%. This would suggest that it is used wrongly in Figure 6H, where there is no effect of betaine (at least as far as the color scheme allows one to distinguish the different bars). In this case, the authors cannot really conclude synergy or not, it could be a straight non-synergistic inhibition by betaine.

      We appreciate the reviewer’s concern about the theoretical basis of how we quantify synergy. We do acknowledge the limitation of our LDH protection/synergy assay only produces interpretable data when our protectant/mixture yields protection levels within the range 0 and below 100%. Betaine was not protective in any of the concentrations we tested in this study. In line with the reviewer’s comment, we also acknowledge that within our experimental procedures, the inhibitory effects of betaine cannot be accurately captured, considering that LDH activity is ~0% without protectants. However, in our positive control in which LDH is co-incubated with betaine or betaine and CAHS D overnight in the hydrated state, we do not see a loss of enzymatic function of LDH nullifying a direct inhibition by betaine. We have added this text in our manuscript: “Glycine betaine on its own is not protective to LDH during drying nor does it inhibit LDH activity (Fig. S8E)”.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The conclusion in lines 195-196 seems overstated as the length dependence could be strongly changed in non-tested concentrations or those that are not possible experimentally. Notably, the IDPs in Figure 2 are around 200AA and only transition in the ranges tested for these peptides. Some other conclusions around this point seem a little overstated.

      We acknowledge the reviewer’s concern about the potential variability of the length dependence of the motifs at concentrations beyond those tested. However, we would like to highlight that higher concentrations of the tandem repeats (At22 and At44) inactivated LDH during the incubation period, as was seen with  the 11-mer motifs. This meant we could not evaluate protection by these motifs at concentrations beyond those plotted in Fig. 1A. This behavior was not observed for the full-length proteins. Regardless, we have toned down the conclusion in lines 195-196 to only reflect our results for the 2X and 4X repeats of At11 which now reads “We synthesized 2X (At22) and 4X (At44) tandem repeats of the A. thaliana 11-mer LEA_4 motif (At11). At22 and At44 show minimal potency in preserving in vitro LDH function during drying (Fig. 1A, Fig. S1A).”

      Reviewer #2 (Recommendations For The Authors):

      Figure 3: The focus on the ratio 222/210 seems inappropriate. That would indeed be useful for telling apart e.g. an alpha-to-beta transition, or formation of coiled coils. However, for a helix-to-coil equilibrium, which is likely to dominate here, it will not be especially sensitive as demonstrated e.g. by BSA in the dry state.

      We thank the reviewer for this comment. The use of ratios to measure structural transition is primarily to eliminate the effects of concentrations on the graph. It is clear from Fig. 3A and Fig. 3B that a structural transition occurs between the aqueous and the desiccated state. This is also very clear from the 222/210 ratio that we use (Fig. 3C), for every construct other than BSA - which indeed does not seem to undergo a dramatic structural change in the desiccated state. We have clarified this now in the description of the results: “Using this metric, all LEAs and CAHS D display a clear increase in helical propensity upon being desiccated (Fig. 3C). On the other hand, the helical propensity of BSA remains very similar to its hydrated state, indicating that no dramatic structural change took place (Fig. 3C).

      Minor comments:

      Figure 1F is not mentioned in the text.

      We have included Fig. 1F in the text.

      Some technical details missing for SAXS experiments.

      We thank the reviewer for pointing this out. We’ve added additional technical details to the main text, and directed readers to the methods for more information.

      It is well known that BSA is in a monomer-dimer equilibrium and this is normally taken into account in data analysis as this is often a calibration sample.

      We’ve calculated for BSA, and correlated the resulting data with synergy. This can be found in figure S7M and figure S8I.

      Line 247: "BSA, which comes from cows, which of course have no capacity for anhydrobiosis" - This seems like a rather strong statement without a reference. Did the authors consider reanimating beef jerky by soaking it in water? ;-)

      This is a great idea, and we hope to assign this project to our next rotation student.

      Minor suggestions for figures (that are generally very well done):

      Figure 1-4: Consider using the color scheme to indicate what the endogenous cosolutes are. Even though this info is in table one, it would still improve readability.

      We have added the colored organismal icons for all figures in which the plain black ones were previously used, including supplementals.

      Figure 4: consider adding some white space between the two concentration series of solutes to avoid being read as a single concentration series.

      We have updated this figure to clearly separate each sample by osmolyte.

      Figure 6H: Consider changing the colors for Betaine and CAHS D, so they are easier to distinguish. They are hard to tell apart on a printout.

      We have adjusted the colors for betaine and CAHS D.

    1. Authorr Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The objective of this investigation was to determine whether experimental pain could induce alterations in cortical inhibitory/facilitatory activity observed in TMS-evoked potentials (TEPs). Previous TMS investigations of pain perception had focused on motor evoked potentials (MEPs), which reflect a combination of cortical, spinal, and peripheral activity, as well as restricting the focus to M1. The main strength of this investigation is the combined use of TMS and EEG in the context of experimental pain. More specifically, Experiment 1 investigated whether acute pain altered cortical excitability, reflected in the modulation of TEPs. The main outcome of this study is that relative to non-painful warm stimuli, painful thermal stimuli led to an increase on the amplitude of the TEP N45, with a larger increase associated with higher pain ratings. Because it has been argued that a significant portion of TEPs could reflect auditory potentials elicited by the sound (click) of the TMS, Experiment 2 constituted a control study that aimed to disentangle the cortical response related to TMS and auditory activity. Finally, Experiment 3 aimed to disentangle the cortical response to TMS and reafferent feedback from muscular activity elicited by suprathreshold TMS applied over M1. The fact that the authors accompanied their main experiment with two control experiments strengthens the conclusion that the N45 TEP peak could be implicated in the perception of painful stimuli.

      Perhaps, the addition of a highly salient but non-painful stimulus (i.e. from another modality) would have further ruled out that the effects on the N45 are not predominantly related to intensity/saliency of the stimulus rather than to pain per se.

      We thank the reviewer for their comment on the possibility of whether stimulus intensity influences the N45 as opposed to pain per se. We agree that the ideal experiment would have included multiple levels of stimulation. We would argue, however, that that in Experiment 1, despite the same level of stimulus intensity for all participants (46 degrees), individual differences in pain ratings were associated with the change in the N45 amplitude, suggesting that the results cannot be explained by stimulus intensity, but rather by pain intensity.

      Reviewer #2 (Public Review):

      The authors have used transcranial magnetic stimulation (TMS) and motor evoked potentials (MEPs) and TMS-electroencephalography (EEG) evoked potentials (TEPs) to determine how experimental heat pain could induce alterations in these metrics.
In Experiment 1 (n = 29), multiple sustained thermal stimuli were administered over the forearm, with the first, second, and third block of stimuli consisting of warm but non-painful (pre-pain block), painful heat (pain block) and warm but non-painful (post-pain block) temperatures respectively. Painful stimuli led to an increase in the amplitude of the fronto-central N45, with a larger increase associated with higher pain ratings. Experiments 2 and 3 studied the correlation between the increase in the N45 in pain and the effects of a sham stimulation protocol/higher stimulation intensity. They found that the centro-frontal N45 TEP was decreased in acute pain. The study comes from a very strong group in the pain fields with long experience in psychophysics, experimental pain, neuromodulation, and EEG in pain. They are among the first to report on changes in cortical excitability as measured by TMS-EEG over M1. While their results are in line with reductions seen in motor-evoked responses during pain and effort was made to address possible confounding factors (study 2 and 3), there are some points that need attention. In my view the most important are:

      1) The method used to calculate the rest motor threshold, which is likely to have overestimated its true value : calculating highly abnormal RMT may lead to suprathreshold stimulations in all instances (Experiment 3) and may lead to somatosensory "contamination" due to re-afferent loops in both "supra" and "infra" (aka. less supra) conditions.

      The method used to assess motor threshold was the TMS motor threshold Assessment Tool (MTAT) which estimates motor threshold using maximum likelihood parametric estimation by sequential testing (Awiszus et al., 2003; Awiszus and Borckardt, 2011). This was developed as a quicker alternative for calculating motor threshold compared to the traditional Rossini-Rothwell method which involves determining the lowest intensity that evokes at least 5/10 MEPs of at least 50 microvolts. The method has been shown to achieve the same accuracy of determining motor threshold as the traditional Rossini-Rothwell method, but with fewer pulses (Qi et al., 2011; Silbert et al., 2013).

      We have now made this clearer in the manuscript:

      “The RMT was determined using the TMS motor thresholding assessment tool, which estimates the TMS intensity required to induce an MEP of 50 microvolts with a 50% probability using maximum likelihood parametric estimation by sequential testing (Awiszus, 2003; Awiszus & Borckardt, 2011). This method has been shown to achieve the accuracy of methods such as the Rossini-Rothwell method (Rossini et al., 1994; Rothwell et al., 1999) but with fewer pulses (Qi, Wu, & Schweighofer, 2011; Silbert, Patterson, Pevcic, Windnagel, & Thickbroom, 2013). The test stimulus intensity was set at 110% RMT to concurrently measure MEPs and TEPs during pre-pain, pain and post-pain blocks.”

      Therefore, the high RMTs in our study cannot be explained by the threshold assessment method. Instead, they are likely explained by aspects of the experimental setup that increased the distance between the TMS coil and the scalp, including the layer of foam placed over the coil, the EEG cap and the fact that the electrodes we used had a relatively thick profile. This has been explained in the paper:

      “We note that the relatively high RMTs are likely due to aspects of the experimental setup that increased the distance between the TMS coil and the scalp, including the layer of foam placed over the coil, the EEG cap and relatively thick electrodes (6mm)”

      Awiszus, F. (2003). TMS and threshold hunting. In Supplements to Clinical neurophysiology (Vol. 56, pp. 13-23). Elsevier.

      Qi, F., Wu, A. D., & Schweighofer, N. (2011). Fast estimation of transcranial magnetic stimulation motor threshold. Brain stimulation, 4(1), 50-57.

      Silbert, B. I., Patterson, H. I., Pevcic, D. D., Windnagel, K. A., & Thickbroom, G. W. (2013). A comparison of relative-frequency and threshold-hunting methods to determine stimulus intensity in transcranial magnetic stimulation. Clinical Neurophysiology, 124(4), 708-712.

      2) The low number of pulses used for TEPs (close to ⅓ of the usual and recommended)

      We agree that increasing the number of pulses can increase the signal to noise ratio. During piloting, participants were unable to tolerate the painful stimulus for long periods of time and we were required to minimize the number of pulses per condition.

      We note that there is no set advised number of trials in TMS-EEG research. According to the recommendations paper, the number of trials should be based on the outcome measure e.g., TEP peaks vs. frequency domain measures vs. other measures and based on previous studies investigating test-retest reliability (Hernandez-Pavon et al., 2023). The choice of 66 pulses per condition was based on the study by Kerwin et al., (2018) showing that optimal concordance between TEP peaks can be found with 60-100 TMS pulses delivered in the same run (as in the present study). The concordance was particularly higher for the N40 peak at prefrontal electrodes, which was the key peak and electrode cluster in our study. We have made this clearer:

      “Current recommendations (Hernandez-Pavon et al., 2023) suggest basing the number of TMS trials per condition on the key outcome measure (e.g., TEP peaks vs. frequency measures) and based on previous test-retest reliability studies. In our study the number of trials was based on a test-retest reliability study by (Kerwin, Keller, Wu, Narayan, & Etkin, 2018) which showed that 60 TMS pulses (delivered in the same run) was sufficient to obtain reliable TEP peaks (i.e., sufficient within-individual concordance between the resultant TEP peaks of each trial).”

      Further supporting the reliability of the TEP data in our experiment, we note that the scalp topographies of the TEPs for active TMS at various timepoints (Figures 5, 7 and 9) were similar across all three experiments, especially at 45 ms post-TMS (frontal negative activity, parietal-occipital positive activity).

      In addition to this, the interclass correlation coefficient (Two-way fixed, single measure) for the N45 to active suprathreshold TMS across timepoints for each experiment was 0.90 for Experiment 1 (across pre-pain, pain, post-pain time points), 0.74 for Experiment 2 (across pre-pain and pain conditions), and 0.95 for Experiment 3 (across pre-pain conditions). This suggests that even with the fluctuations in the N45 induced by pain, the N45 for each participant was stable across time, further supporting the reliability of our data. These ICCs are now reported in the supplementary material (subheading: Test-retest reliability of N45 Peaks).

      Hernandez-Pavon, J. C., Veniero, D., Bergmann, T. O., Belardinelli, P., Bortoletto, M., Casarotto, S., ... & Ilmoniemi, R. J. (2023). TMS combined with EEG: Recommendations and open issues for data collection and analysis. Brain Stimulatio, 16(3), 567-593

      Kerwin, L. J., Keller, C. J., Wu, W., Narayan, M., & Etkin, A. (2018). Test-retest reliability of transcranial magnetic stimulation EEG evoked potentials. Brain stimulation, 11(3), 536-544.

      Lack of measures to mask auditory noise.

      In TMS-EEG research, various masking methods have been proposed to suppress the somatosensory and auditory artefacts resulting from TMS pulses, such as white noise played through headphones to mask the click sound (Ilmoniemi and Kičić, 2010), and a thin layer of foam placed between the TMS coil and EEG cap to minimize the scalp sensation (Massimini et al., 2005). However, recent studies have shown that even when these methods are used, sensory contamination of TEPs is still present, as shown by studies that show commonalities in the signal between active and sensory sham conditions that mimic the auditory/somatosensory aspects of real TMS (Biabani et al., 2019; Conde et al., 2019; Rocchi et al., 2021). This has led many authors (Biabani et al., 2019; Conde et al., 2019) to recommend the use of sham conditions to control for sensory contamination. To separate the direct cortical response to TMS from sensory evoked activity, Experiment 2 included a sham TMS condition that mimicked the auditory/somatosensory aspects of active TMS to determine whether any alterations in the TEP peaks in response to pain were due to changes in sensory evoked activity associated with TMS, as opposed to changes in cortical excitability. Therefore, the lack of auditory masking does not impact the main conclusions of the paper.

      We have made this clearer:

      “… masking methods have been used to suppress these sensory inputs, (Ilmoniemi and Kičić, 2010; Massimini et al., 2005). However recent studies have shown that even when these methods are used, sensory contamination of TEPs is still present, as shown by commonalities in the signal between active and sensory sham conditions that mimic the auditory/somatosensory aspects of real TMS (Biabani et al., 2019; Conde et al., 2019; Rocchi et al., 2021). This has led many leading authors (Biabani et al., 2019; Conde et al., 2019) to recommend the use of sham conditions to control for sensory contamination.”

      Ilmoniemi, R. J., & Kičić, D. (2010). Methodology for combined TMS and EEG. Brain topography, 22, 233-248.

      Massimini, M., Ferrarelli, F., Huber, R., Esser, S. K., Singh, H., & Tononi, G. (2005). Breakdown of cortical effective connectivity during sleep. Science, 309(5744), 2228-2232.

      Biabani, M., Fornito, A., Mutanen, T. P., Morrow, J., & Rogasch, N. C. (2019). Characterizing and minimizing the contribution of sensory inputs to TMS-evoked potentials. Brain stimulation, 12(6), 1537-1552.

      Conde, V., Tomasevic, L., Akopian, I., Stanek, K., Saturnino, G. B., Thielscher, A., ... & Siebner, H. R. (2019). The non-transcranial TMS-evoked potential is an inherent source of ambiguity in TMS-EEG studies. Neuroimage, 185, 300-312.

      Rocchi, L., Di Santo, A., Brown, K., Ibáñez, J., Casula, E., Rawji, V., ... & Rothwell, J. (2021). Disentangling EEG responses to TMS due to cortical and peripheral activations. Brain stimulation, 14(1), 4-18.

      3) A supra-stimulus heat stimulus not based on individual HPT, that oscillates during the experiment and that lead to large variations in pain intensity across participants is unfortunate.

      The choice of whether to calibrate or fix stimulus intensity is a contentious question in experimental pain research. A recent discussion by Adamczyk et al., (2022) explores the pros and cons of each approach and recommends situations where one method may be preferred over the other. That paper suggests that the choice of the methodology is related to the research question – when the main outcome of the research is objective (neurophysiological measures) and researchers are interested in the variability in pain ratings, the fixed approach is preferrable. Given we explored the relationship between MEP/N45 modulation by pain and pain intensity, this question is better explored by using the same stimulus intensity for all participants, as opposed to calibrating the intensity to achieve a similar level of pain across participants.

      We have made this clearer:

      “Given we were interested in the individual relationship between pain and excitability changes, the fixed temperature of 46ºC ensured larger variability in pain ratings as opposed to calibrating the temperature of the thermode for each participant (Adamczyk et al., 2022).”.

      Adamczyk, W. M., Szikszay, T. M., Nahman-Averbuch, H., Skalski, J., Nastaj, J., Gouverneur, P., & Luedtke, K. (2022). To calibrate or not to calibrate? A methodological dilemma in experimental pain research. The Journal of Pain, 23(11), 1823-1832.

      So is the lack of report on measures taken to correct for a fortuitous significance (multiple comparison correction) in such a huge number of serial paired tests.

      Note that we used a Bayesian approach for all analyses as opposed to the traditional frequentist approach. In contrast to the frequentist approach, the Bayesian approach does not require corrections for multiple comparisons (Gelman et al., 2000) given that they provide a ratio representing the strength of evidence for the null vs. alternative hypotheses as opposed to accepting or rejecting the null hypothesis based on p-values. As such, throughout the paper, we frame our interpretations and conclusions based on the strength of evidence (e.g. anecdotal/weak, moderate, strong, very strong) as opposed to referring to the significance of the effects.

      Gelman A, Tuerlinckx F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational statistics, 15(3):373-90.

      Reviewer #3 (Public Review):

      The present study aims to investigate whether pain influences cortical excitability. To this end, heat pain stimuli are applied to healthy human participants. Simultaneously, TMS pulses are applied to M1 and TMS-evoked potentials (TEPs) and pain ratings are assessed after each TMS pulse. TEPs are used as measures of cortical excitability. The results show that TEP amplitudes at 45 msec (N45) after TMS pulses are higher during painful stimulation than during non-painful warm stimulation. Control experiments indicate that auditory, somatosensory, or proprioceptive effects cannot explain this effect. Considering that the N45 might reflect GABAergic activity, the results suggest that pain changes GABAergic activity. The authors conclude that TEP indices of GABAergic transmission might be useful as biomarkers of pain sensitivity.

      Pain-induced cortical excitability changes is an interesting, timely, and potentially clinically relevant topic. The paradigm and the analysis are sound, the results are mostly convincing, and the interpretation is adequate. The following clarifications and revisions might help to improve the manuscript further.

      1) Non-painful control condition. In this condition, stimuli are applied at warmth detection threshold. At this intensity, by definition, some stimuli are not perceived as different from the baseline. Thus, this condition might not be perfectly suited to control for the effects of painful vs. non-painful stimulation. This potential confound should be critically discussed.

      In Experiment 3, we also collected warmth ratings to confirm whether the pre-pain stimuli were perceived as different from baseline. This detail has been added to them methods:

      “In addition to the pain rating in between TMS pulses, we collected a second rating for warmth of the thermal stimulus (0 = neutral, 10 = very warm) to confirm that the participants felt some difference in sensation relative to baseline during the pre-pain block. This data is presented in the supplementary material”.

      We did not include these data in the initial submission but have now included it in the supplemental material. These data showed warmth ratings were close to 2/10 on average. This confirms that the non-painful control condition produced some level of non-painful sensation.

      2) MEP differences between conditions. The results do not show differences in MEP amplitudes between conditions (BF 1.015). The analysis nevertheless relates MEP differences between conditions to pain ratings. It would be more appropriate to state that in this study, pain did not affect MEP and to remove the correlation analysis and its interpretation from the manuscript.

      The interindividual relationship between changes in MEP amplitude and individual pain rating is statistically independent from the overall group level effect of pain on MEP amplitude. Therefore, conclusions for the individual and group level effects can be made independently.

      It is also important to note that in the pain literature, there is now increasing emphasis placed on investigating the individual level relationship between changes in cortical excitability and pain as opposed to the group level effect (Seminowicz et al., 2019; Summers et al., 2019). As such, it is important to make these results readily available for the scientific community.

      We have made this clearer:

      ‘As there is now increasing emphasis placed on investigating the individual level relationship between changes in cortical excitability and pain and not only the group level effect, (Chowdhury et al., 2022; Seminowicz et al., 2018; Seminowicz, Thapa, & Schabrun, 2019; Summers et al., 2019) we also investigated the correlations between pain ratings and changes in MEP (and TEP) amplitude”

      Chowdhury, N. S., Chang, W. J., Millard, S. K., Skippen, P., Bilska, K., Seminowicz, D. A., & Schabrun, S. M. (2022). The Effect of Acute and Sustained Pain on Corticomotor Excitability: A Systematic Review and Meta-Analysis of Group and Individual Level Data. The Journal of Pain, 23(10), 1680-1696.

      Summers, S. J., Chipchase, L. S., Hirata, R., Graven-Nielsen, T., Cavaleri, R., & Schabrun, S. M. (2019). Motor adaptation varies between individuals in the transition to sustained pain. Pain, 160(9), 2115-2125.

      Seminowicz, D. A., Thapa, T., & Schabrun, S. M. (2019). Corticomotor depression is associated with higher pain severity in the transition to sustained pain: a longitudinal exploratory study of individual differences. The Journal of Pain, 20(12), 1498-1506.

      3) Confounds by pain ratings. The ISI between TMS pulses is 4 sec and includes verbal pain ratings. Considering this relatively short ISI, would it be possible that verbal pain ratings confound the TEP? Moreover, could the pain ratings confound TEP differences between conditions, e.g., by providing earlier ratings when the stimulus is painful? This should be carefully considered, and the authors might perform control analyses.

      It is unlikely that the verbal ratings contaminated the TEP response as the subsequent TMS pulse was not delivered until the verbal rating was complete and given that each participant was cued by the experimenter to provide the pain rating after each pulse (rather than the participant giving the rating at any time). As such, it would not be possible for participants to provide earlier ratings to more painful stimuli.

      We have made this clearer:

      "To avoid contamination of TEPs by verbal ratings, the subsequent TMS pulse was not delivered until the verbal rating was complete, and the participant was cued by the experimenter to provide the pain rating after each pulse.”

      4) Confounds by time effects. Non-painful and painful conditions were performed in a fixed order. Potential confounds by time effects should be carefully considered.

      Previous research suggests that pain alters neural excitability even after pain has subsided. In a recent meta-analysis (Chowdhury et al., 2022) we found effect sizes of 0.55-0.9 for MEP reductions 0-30 minutes after pain had resolved. As such, we avoided intermixing pain and warm blocks given subsequent warm blocks would not serve as a valid baseline, as each subsequent warm block would have residual effects from the previous pain blocks.

      Chowdhury, N. S., Chang, W. J., Millard, S. K., Skippen, P., Bilska, K., Seminowicz, D. A., & Schabrun, S. M. (2022). The Effect of Acute and Sustained Pain on Corticomotor Excitability: A Systematic Review and Meta-Analysis of Group and Individual Level Data. The Journal of Pain, 23(10), 1680-1696.

      At the same time, given there was no conclusive evidence for a difference in N45 amplitude between pre-pain and post-pain conditions of Experiment 1 (Supplementary Figure 1), it is unlikely that the effect of pain was an artefact of time i.e., the explanation that successive thermal stimuli applied to the skin results an increase in the N45, regardless of whether the stimuli are painful or not. We will make this point in our next revision.

      We have discussed this issue:

      “Lastly, future research should consider replicating our experiment using intermixed pain and no pain blocks, as opposed to fixed pre-pain and pain blocks, to control for order effects i.e., the explanation that successive thermal stimuli applied to the skin results an increase in the N45 peak, regardless of whether the stimuli are painful or not. However, we note that there was no conclusive evidence for a difference in N45 peak amplitude between pre-pain and post-pain conditions of Experiment 1 (Supplementary Figure 1), suggesting it is unlikely that the observed effects were an artefact of time.”

      5) Data availability. The authors should state how they make the data openly available.

      We have uploaded the MEP, TEP and pain data on the Open science framework https://osf.io/k3psu/

      Reviewer #1 (Recommendations For The Authors):

      I think the study is quite solid and I only have very minor recommendations for the authors:

      • Introduction, p. 3: "Functional magnetic resonance imaging has helped us understand where in the brain pain is processed". This is an overstatement. fMRI provides us with potential biomarkers (e.g. "the pain signature"), but the specificity of these responses for pain is debated and we still do not know where in the brain pain is processed.

      We have amended to:

      “functional magnetic resonance imaging has assisted in the localization of brain structures implicated in pain processing”

      • Introduction, p. 5: "neural baseline" should be "neutral baseline"?

      We thank the reviewer for identifying this – this has now been amended.

      Reviewer #2 (Recommendations For The Authors):

      INTRODUCTION

      The introduction mentions how important extra-motor areas can be explored by TMS-EEG, then the effects of DLPFC rTMS on TEPs ... but you do not explore the DLPFC... Perhaps the introduction should be reframed.

      The current work explores cortical excitability throughout the brain (as shown in our cluster-based permutation and source localization analyses), so our investigations are in line with the introductions statement about the importance of studying non-motor areas.

      The reference to DLPFC rTMS was to highlight current existing research that has applied TMS-EEG to understand pain. It was not used as a methodological rationale to investigate the DLPFC in the present study. To make the research gap clearer, we state:

      “While these studies assist us in understanding whether TEPs might mediate rTMS-induced pain reductions, no study has investigated whether TEPs are altered in direct response to pain”

      Lignes 63-65 the term "TMS" is used to refer to motor corticospinal excitability measures, in contrast to TMS-EEG measures of TEPs. Then the authors come back to TMS-EEG and then again back to MEPs. This is rather confusing: TMS means TMS... the concept of MEP/ motor corticospinal excitability measures is not intuitive when using the term "TMS". I suggest using motor corticospinal excitability measures when referring to MEP/MEP-based measures of cortical excitability...) and M1TMS-EEG-evoked potentials (usually abbreviated to TEPs) to refer to TMS-EEG responses as measured here.

      Throughout the manuscript, we now use the term TEPs when referring to TMS-EEG measures, and MEPs when referring to TMS-EMG measure. The use of TEPs vs. MEPs will make it easier for readers to follow which measures we are referring to.

      Line 83: "As such, the precise origin of the pain mechanism cannot be localized." Please rephrase, the sentence conveys the idea that it is indeed possible to localize the origin of a pain mechanism with a different approach, and we know this is not currently possible, irrespective of the methodological setup.

      We have replaced this with:

      “This makes it unclear as to whether pain processes occur at the cortical, spinal or peripheral level.”

      How can one predetermine the temperature that will be perceived as painful by someone else, and not base it on individual HPT? This is against principles of psychophysics. Please comment. Attesting all participants had HPT below 46 is important, but then being stimulated at 46C when our HPT is 45C is different from when our HPT is 39C. Please explain why the pain intensity was not standardised based on individual HPT.

      Please refer to our response to the public review related to the issue

      Line 38: "if we had used an alternative design with blocks of warm stimuli intermixed with blocks of painful stimuli, the warm stimuli blocks would not serve as a valid non-painful baseline". I do not understand why it is not possible to have a pain-free baseline, followed by a pain/warm sequence.

      In our study, we had the choice of either intermixing blocks or to use a fixed sequence. Previous research suggests that pain alters neural excitability even after pain has subsided. In a recent meta-analysis (Chowdhury et al., 2022) we found effect sizes of 0.55-0.9 for MEP reductions 0-30 minutes after pain had resolved. As such, we avoided intermixing pain and warm blocks given subsequent warm blocks would not serve as a valid baseline, as each subsequent warm block would have residual effects from the previous pain blocks.

      We have updated the manuscript to be clearer about why we used a fixed sequence:

      “The pre-pain/pain/post-pain design has been commonly used in the TMS-MEP pain literature, as many studies have demonstrated strong changes in corticomotor excitability that persist beyond the painful period. Indeed, in a systematic review, we showed effect sizes of 0.55-0.9 for MEP reductions 0-30 minutes after pain had resolved (Chowdhury et al., 2022). As such, if we had used an alternative design with blocks of warm stimuli intermixed with blocks of painful stimuli, the warm stimuli blocks would not serve as a valid non-painful baseline”

      Chowdhury, N. S., Chang, W. J., Millard, S. K., Skippen, P., Bilska, K., Seminowicz, D. A., & Schabrun, S. M. (2022). The Effect of Acute and Sustained Pain on Corticomotor Excitability: A Systematic Review and Meta-Analysis of Group and Individual Level Data. The Journal of Pain, 23(10), 1680-1696.

      Please explain, and provide evidence that stimulation of people with predetermined temperatures is able to create warm/pain/warm sensations, without entraining pain in the last warm stimulation.

      A previous study by Dube et al. (2011) used sequences of warm (36°C), painful and neutral (32° C) and found that participants did not experience pain at any time when the temperature was at a warm temperature of 36°C. We have now cited this study:

      “Based on a previous study (Dubé & Mercier, 2011) which also used sequences of painful (50ºC) and warm (36°C) thermal stimuli, we did not anticipate that the stimulus in the pain block would entrain pain in the post-pain block”

      Dubé, J. A., & Mercier, C. (2011). Effect of pain and pain expectation on primary motor cortex excitability. Clinical neurophysiology, 122(11), 2318-2323.

      METHODS

      It is not clear if participants with chronic pain, present in 20% of the general population, were excluded. If they were, please provide "how" in methods.

      We excluded participants with a history or presence of acute/chronic pain. This has now been clarified:

      “Participants were excluded if they had a history of chronic pain condition or any current acute pain”

      Line 489: the definition of warm detection threshold is unusual, please provide a reference.

      We used an identical method to Furman et al., (2020). We have made the reference to this clearer: “Warmth, cold and pain thresholds were assessed in line with a previous study (Furman et al., 2020)”

      Furman, A. J., Prokhorenko, M., Keaser, M. L., Zhang, J., Chen, S., Mazaheri, A., & Seminowicz, D. A. (2020). Sensorimotor peak alpha frequency is a reliable biomarker of prolonged pain sensitivity. Cerebral Cortex, 30(12), 6069-6082.

      In Experiment 2, please explain how the lack of randomisation between "pre-pain" and "pain" may have influenced results.

      Given we tried to replicate Experiment 1’s methodology as close as possible (to isolate the source of the effect from Experiment 1) we chose to repeat the same sequence of blocks as Experiment 1: pre-pain followed by pain.

      Given there was no conclusive evidence for a difference in N45 amplitude between pre-pain and post-pain conditions of Experiment 1 (Supplementary Figure 1), it is unlikely that the effect of pain was an order effect i.e., the explanation that successive thermal stimuli applied to the skin results an increase in the N45, regardless of whether the stimuli are painful or not.

      We now discuss the issue of randomization:

      “Lastly, future research should consider replicating our experiment using intermixed pain and no pain blocks, as opposed to fixed pre-pain and pain blocks, to control for order effects i.e. the explanation that successive thermal stimuli applied to the skin results an increase in the N45 peak, regardless of whether the stimuli are painful or not. However, we note that there was no conclusive evidence for a difference in N45 peak amplitude between pre-pain and post-pain conditions of Experiment 1 (Supplementary Figure 1), suggesting it is unlikely that the observed effects were an artefact of time”

      Also, in Methods in general, disclose how pain intensity was assessed, and how.

      Pain intensity was assessed using a verbal rating scale (0 = no pain, and 10 = most pain imaginable). We have provided more detail:

      “During each 40 second thermal stimulus, TMS pulses were manually delivered, with a verbal pain rating score (0 = no pain, and 10 = worst pain imaginable) obtained between pulses. To avoid contamination of TEPs by verbal ratings, the subsequent TMS pulse was not delivered until the verbal rating was complete, and the participant was cued by the experimenter to provide the pain rating after each pulse”

      Please explain how auditory masking was made during data collection.

      Auditory masking noise was not played through the headphones, given that Experiment 2 controlled for auditory evoked potentials. We have made this clearer:

      “Auditory masking was not used. Instead, auditory evoked potentials resulting from the TMS click sound were controlled for in Experiment 2”

      Please explain if online TEP monitoring was used during data collection

      Online TEP monitoring was not available with our EEG software. We have made this clearer in the manuscript:

      “Online TEP monitoring was not available with the EEG software”

      Line 499: what is subthreshold TMS here? You are measuring TEPs, and not MEPs initially, so you may have a threshold for MEPs and TEPs, which are not the same.

      The intensity was calibrated relative to the MEP response (rather than TEP response) - this has now been clarified:

      “… and the inclusion of a subthreshold TMS (90% of resting motor threshold) condition intermixed within both the pre-pain and pain blocks.”

      Please provide a reference and a figure to illustrate the electric stimulation used in the sham procedure in Study 2

      The apparatus for the electrical stimulation is shown in Figure 7A, and was based on previous papers using electrical stimulation over motor cortex to simulate the somatosensory aspect of real TMS (Chowdhury et al., 2022; Gordon et al., 2022; Rocchi et al., 2021). We have made this clearer:

      “Electrical stimulation was based on previous studies attempting to simulate the somatosensory component of active TMS (Chowdhury et al., 2022; Gordon et al., 2022; Rocchi et al., 2021)”

      Gordon, P. C., Jovellar, D. B., Song, Y., Zrenner, C., Belardinelli, P., Siebner, H. R., & Ziemann, U. (2021). Recording brain responses to TMS of primary motor cortex by EEG–utility of an optimized sham procedure. Neuroimage, 245, 118708.

      Chowdhury, N. S., Rogasch, N. C., Chiang, A. K., Millard, S. K., Skippen, P., Chang, W. J., ... & Schabrun, S. M. (2022). The influence of sensory potentials on transcranial magnetic stimulation–Electroencephalography recordings. Clinical Neurophysiology, 140, 98-109.

      Rocchi, L., Di Santo, A., Brown, K., Ibánez, J., Casula, E., Rawji, V., ... & Rothwell, J. (2021). Disentangling EEG responses to TMS due to cortical and peripheral activations. Brain stimulation, 14(1), 4-18.

      It is not so common to use active electrodes for TMS-EEG. Please confirm the electrodes used and if they are c-ring TMS compatible and provide reference if otherwise (or actual papers recommending active ones)

      To be more specific about the electrode type we have indicated:

      “Signals were recorded from 63 TMS-compatible active electrodes (6mm height, 13mm width), embedded in an elastic cap (ActiCap, Brain Products, Germany), in line with the international 10-10 system”

      A paper directly comparing TEPs between active and passive electrodes found no difference between the two and concluded TEPs can be reliably obtained using active electrodes (Mancuso et al., 2021). There is also evidence that active electrodes have better signal quality than passive electrodes at higher impedance levels (Laszlo et al., 2014).

      This information has now been added to the paper:

      “Active electrodes result in similar TEPs (both magnitude and peaks) to more commonly used passive electrodes (Mancuso et al., 2021). There is also evidence that active electrodes have higher signal quality than passive electrodes at higher impedance levels (Laszlo, Ruiz-Blondet, Khalifian, Chu, & Jin, 2014).”

      There is a growing literature showing that monophonic pulses are not reliable for TEPs when compared to biphasic ones, please provide references. https://doi.org/10.1016/j.brs.2023.02.009

      The reference provided by the reviewer states that biphasic and monophasic pulses both have advantages and disadvantages, rather than stating “monophonic pulses are not reliable for TEPs”. While there is some evidence that the artefacts resulting from monophasic pulses are larger than biphasic pulses, the EEG signal still returns to baseline levels within 5ms of the TMS pulse (Rogasch et al., 2013). Moreover, one paper (Casula et al. 2018) found that the resultant TEPs evoked by monophasic pulses are larger than those resulting from biphasic pulses. The authors postulated that monophasic pulses are more effective at activating widespread cortical areas than biphasic pulses. Ultimately the reference provided by the reviewer concludes that “effect of pulse shape on TEPs has not been systematically investigated and more studies are needed”.

      Rogasch, N. C., Thomson, R. H., Daskalakis, Z. J., & Fitzgerald, P. B. (2013). Short-latency artifacts associated with concurrent TMS–EEG. Brain stimulation, 6(6), 868-876.

      Casula, E. P., Rocchi, L., Hannah, R., & Rothwell, J. C. (2018). Effects of pulse width, waveform and current direction in the cortex: A combined cTMS-EEG study. Brain stimulation, 11(5), 1063-1070.

      In most heads, a pulse in the PA direction is not obtained by a coil oriented 45o to the midline. The later induced later-medial pulses, good to obtain MEPs

      We followed previous studies measuring MEPs from the ECRB elbow muscle (Schabrun et al., 2016; de Martino et al., 2019) whereby the TMS coil handle was angled at 45 degrees relative to the midline in order to induce a posterior-anterior current. We are not aware of literature that shows that the 45 degrees orientation does not induce a posterior anterior current in most heads.

      Schabrun, S. M., Christensen, S. W., Mrachacz-Kersting, N., & Graven-Nielsen, T. (2016). Motor cortex reorganization and impaired function in the transition to sustained muscle pain. Cerebral Cortex, 26(5), 1878-1890.

      De Martino, E., Seminowicz, D. A., Schabrun, S. M., Petrini, L., & Graven-Nielsen, T. (2019). High frequency repetitive transcranial magnetic stimulation to the left dorsolateral prefrontal cortex modulates sensorimotor cortex function in the transition to sustained muscle pain. Neuroimage, 186, 93-102.

      The definition of RMT is (very) unusual. RMT provides small 50microV MEPs in 50% of times. If you obtain MEPs at 50microV you are supra threshold!

      The TMS motor threshold assessment tool calculates threshold in the same manner as other threshold tools – it calculates the intensity that elicits an MEP of 50 microvolts, 50% of the time. We have made this clearer:

      “The RMT was determined using the TMS motor thresholding assessment tool, which estimates the TMS intensity required to induce an MEP of 50 microvolts with a 50% probability using maximum likelihood parametric estimation by sequential testing (Awiszus and Borckardt, 2011). This method has been shown to achieve the accuracy of methods such as the Rossini-Rothwell method (Rossini et al., 1994; Rothwell et al., 1999) but with fewer pulses (Qi et al., 2011; Silbert et al., 2013).”

      Please inform the inter TMS pulse interval used of TEPs and whether they were randomly generated.

      The pulses were delivered manually – the interval was not randomly generated – as stated:

      “As TMS was delivered manually, there was no set interpulse interval. However, the 40 second stimulus duration allowed for 11 pulses for each heat stimulus …. (~ 4 seconds in between …)”

      Why have you stimulated suprathreshold on M1 when assessing TEP´s? The whole idea is that large TEPs can be obtained at lower intensities below real RMT and that prevents re-entering loops of somatosensory and joint movement inputs that insert "noise" to the TEPs.

      The suprathreshold intensity was used to concurrently measure MEPs during pre-pain, pain and post-pain blocks.

      We have made this clearer:

      “The test stimulus intensity was set at 110% RMT to concurrently measure MEPs and TEPs during pre-pain, pain and post-pain blocks.”

      The influence of re-afferent muscle activity was controlled for in Experiment 3.

      Did you assess pain intensity after each of the TEP pulses? Please discuss how such a cognitive task may have influenced results

      Pain intensity was assessed after each TMS pulse, as stated:

      “TMS pulses were manually delivered, with a verbal pain rating score (0 = no pain, and 10 = most pain imaginable) obtained between pulses”

      Reviewer 3 also brought up a concern of whether the verbal rating task might have influenced the TEPs. However, it is unlikely that the task contaminated the TEP response as the subsequent TMS pulse was not delivered until the verbal rating was complete and given that each participant was cued by the experimenter to provide the pain rating after each pulse (rather than the participant giving the rating at any time). We have made this clearer where we state:

      “To avoid contamination of TEPs by verbal ratings, the subsequent TMS pulse was not delivered until the verbal rating was complete, and the participant was cued by the experimenter to provide the pain rating after each pulse”

      The QST approach is unusual. Please confirm the sequence of CDT, WDT and HPT were not randomised and that no interval beyond 6sec were used. Proper references are welcome.

      In line with a previous study (Furman et al., 2020), the sequence of the CPT, WDT and HPT were not randomized, and the interval was not more than 6 seconds.

      We have made this clearer:

      “A total of three trials was conducted for each test to obtain an average, with an interstimulus interval of six seconds. The sequence of cold, warmth and pain threshold was the same for all participants (Furman et al. 2020)”

      Performing 60 pulses for TEPs is unusual, and against the minimum number in recommendations

      Please explain and comment.https://doi.org/10.1016/j.brs.2023.02.009

      Please refer to our previous response to this concern in the public reviews.

      Line 578: when you refer to "heat" the reader may confound warm/heat with heat meaning suprathreshold. Please revise the wording.

      We have now replaced the word heat stimulus with thermal stimulus.

      Why were Bayesian statistics used instead as frequentist ones?

      We have made this clearer:

      “Given we were interested in determining the evidence for pain altering TEP peaks in certain conditions (e.g., active TMS) and pain not altering TEP peaks in other conditions (sham TMS), we used a Bayesian approach as opposed to a frequentist approach, which considers the strength of the evidence for the alternative vs. null hypothesis”

      RESULTS

      There is a huge response with high power after 100ms- Please discuss if you believe auditory potentials may have influenced it.

      It is indeed possible that auditory potentials were present at 100ms. We now state:

      “Indeed, the signal at ~100ms post-TMS from Experiment 1 may reflect an auditory N100 response”

      The presence of auditory contamination does not impact the main conclusions of the paper given this was controlled for in Experiment 2.

      Please discuss how pain ranging from 3-10 may have influenced results in the "PAIN" situation,

      It is anticipated that the fixed thermal stimulus intensity approach would lead to large variations in pain ratings (Adamczyk et al., 2022). This is a recommended approach when the aim of the research is to determine relationships between neurophysiological measures and individual differences in pain sensitivity (Adamczyk et al., 2022). Indeed, we were interested in whether alterations in neurophysiological measures were associated with pain intensity, and we found that higher pain ratings were associated with smaller reductions in MEP amplitude and larger increases in N45 amplitude.

      Adamczyk, W. M., Szikszay, T. M., Nahman-Averbuch, H., Skalski, J., Nastaj, J., Gouverneur, P., & Luedtke, K. (2022). To calibrate or not to calibrate? A methodological dilemma in experimental pain research. The Journal of Pain, 23(11), 1823-1832.

      Please indicate if any participants offered pain after warm stimulation ( possible given secondary hyperalgesia after so many plateaux of heat stimulation).

      As stated in the results “All participants reported 0/10 pain during the pre-pain and post-pain blocks”.

      Please discuss the potential effects of having around 10% of "bad channels) In average per experiment per participants, its impacts in source localisation and in TEP measurement. Same for >5 epochs excluded by participant.

      The number of bad channels has been incorrectly stated by the reviewer as being 10% on average per experiment per participant, whereas the correct number of reported bad channels was 3%, 4.7% and 9.8% for Experiment 1, 2 and 3 respectively (see supplementary material). These numbers are below the accepted number of bad channels to interpolate (10%) in EEG pipelines (e.g., Debnath et al., 2020; Kayhan et al., 2022), so it is unlikely that our channel exclusions significantly influenced the quality of our source localization an TEP data.

      Debnath, R., Buzzell, G. A., Morales, S., Bowers, M. E., Leach, S. C., & Fox, N. A. (2020). The Maryland analysis of developmental EEG (MADE) pipeline. Psychophysiology, 57(6), e13580.

      Kayhan, E., Matthes, D., Haresign, I. M., Bánki, A., Michel, C., Langeloh, M., ... & Hoehl, S. (2022). DEEP: A dual EEG pipeline for developmental hyperscanning studies. Developmental cognitive neuroscience, 54, 101104.

      The number of excluded epochs is unlikely to have influenced the results given there was evidence for no difference in the number of rejected epochs between conditions (E1 BF10 = 0.145, E2 BF10 = 0.27, E3 BF10 = 0.169 – these BFs have now been reported in the supplementary material), and given the reliability of the N45 was high (see response to previous comment on the number of trials per condition).

      HPT of 42.9 {plus minus} 2.5{degree sign}C means many participants had HPT close to 46oC. Please discuss

      While some participants did indeed have pain thresholds close to 46 degrees, they nonetheless reported pain during the test blocks. While such participants may have reported less pain compared to others, we aimed for larger variations in pain ratings, given one of the research questions was to determine why pain intensity differs between individuals (given the same noxious stimulus). Indeed, we showed that this variation was meaningful (pain intensity was related to alterations in N45 and MEP amplitude).

      Please explain the sentence : line 139 "As such, if we had used an alternative design with blocks of warm stimuli intermixed with blocks of painful stimuli, the warm stimuli blocks would not serve as a valid non-painful baseline." I cannot see why.

      Please refer to our previous point on why the fixed sequence was included.

      And on the top of that heat was not individualised according to HPT.

      Please refer to our previous point on why we used a fixed stimulus approach.

      Sequences of warm/heat were not randomised. Please refer to our previous point on the why the sequence of blocks was not randomized.

      Line 197: "However, as this is the first study investigating the effects of experimental pain on TEPsamplitude, there were no a priori regions or timepoints of interest to compare betweenconditions". This is not clear. It means you have not measured the activity (size of the N45) under the electrode closest to the TMS coil? The TEP is supposed to by higher under the stimulated target/respective corresponding electrode…

      We are not aware of any current recommendations that state that the region of interest should be based on the site of stimulation. The advantage of TMS-EEG is that it allows characterisation of cortical excitability changes throughout the brain, not just the site of stimulation. We based our region of interest on a cluster-based permutation analysis, as recommended by Frömer, Maier, & Abdel Rahman, (2018)

      Frömer, R., Maier, M., & Abdel Rahman, R. (2018). Group-level EEG-processing pipeline for flexible single trial-based analyses including linear mixed models. Frontiers in neuroscience, 12, 48.

      Please explain where N45 values came from.

      The N45 was calculated using the TESA peak function (Rogasch et al., 2017) which identifies a data point which is larger/smaller than +/- 5 data points within a specified time window (e,g, 40-70ms post-TMS as in the present study). Where multiple peaks are found, the amplitude of the largest peak is returned. Where no peak is found, the amplitude at the specified latency is returned.

      Rogasch, N. C., Sullivan, C., Thomson, R. H., Rose, N. S., Bailey, N. W., Fitzgerald, P. B., ... & Hernandez-Pavon, J. C. (2017). Analysing concurrent transcranial magnetic stimulation and electroencephalographic data: A review and introduction to the open-source TESA software. Neuroimage, 147, 934-951.

      If only the cluster assessment was made please provide the comparison between P45 from the target TMS channel location in pre pain vs pain.

      We assume the reviewer is referring to the N45 rather than P45, and that by “target” TMS channel they are referring to the stimulated region.

      We first clarify that there is no “target” channel given the motor hotspot differs between individuals and so the channel that is closest to the site of stimulation will always differ.

      Secondly, as stated above, we are not aware of any current recommendations in TMS-EEG research that states that the region of interest for TEP analysis should be based on the site of stimulation. The advantage of TMS-EEG is that it allows characterisation of cortical excitability throughout the brain, not just the site of stimulation. If we based our ROI on the target channel only, we would lose valuable information about excitability changes occurring in other brain regions.

      Lastly, the N45 was localized at frontocentral electrodes, which is also where the cluster differences emerged. As such, we do not believe it would be informative to compare N45 peak amplitude at the region of stimulation.

      Also explain how correction for multiple comparisons was made

      Please refer to our response to the public review related to this issue.

      And report data from pain vs post-pain.

      The pain vs. post-pain comparisons are now reported in the Supplementary material.

      There is a strong possibility the response at N85 is an auditory /muscle signal. Please provide the location of this response.

      We have opted not to include the topography at 85ms in the main paper as it would introduce too much clutter into the figures (which are already very dense), and because the topography was very similar to the topography at 100ms. As an example, for the reviewer, in Author response image 1 we have shown the topography for the pre-pain condition of Experiment 1.

      Author response image 1.

      Experiment 2: I have a strong impression both active TEPs and sham TEPs were contaminated by auditory (and muscle) noise. Please explain.

      While it possible that auditory noise may have influenced TEPs in the active and sham groups, it does not impact the main conclusions of the paper, given that the purpose of the sham condition was to control for auditory and somatosensory stimulation resulting from TMS.

      While muscle activity may also affect have influenced the TEPs in active and sham conditions, we used fastICA in all conditions to suppress muscle activity. The fastICA algorithm (Rogasch et al., 2017) runs an independent component analysis on the data, and classifies components as neural, TMS-evoked muscle, eye movements and electrode noise, based on a set of heuristic thresholding rules (e.g., amplitude, frequency and topography of the components). Components classified as TMS-evoked muscle/other muscle artefacts are then removed. In the supplementary material, we further report that the number of components removed did not differ between conditions, suggesting the impact of muscle artefacts are not larger in some conditions vs. others.

      Rogasch, N. C., Sullivan, C., Thomson, R. H., Rose, N. S., Bailey, N. W., Fitzgerald, P. B., ... & Hernandez-Pavon, J. C. (2017). Analysing concurrent transcranial magnetic stimulation and electroencephalographic data: A review and introduction to the open-source TESA software. Neuroimage, 147, 934-951.

      Experiment 3: One interpretation can be that both supra and sub-threshold TMS were leading to somatosensory re-afferent responses, based on the way RMT was calculated, which hyper estimate the RMT and delivers in reality 2 types of supra-threshold stimulations. Please discuss

      Please refer to our response to the public review related to this issue.

      Please provide correlation between N45 size and MEPs amplitudes.

      This has now been included:

      “There was no conclusive evidence of any relationship between alterations in MEP amplitude during pain, and alterations in N100, N45 and P60 amplitude during pain (see supplementary material).”<br /> The supporting statistics for these analyses have been included in the supplementary material.

      DISCUSSION

      Line 303: " The present study determined whether acute experimental pain induces alterations in cortical inhibitory and/or facilitatory activity observed in TMS-evoked potentials".

      Well, no. The study assessed the N45, and was based on it. It did not really explore other metrics in a systematic fashion. P60 and N100 changes were not replicated in experiments 2 and 3..

      We assume the reviewer is stating that we did not assess other TEP peaks (such as the N15, P30 and P180). However, we did indeed assess these peaks in a systematic fashion. First, we identified the ROI by using a cluster-based analysis. This is a recommended approach when the ROI is unclear (Frömer, Maier, & Abdel Rahman, 2018). We then analysed the TEP representing the mean voltage across the electrodes within the cluster, and then identified any differences in all peaks between conditions (not just the N45). This has been made clearer in the manuscript.

      This has now been included:

      “For all experiments, the mean TEP waveform of any identified clusters from Experiment 1 were plotted, and peaks (e.g., N15, P30, N45, P60, N100) were identified using the TESA peak function (Rogasch et al., 2017)”

      Frömer, R., Maier, M., & Abdel Rahman, R. (2018). Group-level EEG-processing pipeline for flexible single trial-based analyses including linear mixed models. Frontiers in neuroscience, 12, 48.

      And the N45 is not related to facilitatory or inhibitory activity, it is a measure of an evoked response indicating excitability

      Evidence suggests the N45 is mediated by GABAAergic neurotransmission (inhibitory activity), as drugs which increase GABAA receptor activity increase the amplitude of the N45 (Premoli et al., 2014) and drugs which decrease GABAA receptor activity decrease the amplitude of the N45 (Darmani et al., 2016). As such, we and various other empirical papers (e.g., Bellardinelli et al., 2021; Noda et al., 2021; Opie at 2019 ) and review papers (Farzan & Bortoletto, 2022; Tremblay et al., 2019) have interpreted changes in the N45 peak as reflecting changes in cortical inhibitory/GABAA mediated activity.

      Premoli, I., Castellanos, N., Rivolta, D., Belardinelli, P., Bajo, R., Zipser, C., ... & Ziemann, U. (2014). TMS-EEG signatures of GABAergic neurotransmission in the human cortex. Journal of Neuroscience, 34(16), 5603-5612.

      Belardinelli, P., König, F., Liang, C., Premoli, I., Desideri, D., Müller-Dahlhaus, F., ... & Ziemann, U. (2021). TMS-EEG signatures of glutamatergic neurotransmission in human cortex. Scientific reports, 11(1), 8159.

      Darmani, G., Zipser, C. M., Böhmer, G. M., Deschet, K., Müller-Dahlhaus, F., Belardinelli, P., ... & Ziemann, U. (2016). Effects of the selective α5-GABAAR antagonist S44819 on excitability in the human brain: a TMS–EMG and TMS–EEG phase I study. Journal of Neuroscience, 36(49), 12312-12320.

      Noda, Y., Barr, M. S., Zomorrodi, R., Cash, R. F., Lioumis, P., Chen, R., ... & Blumberger, D. M. (2021). Single-pulse transcranial magnetic stimulation-evoked potential amplitudes and latencies in the motor and dorsolateral prefrontal cortex among young, older healthy participants, and schizophrenia patients. Journal of Personalized Medicine, 11(1), 54.

      Farzan, F., & Bortoletto, M. (2022). Identification and verification of a'true'TMS evoked potential in TMS-EEG. Journal of neuroscience methods, 378, 109651.

      Opie, G. M., Foo, N., Killington, M., Ridding, M. C., & Semmler, J. G. (2019). Transcranial magnetic stimulation-electroencephalography measures of cortical neuroplasticity are altered after mild traumatic brain injury. Journal of Neurotrauma, 36(19), 2774-2784.

      Tremblay, S., Rogasch, N. C., Premoli, I., Blumberger, D. M., Casarotto, S., Chen, R., ... & Daskalakis, Z. J. (2019). Clinical utility and prospective of TMS–EEG. Clinical Neurophysiology, 130(5), 802-844.

      Line 321: why have you not measured SEPs in experiment 3?

      It is not possible to directly measure the somatosensory evoked potentials resulting from a TMS pulse, given that the TMS pulse produces a range of signals including cortical activity, muscle/eye blink responses, auditory responses, somatosensory responses and other artefacts. While some researchers attempt to isolate the SEP from TMS using pre-processing methods such as ICA, others use control conditions such as sensory sham conditions (to control for the “tapping” artefact) or subthreshold intensity conditions (to control for reafferent muscle activity), as we have done in Experiment 2 and 3 of our study.

      We have now stated this in the manuscript:

      “As it is extremely challenging to isolate and filter these auditory and somatosensory evoked potentials using pre-processing pipelines, masking methods have been used to suppress these sensory inputs, (Ilmoniemi and Kičić, 2010; Massimini et al., 2005). However recent studies have shown that even when these methods are used, sensory contamination of TEPs is still present, as shown by commonalities in the signal between active and sensory sham conditions that mimic the auditory/somatosensory aspects of real TMS (Biabani et al., 2019; Conde et al., 2019; Rocchi et al., 2021). This has led many leading authors (Biabani et al., 2019; Conde et al., 2019) to recommend the use of sham conditions to control for sensory contamination”

      Line 365: SICI is dependent on GABAa activity. But the way the text is written if conveys the idea that TMS pulses "activate" GABA receptors, which is weird...Please rephrase.

      This has now been reworded.

      “SICI refers to the reduction in MEP amplitude to a TMS pulse that is preceded 1-5ms by a subthreshold pulse, with this reduction believed to be mediated by GABAA neurotransmission (Chowdhury et al., 2022)”

      Reviewer #3 (Recommendations For The Authors):

      -Key references Ye et al., 2022 and Che et al., 2019 need to be included in the reference list.

      These references have now been included in the reference list.

      -Heat pain stimuli and TMS stimuli are applied simultaneously. Sometimes the term "stimulus" is used without specifying whether it refers to TMS pulses or heat pain stimuli. Clarifying this whenever the word "stimulus" is used would enhance clarity for the reader.

      We have now clarified the use of the word “stimulus” throughout the paper.

      -Panels A-D in Figure 6 should be correctly labeled in the text and the figure legend.

      Figure 6 Panel labels have now been amended.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Watanuki et al used metabolomic tracing strategies of U-13C6-labeled glucose and 13C-MFA to quantitatively identify the metabolic programs of HSCs during steady-state, cell-cycling, and OXPHOS inhibition. They found that 5-FU administration in mice increased anaerobic glycolytic flux and decreased ATP concentration in HSCs, suggesting that HSC differentiation and cell cycle progression are closely related to intracellular metabolism and can be monitored by measuring ATP concentration. Using the GO-ATeam2 system to analyze ATP levels in single hematopoietic cells, they found that PFKFB3 can accelerate glycolytic ATP production during HSC cell cycling by activating the rate-limiting enzyme PFK of glycolysis. Additionally, by using Pfkfb3 knockout or overexpressing strategies and conducting experiments with cytokine stimulation or transplantation stress, they found that PFKFB3 governs cell cycle progression and promotes the production of differentiated cells from HSCs in proliferative environments by activating glycolysis. Overall, in their study, Watanuki et al combined metabolomic tracing to quantitatively identify metabolic programs of HSCs and found that PFKFB3 confers glycolytic dependence onto HSCs to help coordinate their response to stress. Even so, several important questions need to be addressed as below:

      We sincerely appreciate the constructive feedback from the reviewer. Additional experiments and textual improvements have been made to the manuscript based on your valuable suggestions. In particular, the major revisions are as follows: First, we investigated the extent to which other metabolites, not limited to the glycolytic system, affect metabolism in HSCs after 5-FU treatment. Second, the extent to which PFKFB3 contributes to the expansion of the HSPC pool in the bone marrow was adjusted to make the description more accurate based on the data. Finally, we overexpressed PFKFB3 in HSCs derived from GO-ATeam2 mice and confirmed that PRMT1 inhibition did not reduce the ATP concentration. We believe that the reviewer's valuable comments have further deepened our knowledge of the significance of glycolytic activation by PFKFB3 that we have demonstrated. Our response to the "Recommendations for Authors" is listed first, followed by our responses to all "Public Review" comments as follows:

      (Recommendations For The Authors):

      1. The methods used in key experiments should be described in more detail. For example, in the section on ‘Conversion of GO-ATeam2 fluorescence to ATP concentration’, the knock-in strategy for GO-ATeam2 should be described, as well as U-13C6 -glucose tracer assays.

      As per your recommendation, we have described the key experimental method in more detail in the revised manuscript: the GO-ATeam2 knock-in method was reported by Yamamoto et al. 1. Briefly, they used a CAG promoter-based knock-in strategy targeting the Rosa26 locus to generate GO-ATeam2 knock-in mice. A description of the method has been added to Methods and the reference has been added to the citation.

      For the U-13C6-glucose tracer analysis, the following points were added to describe the details of the analysis: First, a note was added that the number of cells used for the in vitro tracer analysis was the number of cells used for each sample. Second, we added the solution from which the cells were collected by sorting. We added that the incubation was performed under 1% O2 and 5% CO2.

      1. Confusing image label of Supplemental Figure 1H should be corrected in line 253.

      We have corrected the incorrect figure caption on line 217 in the revised manuscript to "Supplemental Figure 1N" as you suggested.

      1. The percentage of the indicated cell population should also be shown in Figure S1B.

      As you indicated, we have included the percentages for each population in Supplemental Figure 1B.

      Author response image 1.

      1. Please pay attention to the small size of the marks in the graph, such as in Figure S1F and so on.

      As you indicated, we have corrected the very small text contained in Figure S1F. Similar corrections have been made to Figures S1B and S5A.

      1. Please pay attention to the label of line in Figure S6A-D.

      Thank you very much for the advice. We have added line labels to the graph in the original Figures S6A–D.

      (Specific comments)

      1. Based on previous reports, the authors expanded the LSK gate to include as many HSCs as possible (Supplemental Figure 1B). However, while they showed the gating strategy on Day 6 after 5-FU treatment, results from other time-points should also be displayed to ensure the strict selection of time-points.

      Thank you for pointing this out. First, we did not enlarge the Sca-1 gating in this study. We apologize for any confusion caused by the incomplete description. The gating of c-Kit is based on that shown by Umemoto et al (Figure EV1A) 2, who used 250 mg/kg 5-FU, so their c-Kit reduction is more pronounced than ours.

      We followed this study and compared c-Kit expression in Lin-Sca-1+CD150+CD48-EPCR+ gates to BMMNCs on day 6 after 5-FU administration (150 mg/kg). The results are shown below.

      Author response image 2.

      Since the MFI of c-Kit was downregulated, we used gating that extended the c-Kit gate to lower-expression regions on day 6 after 5-FU administration (revised Figure S1C). At other time points, LSK gating was the same as in the PBS-treated group, as noted in the Methods.

      1. In Figure 1, the authors examined the metabolite changes on Day 6 after 5-FU treatment. However, it is important to consider whether there are any dynamic adjustments to metabolism during the early and late stages of 5-FU treatment in HSCs compared to PBS treatment, in order to coordinate cell homeostasis despite no significant changes in cell cycle progression at other time-points.

      Thank you for pointing this out. Below are the results of the GO-ATeam2 analysis during the very early phase (day 3) and late phase (day 15) after 5-FU administration (revised Figures S7A–H).

      Author response image 3.

      In the very early phase, such as day 3 after 5-FU administration, cell cycle progression had not started (Figure S1C) and was not preceded by metabolic changes. Meanwhile, in the late phase, such as day 15 after 5-FU administration, the cell cycle and metabolism returned to a steady state. In summary, the timing of the metabolic changes coincided with that of cell cycle progression. This point is essential for discussing the cell cycle-dependent metabolic system of HSCs and has been newly included in the Results (page 11, lines 321-323).

      1. As is well known, ATP can be produced through various pathways, including glycolysis, the TCA cycle, the PPP, NAS, lipid metabolism, amino acid metabolism and so on. Therefore, it is important to investigate whether treatment with 5-FU or oligomycin affects these other metabolic pathways in HSCs.

      As the reviewer pointed out, ATP production by systems other than the glycolytic system of HSCs is also essential. In this revised manuscript, we examined the effects of the FAO inhibitor (Etomoxir, 100 µM) and the glutaminolysis inhibitor 6-diazo-5-oxo-L-norleucine (DON, 2mM) alone or in combination on the ATP concentration of HSCs after PBS or 5-FU treatment. As shown below, there was no apparent decrease in ATP concentration (revised Figures S7J–M).

      Author response image 4.

      Fatty acid β-oxidation activity was also measured in 5-FU-treated HSCs using the fluorescent probe FAOBlue and was unchanged compared to PBS-treated HSCs (revised Figure S7N).

      Author response image 5.

      Notably, the addition of 100 µM etomoxir plus glucose and Pfkfb3 inhibitors resulted in a rapid decrease in ATP concentration in HSCs (revised Figures S7O–P). This indicates that etomoxir partially mimics the effect of oligomycin, suggesting that at a steady state, OXPHOS is driven by FAO, but can be compensated by the acceleration of the glycolytic system by Pfkfb3. Meanwhile, the exposure of HSCs to Pfkfb3 inhibitors in addition to 2 mM DON, which is an extremely high dose considering that the Ki value of DON for glutaminase is 6 µM, did not reduce ATP (revised Figures S7O–P). This suggests that ATP production from glutaminolysis is limited in HSCs at a steady state.

      Author response image 6.

      These points suggest that OXPHOS is driven by fatty acids at a steady state, but unlike the glycolytic system, FAO is not further activated by HSCs after 5-FU treatment. The results of these analyses and related descriptions are included in the revised manuscript (page 11, lines 332-344).

      1. In part 2, they showed that oligomycin treatment of HSCs exhibited activation of the glycolytic system, but what about the changes in ATP concentration under oligomycin treatment? Are other metabolic systems affected by oligomycin treatment?

      Thank you for your thoughtful comments. The relevant results we have obtained so far with the GO-ATeam2 system are as follows: First, OXPHOS inhibition in the absence of glucose significantly decreases the ATP concentration of HSCs (Figure 4C). Meanwhile, OXPHOS inhibition in the presence of glucose maintains the ATP concentration of HSCs (Figure 5B). Since it is difficult to imagine a completely glucose-free environment in vivo, it is thought that ATP concentration is maintained by the acceleration of the glycolytic system even under hypoxic or other conditions that inhibit OXPHOS.

      Meanwhile, glucose tracer analysis shows that OXPHOS inhibition suppresses nucleic acid synthesis (NAS) except for the activation of the glycolytic system (Figures 2C–F). This is because phosphate groups derived from ATP are transferred to nucleotide mono-/di-phosphate in NAS, but OXPHOS, the main source of ATP production, is impaired, along with the enzyme conjugated with OXPHOS in the process of NAS (dihydroorotate dehydrogenase, DHODH). We have added a new paragraph in the Discussion section (page 17, lines 511-515) to provide more insight to the reader by summarizing and discussing these points.

      1. In Figure 5M, it would be helpful to include a control group that was not treated with 2-DG. Additionally, if Figure 5L is used as the control, it is unclear why the level of ATP does not show significant downregulation after 2-DG treatment. Similarly, in Figure 5O, a control group with no glucose addition should be included.

      Thank you for your advice. The experiments corresponding to the control groups in Figures 5M and O were in Figures 5L and N, respectively, but we have combined them into one graph (revised Figures 5L–M). The results more clearly show that PFKFB3 overexpression enhances sensitivity to 2-DG, but also enhances glycolytic activation upon oligomycin administration.

      Author response image 7.

      1. In this study, their findings suggest that PFKFB3 is required for glycolysis of HSCs under stress, including transplantation. In Figure 7B, the results showed that donor-derived chimerism in PB cells decreased relative to that in the WT control group during the early phase (1 month post-transplant) but recovered thereafter. Although the transplantation cell number is equal in two groups of donor cells, it is unclear why the donor-derived cell count decreased in the 2-week post-transplantation period and recovered thereafter in the Pfkgb3 KO group. Therefore, they should provide an explanation for this. Additionally, they only detected the percentage of donor-derived cells in PB but not from BM, which makes it difficult to support the argument for Increasing the HSPC pool.

      As pointed out by the reviewer, it is interesting to note that the decrease in peripheral blood chimerism in the PFKFB3 knockout is limited to immediately after transplantation and then catches up with the control group (Figure 7B). We attribute this to the fact that HSPC proliferation is delayed immediately after transplantation in PFKFB3 deficiency, but after a certain time, PB cells produced by the delayed proliferating HSPCs are supplied. In support of this, the PFKFB3 knockout HSPCs did not exhibit increased cell death after transplantation (Figure 7K), while a delayed cell cycle was observed (Figures 7G–J). A description of this point has been added to the Discussion (page 19, lines 573-579).

      In addition, the knockout efficiency in bone marrow cells could not be verified because the number of cells required for KO efficiency analysis was not available. Therefore, we have added a statement on this point and have toned down our overall claim regarding the extent to which PFKFB3 is involved in the expansion of the HSPC pool (page 15, lines 474-476).

      1. In Figure 7E, they collected the BM reconstructed with Pfkfb3- or Rosa-KO HSPCs two months after transplantation, and then tested their resistance to 5-FU. However, the short duration of the reconstruction period makes it difficult to draw conclusions about the effects on steady-state blood cell production.

      We agree that we cannot conclude from this experiment alone that PFKFB3 is completely unnecessary in steady state because, as you pointed out, the observation period of the experiment in Figure 7E is not long. We have toned down the claim by stating that PFKFB3 is only less necessary in steady-state HSCs compared to proliferative HSCs (page 15, lines 460-461).

      1. PFK is allosterically activated by PFKFB, and other members of the PFKFB family could also participate in the glycolytic program. Therefore, they should investigate their function in contributing to glycolytic plasticity in HSCs during proliferation. Additionally, they should also analyze the protein expression and modification levels of other members. Although PFKFB3 is the most favorable for PFK activation, the role of other members should also be explored in HSC cell cycling to provide sufficient reasoning for choosing PFKFB3.

      To further justify why we chose PFKFB3 among the PFKFB family members, we reviewed our data and the publicly available Gene Expression Commons (GEXC) 3. PFKFB3 is the most highly expressed member of the PFKFB family in HSCs (revised Figure 4F), and its expression increases with proliferation (Author response image 9). In addition to this, we have also cited the literature 4 indicating that AZ PFKFB3 26 is a Pfkfb3-specific inhibitor that we used in this paper, and added a note to this point (that it is specific) (page 11, lines 327-329). Through these revisions, we sought to strengthen the rationale for Pfkfb3 as the primary target of the analysis.

      Author response image 8.

      Author response image 9.

      1. In this study, the authors identified PRMT1 as the upstream regulator of PFKFB3 that is involved in the glycolysis activation of HSCs. However, PRMT1 is also known to participate in various transcriptional activations. Thus, it is important to determine whether PRMT1 affects glycolysis through transcriptional regulation or through its direct regulation of PFKFB3? Additionally, the authors should investigate whether PRMT1i inhibits ATP production in normal HSCs. Moreover, could we combine Figure 6I and 6J for analysis. Finally, the authors could conduct additional rescue experiments to demonstrate that the effect of PRMT1 inhibitors on ATP production can be rescued by overexpression of PFKFB3.

      Although PRMT1 inhibition reduced m-PFKFB3 levels in HSCs, 5-FU treatment also reduced or did not alter Pfkfb3 transcript levels (Figures 6B, G) and the expression of genes such as Hoxa7/9/10, Itga2b, and Nqo1, which are representative transcriptional targets of PRMT1, in proliferating HSCs after 5-FU treatment (revised Figure S9).

      Author response image 10.

      These results suggest that PRMT1 promotes PFKFB3 methylation, which increases independently of transcription in HSCs after 5-FU treatment.

      A summary analysis of the original Figures 6I and 6J is shown below (revised Figure 6I).

      Author response image 11.

      Finally, we tested whether the inhibition of the glycolytic system and the decrease in ATP concentration due to PRMT1 inhibition could be rescued by the retroviral overexpression of PFKFB3. We found that PFKFB3 overexpression did not decrease the ATP concentration in HSCs due to PRMT1 inhibition (revised Figure 6J). Therefore, PFKFB3 overexpression mitigated the decrease in ATP concentration caused by PRMT1 inhibition. These data and related statements have been added to the revised manuscript (page 14, lines 427-428).

      Author response image 12.

      Reviewer #2:

      In the manuscript Watanuki et al. want to define the metabolic profile of HSCs in stress/proliferative (myelosuppression with 5-FU), and mitochondrial inhibition and homeostatic conditions. Their conclusions are that during proliferation HSCs rely more on glycolysis (as other cell types) while HSCs in homeostatic conditions are mostly dependent on mitochondrial metabolism. Mitochondrial inhibition is used to demonstrate that blocking mitochondrial metabolism results in similar features of proliferative conditions.

      The authors used state-of-the-art technologies that allow metabolic readout in a limited number of cells like rare HSCs. These applications could be of help in the field since one of the major issues in studying HSCs metabolism is the limited sensitivity of the“"standard”" assays, which make them not suitable for HSC studies.

      However, the observations do not fully support the claims. There are no direct evidence/experiments tackling cell cycle state and metabolism in HSCs. Often the observations for their claims are indirect, while key points on cell cycle state-metabolism, OCR analysis should be addressed directly.

      We sincerely appreciate the reviewer's constructive comments. Thank you for highlighting the importance of the highly sensitive metabolic assay developed in this study and the findings based on it. Meanwhile, the reviewer's comments have made us aware of areas where we can further improve this manuscript. In particular, in the revised manuscript, we have performed further studies to demonstrate the link between the cell cycle and metabolic state. Specifically, we further subdivided HSCs by the uptake of in vivo-administered 2-NBDG and performed cell cycle analysis. Next, HSCs after PBS or 5-FU treatment were analyzed by a Mito Stress test using the Seahorse flux analyzer, including ECAR and OCR, and a more direct relationship between the cell cycle state and the metabolic system was found. We believe that the reviewer's valuable suggestions have helped us clarify more directly the importance of the metabolic state of HSCs in response to cell cycle and stress that we wanted to show and emphasize the usefulness of the GO-ATeam2 system. Our response to "Recommendations For The Authors" is listed first, followed by our responses to all comments in "Public Review" as follows:

      (Recommendations For The Authors):

      In general, I believe it would be important:

      1. to directly associate cell cycle state with metabolic state. For example, by sorting HSC (+/- 5FU) based on their cell cycle state (exploiting the mouse model presented in the manuscript or by defining G0/G1/G2-S-M via Pyronin/Hoechst staining which allow to sort live cells) and follow the fate of radiolabeled glucose.

      Thank you for raising these crucial points. Unfortunately, it was difficult to perform the glucose tracer analysis by preparing HSCs with different cell cycle states as you suggested due to the amount of work involved. In particular, in the 5-FU group, more than 60 mice per group were originally required for an experiment, and further cell cycle-based purification would require many times that number of mice, which we felt was unrealistic under current technical standards. As an alternative, we administered 2-NBDG to mice and fractionated HSCs at the 2-NBDG fluorescence level for cell cycle analysis. The results are shown below (revised Figure S1M). Notably, even in the PBS-treated group, HSCs with high 2-NBDG uptake were more proliferative than those with low 2-NBDG uptake and are comparable to HSCs after 5-FU treatment, although the overall population of HSCs exiting the G0 phase and entering the G1 phase increased after 5-FU treatment. In both PBS/5-FU-treated groups, these large differences in cell cycle glucose utilization suggest a direct link between HSC proliferation and glycolysis activation. If a more sensitive type of glucose tracer analysis becomes available in the future, it may be possible to directly address the reviewer's comments. We see this as a topic for the future. The descriptions of the above findings and perspectives have been added to the Results and Discussion section (page 7, lines 208-214, page 20, lines 607-610).

      Author response image 13.

      1. Use other radio labeled substrates (fatty acid, glutamate)

      Thank you very much for your suggestion. While this is an essential point for future studies, we believe it is not the primary focus of the paper. We are planning another research project on tracer analysis using labeled fatty acids and glutamates, which we will report on in the near future. We have clearly stated in the Abstract and Introduction of the revised manuscript, that the focus of this study is on changes in glucose metabolism when HSCs are stressed (page 3, line 75 and 87, page 5, lines 135).

      Instead, we added the following analyses of metabolic changes in fatty acids and glutamate using the GO-ATeam2 system. HSCs derived from GO-ATeam2 mice treated with PBS or 5-FU were used to measure changes in ATP concentrations after exposure to the fatty acid beta-oxidation (FAO) inhibitor etomoxir and the glutaminolysis inhibitor 6-diazo-5-oxo-L-norleucine (DON). Etomoxir was used at 100 µM, a concentration that inhibits FAO without inhibiting mitochondrial electron transfer complex I, as previously reported 5. DON was used at 2 mM, a concentration that sufficiently inhibits the enzyme as the Ki for glutaminase is 6 µM. In this experiment, etomoxir alone, DON alone, or etomoxir and DON in combination did not decrease the ATP concentration of HSCs in the PBS and 5-FU groups (revised Figures S7J–M), suggesting that FAO and glutaminolysis were not essential for ATP production in HSCs in the short term. Thus, according to the analysis using the GO-Ateam2 system, HSCs exposed to acute stresses change the efficiency of glucose utilization (accelerated glycolytic ATP production) rather than other energy sources. Since there are reports that FAO and glutaminolysis are required for HSC maintenance in the long term 5,6, compensatory pathways may be able to maintain ATP levels in the short term. A description of these points has been added to the Discussion (page 11, lines 332-344).

      Author response image 14.

      1. Include OCR analyses.

      In addition to the ECAR data of the Mito Stress test (original Figures 2G–H), OCR data were added to the revised manuscript (revised Figures 2H, S3D). Compared to c-Kit+ myeloid progenitors (LKS- cells), HSC showed a similar increase in ECAR, while the decrease in OCR was relatively limited. A possible explanation for this is that glycolytic and mitochondrial metabolism are coupled in c-Kit+ myeloid progenitors, whereas they are decoupled in HSCs. This is also suggested by the glucose plus oligomycin experiment in Figures 5B, C, and S6A–D (orange lines). In summary, in HSCs, glycolytic and mitochondrial ATP production are decoupled and can maintain ATP levels by glycolytic ATP production alone, whereas in progenitors including GMPs, the two ATP production systems are constantly coupled, and glycolysis alone cannot maintain ATP concentration. We have added descriptions of these points in the Results and Discussion section (page 8, lines 240-243, page 18, lines 558-561).

      Author response image 15.

      Next, a Mito Stress test was performed using HSCs derived from PBS- or 5-FU-treated mice in the presence or absence of oligomycin (revised Figures 1G–H, S3A–B). Without oligomycin treatment, ECAR in 5-FU-treated HSCs was higher than in PBS-treated HSCs, and OCR was unchanged. Oligomycin treatment increased ECAR in both PBS- and 5-FU-treated HSCs, whereas OCR was unchanged in PBS-treated HSCs, but significantly decreased in 5-FU-treated HSCs. Changes in ECAR in response to oligomycin differed between HSC proliferation or differentiation: ECAR increased in 5-FU-treated HSCs but not in LKS- progenitors (original Figures 2G–H). This suggests a metabolic feature of HSCs in which the coupling of OXPHOS with glycolysis seen in LKS- cells is not essential in HSCs even after cell cycle entry. The results and discussion of this experiment have been added to page 7, lines 194-201 and page 18, lines 558-561).

      Author response image 16.

      1. Correlate proliferation-mitochondrial inhibition-metabolic state

      We agree that it is important to clarify this point. First, OXPHOS inhibition and proliferation similarly accelerate glycolytic ATP production with PFKFB3 (Figures 4G, I, and 5F–I). Meanwhile, oligomycin treatment rapidly decreases ATP in HSCs with or without 5-FU administration (Figure 4C). These results suggest that OXPHOS is a major source of ATP production both at a steady state and during proliferation, even though the analysis medium is pre-saturated with hypoxia similar to that in vivo. This has been added to the Discussion section (page 17, lines 520-523).

      1. Tune down the claim on HSCs in homeostatic conditions since from the data it seems that HSCs rely more on anaerobic glycolysis.

      Thanks for the advice. The original Figures S2C, D, F, and G show that HSC is dependent on the anaerobic glycolytic system even at a steady state, so we have toned down our claims (page 7, lines 192-194).

      1. For proliferative HSCs mitochondrial are key. When you block mitochondria with oligomycin there's the biggest drop in ATP.

      In the revised manuscript, we have tried to highlight the key findings that you have pointed out. First, we mentioned in the Discussion (page 17, lines 523-525) that previous studies suggested the importance of mitochondria in proliferating HSCs. Meanwhile, the GO-ATeam2 and glucose tracer analyses in this study newly revealed that the glycolytic system activated by PFKFB3 is activated during the proliferative phase, as shown in Figure 4C. We also confirmed that mitochondrial ATP production is vital in proliferating HSCs, and we hope to clarify the balance between ATP-producing pathways and nutrient sources in future studies.

      1. To better clarify this point authors, authors should do experiments in hypoxic conditions and compare it to oligomycin treatment and showing that mito-inhibition acts differently on HSCs (considering that all these drugs are toxic for mitochondria and induce rapidly stress responses ex: mitophagy).

      We apologize for any confusion caused by not clearly describing the experimental conditions. As pointed out by the reviewer, we also recognize the importance of experiments in a hypoxic environment. All GO-ATeam2 analyses were performed in a medium saturated sufficiently under hypoxic conditions and analyzed within minutes, so we believe that the medium did not become oxygenated (page S5-S6, lines 160-163 in the Methods). Despite being conducted under such hypoxic conditions, the substantial decrease in ATP after oligomycin treatment is intriguing (original Figures 4C, 5B, 5C). The p50 value of mitochondria (the partial pressure of oxygen at which respiration is half maximal) is 0.1 kPa, which is less than 0.1% of the oxygen concentration at atmospheric pressure 7. Thus, biochemically, it is consistent that OXPHOS can maintain sufficient activity even in a hypoxic environment like the bone marrow. We are currently embarking on a study to determine ATP concentration in physiological hypoxic conditions using in vivo imaging within the bone marrow, which we hope to report in a separate project. We have discussed these points, technical limitations, and perspectives in the Discussion section (page 20, lines 610-612).

      • In Figure 1 C, D, E and F, the comparison should be done as unpaired t test and the control group should not be 1 as the cells comes from different individuals.

      Thank you very much for pointing this out. We have reanalyzed and revised the figures (revised Figures 1C–F)

      Author response image 17.

      • In Figure S2A, the post-sorting bar of 6PG, R5P and S7P are missing.

      Metabolites below the detection threshold (post-sorting samples of 6PG, R5P, and S7P) are now indicated as N.D. (not detected) (revised Figure S2A).

      Author response image 18.

      • In the 2NBDG experiments, authors should add the appropriate controls, since it has been shown that 2NBDG cellular uptake do not correctly reflect glucose uptake (Sinclair LV, Immunometabolism 2020) (a cell type dependent variations) thus inhibitors of glucose transporters should be added as controls (cytochalasin B; 4,6-O-ethylidene-a-D-glucose) it would be quite challenging to test it in vivo but it would be sufficient to show that in vitro in the different HSPCs analyzed.

      We appreciate the essential technical point raised by the reviewer. In the revised manuscript, we performed a 2-NBDG assay with cytochalasin B and phloretin as negative controls. After PBS treatment, 2-NBDG uptake was higher in 5-FU-treated HSCs compared to untreated HSCs. This increase was inhibited by both cytochalasin B and phloretin. In PBS-treated HSCs, cytochalasin B did not downregulate 2-NBDG uptake, whereas phloretin did. Although cytochalasin B inhibits glucose transporters (GLUTs), it is also an inhibitor of actin polymerization. Therefore, its inhibitory effect on GLUTs may be weaker than that of phloretin. We have revised the figure (revised Figure S1L) and added the corresponding description (page 7, lines 207-208).

      Author response image 19.

      • S5C: authors should show the cell number for each population. If there's a decreased in % in Lin- that will be reflected in all HSPCs. Comparing the proportion of the cells doesn't show the real impact on HSPCs.

      Thank you for your insightful point. In the revision, we compared the numbers, not percentages, of HSPCs and found no difference in the number of cells in the major HSPC fractions in Lin-. The figure has been revised (revised Figure S6C) and the corresponding description has been added (page 10, lines 296-299).

      Author response image 20.

      Minor:

      1. In S1 F-G is not indicated in which day post 5FU injection is done the analysis. I assume on day 6 but it should be indicated in the figure legend and/or text.

      Thank you for pointing this out. As you assumed, the analysis was performed on day 6. The description has been added to the legend of the revised Figure S1G.

      1. S1K is not described in the text. What are proliferative and quiescence-maintaining conditions? The analyses are done by flow using LKS SLAM markers after culture? How long was the culture?

      Thank you for your comments. First, the figure citation on line 250 was incorrect and has been corrected to Figure S1N. Regarding the proliferative and quiescence-maintaining conditions, we have previously reported on these 8. In brief, these are culture conditions that maintain HSC activity at a high level while allowing for the proliferation or maintenance of HSCs in quiescence, achieved by culturing under fatty acid-rich, hypoxic conditions with either high or low cytokine concentrations. Analysis was performed after one week of culture, with the HSC number determined by flow cytometry based on the LSK-SLAM marker. While these are mentioned in the Methods section, we have added a description in the main text to highlight these points for the reader (page 7, lines 214-217).

      1. In Figure 5G, why does the blue line (PFKFB3 inhibitor) go up in the end of the real-time monitoring? Does it mean that other compensatory pathway is turned on?

      As you have pointed out, we cannot rule out the possibility that other unknown compensatory ATP production pathways were activated. We have added a note in the Discussion section to address this (page 18, lines 555-556).

      1. In Figure S6H&J, the reduction is marginal. Does it mean that PKM2 is not important for ATP production in HSCs?

      The activity of the inhibitor is essential in the GO-ATeam2 analysis. The commercially available PKM2 inhibitors have a higher IC50 value (IC50 = 2.95 μM in this case). Nevertheless, the effect of reducing the ATP concentration was observed in progenitor cells, but not in HSCs. The report by Wang et al. 9 on the analysis using a PKM2-deficient model suggests a stronger effect on progenitor cells than on HSCs. Our results are similar to those of the previous report.

      (Specific comments)

      Specifically, there are several major points that rise concerns about the claims:

      1. The gating strategy to select HSCs with enlarged Sca1 gating is not convincing. I understand the rationale to have a sufficient number of cells to analyze, however this gating strategy should be applied also in the control group. From the FACS plot seems that there are more HSCs upon 5FU treatment (Figure S1b). How that is possible? Is it because of the 20% more of cycling cells at day 6? To prove that this gating strategy still represents a pure HSC population, authors should compare the blood reconstitution capability of this population with a "standard" gated population. If the starting population is highly heterogeneous then the metabolic readout could simply reflect cell heterogeneity.

      Thank you for pointing this out. First, we did not enlarge the Sca-1 gating in this study. We apologize for any confusion caused by the incomplete description. The gating of c-Kit is based on that shown by Umemoto et al (Figure EV1A) 2, who used 250 mg/kg 5-FU, so their c-Kit reduction is more pronounced than ours.

      We followed this study and compared c-Kit expression in the Lin-Sca-1+CD150+CD48-EPCR+ gates to BMMNCs on day 6 after 5-FU administration (150 mg/kg). The results are shown below.

      Author response image 21.

      Since the MFI of c-Kit was downregulated, we used gating that extended the c-Kit gate to lower expression regions on day 6 after 5-FU administration (revised Figure S1C).

      At other time points, LSK gating was the same as in the PBS-treated group, as noted in the Methods.

      The reason why the number of HSCs appears to be higher in the 5-FU group is because most of the differentiated blood cells were lost due to 5-FU administration and the same number of cells as in the PBS group were analyzed by FACS, resulting in a relatively higher number of HSCs. The legend of Figure S1 shows that the number of HSCs in both the PBS and 5-FU groups appeared to increase because the same number of BMMNCs was obtained at the time of analysis (page S22, lines 596-598).

      Regarding cellular heterogeneity, from a metabolic point of view, the heterogeneity in HSCs is rather reduced by 5-FU administration. As shown in Figure S3A–C, this is simulated under stress conditions, such as after 5-FU administration or during OXPHOS inhibition, where the flux variability in each enzymatic reaction is significantly reduced. GO-ATeam2 analysis after 5-FU treatment showed no increase in cell population variability. After 2-DG treatment, ATP concentrations in HSCs were widely distributed from 0 mM to 0.8 mM in the PBS group, while more than 80% of those in the 5-FU group were less than 0.4 mM (Figures 4B, D). HSCs may have a certain metabolic diversity at a steady state, but under stress conditions, they may switch to a more specialized metabolism with less cellular heterogeneity in order to adapt.

      1. S2 does not show major differences before and after sorting. However, a key metabolite like Lactate is decreased, which is also one of the most present. Wouldn't that mean that HSCs once they move out from the hypoxic niche, they decrease lactate production? Do they decrease anaerobic glycolysis? How can quiescent HSC mostly rely on OXPHOS being located in hypoxic niche?

      2. Since HSCs in the niche are located in hypoxic regions of the bone marrow, would that not mimic OxPhos inhibition (oligomycin)? Would that not mean that HSCs in the niche are more glycolytic (anaerobic glycolysis)?

      3. In Figure 5B, the orange line (Glucose+OXPHOS inhibition) remains stable, which means HSCs prefer to use glycolysis when OXPHOS is inhibited. Which metabolic pathway would HSCs use under hypoxic conditions? As HSCs resides in hypoxic niche, does it mean that these steady-state HSCs prefer to use glycolysis for ATP production? As mentioned before, mitochondrial inhibition can be comparable at the in vivo condition of the niche, where low pO2 will "inhibit" mitochondria metabolism.

      Thank you for the first half of comment 2 on the technical features of our approach. First, as you have pointed out, there is minimal variation and stable detection of many metabolites before and after sorting (Figure S2A), suggesting that isolation from the hypoxic niche and sorting stress do not significantly alter metabolite detection performance. This is consistent with a previous report by Jun et al. 10. Meanwhile, lactate levels decreased by sorting. Therefore, if the activity of anaerobic glycolysis was suppressed in stressed HSCs, it may be difficult to detect these metabolic changes with our tracer analysis. However, in this study, several glycolytic metabolites, including an increase in lactate, were detected in HSCs from 5-FU-treated mice compared with HSCs from PBS-treated mice that were similarly sorted and prepared, suggesting an increase in glycolytic activity. In other words, we may have been fortunate to detect the stress-induced activation of the glycolytic system beyond the characteristic of our analysis system that lactate levels tend to appear lower than they are. Given that damage to the bone marrow hematopoiesis tends to alleviate the low-oxygen status of the niche 11, we postulate that this upregulated aerobic glycolysis arises intrinsically in HSCs rather than from external conditions.

      The second half of comment 2, and comments 7 and 10, are essential and overlapping comments and will be answered together. Although genetic analyses have shown that HSCs produce ATP by anaerobic glycolysis in low-oxygen environments 9,12, our GO-ATeam2 analysis in this study confirmed that HSCs also generate ATP via mitochondria. This is also supported by Ansó's prior findings where the knockout of the Rieske iron–sulfur protein (RISP), a constituent of the mitochondrial electron transport chain, impairs adult HSC quiescence and bone marrow repopulation 13. Bone marrow is a physiologically hypoxic environment (9.9–32.0 mmHg 11). However, the p50 value of mitochondria (the partial pressure of oxygen at which respiration is half maximal) is below 0.1% oxygen concentration at atmospheric pressure (less than 1 mmHg) 7. This suggests that OXPHOS can retain sufficient activity even under physiologically hypoxic conditions. We are currently initiating efforts to discern ATP concentrations in vivo within the bone marrow under physiological hypoxia. This will be reported in a separate project in the future. Admittedly, when we began this research, we did not anticipate the significant mitochondrial reliance of HSCs. As we previously reported, the metabolic uncoupling of glycolysis and mitochondria 12 may enable HSCs to activate only glycolysis, and not mitochondria, under stress conditions such as post-5-FU administration, suggesting a unique metabolic trait of HSCs. We have included these technical limitations and perspectives in the Discussion section (page 17, lines 520-523).

      1. The authors performed challenging experiments to track radiolabeled glucose, which are quite remarkable. However, the data do not fully support the conclusions. Mitochondrial metabolism in HSCs can be supported by fatty acid and glutamate, thus authors should track the fate of other energy sources to fully discriminate the glycolysis vs mito-metabolism dependency. From the data on S2 and Fig1 1C-F, the authors can conclude that upon 5FU treatment HSCs increase glycolytic rate.

      2. FIG.2B-C: Increase of Glycolysis upon oligomycin treatment is common in many different cell types. As explained before, other radiolabeled substrates should be used to understand the real effect on mitochondria metabolism.

      Thank you for your suggestion. While this is essential for future studies, we believe it is not the primary focus of the paper. We are planning another research project on tracer analysis using labeled fatty acids and glutamates, which we will report on in the near future. We have clearly stated in the Abstract and Introduction of the revised manuscript that the focus of this study is on changes in glucose metabolism when HSCs are stressed (page 3, line 75 and 87, page 5, lines 135).

      Instead, we have added the following analyses of metabolic changes in fatty acids and glutamate using the GO-ATeam2 system: HSCs derived from GO-ATeam2 mice treated with PBS or 5-FU were used to measure changes in ATP concentrations after exposure to the fatty acid beta-oxidation (FAO) inhibitor etomoxir and the glutaminolysis inhibitor 6-diazo-5-oxo-L-norleucine (DON). Etomoxir was used at 100 µM, a concentration that inhibits FAO without inhibiting mitochondrial electron transfer complex I, as previously reported 5. DON was used at 2 mM, a concentration that sufficiently inhibits the enzyme as the Ki for glutaminase is 6 µM. In this experiment, etomoxir alone, DON alone, or etomoxir and DON in combination did not decrease the ATP concentration of HSCs in the PBS and 5-FU groups (revised Figures S7J–M), suggesting that FAO and glutaminolysis were not essential for ATP production in HSCs in the short term. Thus, according to the analysis using the GO-Ateam2 system, HSCs exposed to acute stresses change the efficiency of glucose utilization (accelerated glycolytic ATP production) rather than other energy sources. Since there are reports that FAO and glutaminolysis are required for HSC maintenance in the long term 5,6, compensatory pathways may be able to maintain ATP levels in the short term. A description of these points has been added to the Discussion (page 17, lines 525-527).

      Author response image 22.

      Fatty acid β-oxidation activity was also measured in 5-FU-treated HSCs using the fluorescent probe FAOBlue and was unchanged compared to PBS-treated HSCs (revised Figure S7N).

      Author response image 23.

      Notably, the addition of 100 µM etomoxir plus glucose and Pfkfb3 inhibitors resulted in a rapid decrease in ATP concentration in HSCs (revised Figures S7O–P). This indicates that etomoxir partially mimics the effect of oligomycin, suggesting that at a steady state, OXPHOS is driven by FAO, but can be compensated by the acceleration of the glycolytic system by Pfkfb3. Meanwhile, the exposure of HSCs to Pfkfb3 inhibitors in addition to 2 mM DON did not reduce ATP (revised Figures S7O–P). This suggests that ATP production from glutaminolysis is limited in HSCs at a steady state.

      Author response image 24.

      These points suggest that OXPHOS is driven by fatty acids at a steady state, but unlike the glycolytic system, FAO is not further activated by HSCs after 5-FU treatment. The results of these analyses and related descriptions are included in the revised manuscript (page 11, lines 332-344).

      1. In Figure S1, 5-FU leads to the induction of cycling HSCs and in figure 1, 5-FU results in higher activation of glycolysis. Would it be possible to correlate these two phenotypes together? For example, by sorting NBDG+ cells and checking the cell cycle status of these cells?

      We appreciate the reviewer’s insightful comments. We administered 2-NBDG to mice and fractionated HSCs at the 2-NBDG fluorescence level for cell cycle analysis. The results are shown below (revised Figure S1M). Notably, even in the PBS-treated group, HSCs with high 2-NBDG uptake were more proliferative than HSCs with low 2-NBDG uptake and were comparable to HSCs after 5-FU treatment, although the overall population of HSCs that exited the G0 phase and entered the G1 phase increased after 5-FU treatment. In both PBS/5-FU-treated groups, these profound differences in cell cycle glucose utilization suggest a direct link between HSC proliferation and glycolysis activation. Descriptions of the above findings and perspectives have been added to the Results and Discussion section (page 7, lines 208-214, page 20, lines 607-610).

      Author response image 25.

      1. Why are only ECAR measurements (and not OCR measurements) shown? In Fig.2G, why are HSCs compared with cKit+ myeloid progenitors, and not with MPP1? The ECAR increased observed in HSC upon oligomycin treatment is shared with many other types of cells. However, cKit+ cells have a weird behavior. Upon oligo treatment cKit+ cells decrease ECAR, which is quite unusual. The data of both HSCs and cKit+ cells could be clarified by adding OCR curves. Moreover, it is recommended to run glycolysis stress test profile to assess the dependency to glycolysis (Glucose, Oligomycin, 2DG).

      In addition to the ECAR data of the Mito Stress test (original Figures 2G–H), OCR data were added in the revised manuscript (revised Figures 2H, S3D). Compared to c-Kit+ myeloid progenitors (LKS- cells), HSC exhibited a similar increase in ECAR, while the decrease in OCR was relatively limited. This may be because glycolytic and mitochondrial metabolism are coupled in c-Kit+ myeloid progenitors, whereas they are decoupled in HSCs. This is also suggested by the glucose plus oligomycin experiment in Figures 5B, C, and S6A–D (orange lines). In summary, in HSCs, glycolytic and mitochondrial ATP production are decoupled and can maintain ATP levels by glycolytic ATP production alone, whereas in progenitors including GMPs, the two ATP production systems are constantly coupled, and glycolysis alone cannot maintain the ATP concentration. While we could not conduct a glycolysis stress test, we believe that Pfkfb3-dependent glycolytic activation, which is evident in the oligomycin+glucose+Pfkfb3i experiment, is only apparent in HSCs when subjected to glucose+oligomycin treatment (original Figures 5F–I). We have added descriptions of these points in the Results and Discussion section (page 8, lines 240-243, page 18, lines 558-561).

      Author response image 26.

      FIG.3 A-C. As mentioned previously, the flux analyses should be integrated with data using other energy sources. If cycling HSCs are less dependent to OXPHOS, what happen if you inhibit OXHPHOS in 5-FU condition? Since the authors are linking OXPHOS inhibition and upregulation of Glycolysis to increase proliferation, do HSCs proliferate more when treated with oligomycin?

      First, please see our response to comments 3 and 5 regarding the first part of this comment about the flux analysis of other energy sources. According to the analysis using the GO-Ateam2 system, stressed HSCs change the efficiency of glucose utilization (accelerated glycolytic ATP production) rather than other energy sources. The change in ATP concentration after OXPHOS inhibition for 5-FU-treated HSCs is shown in Figures 4C and E, suggesting that the activity of OXPHOS itself does not increase. HSCs after oligomycin treatment and HSCs after 5-FU treatment are similar in that they activate glycolytic ATP production. However, inhibition of OXPHOS did not induce the proliferation of HSCs (original Figure S1K). This suggests that proliferation activates glycolysis and not that activation of the glycolytic system induces proliferation. This similarity and dissimilarity of glycolytic activation upon proliferation and OXPHOS inhibition is discussed in the Discussion section (page 16-17, lines 505-515).

      1. FIG.4 shows that in vivo administration of radiolabeled glucose especially marks metabolites of TCA cycle and Glycolysis. The authors interpret enhanced anaerobic glycolysis, but I am not sure this is correct; if more glycolysis products go in the TCA cycle, it might mean that HSC start engaging mitochondrial metabolism. What do the authors think about that?

      Thank you for pointing this out. We believe that the data are due to two differences in the experimental features between in vivo (Figure S5) and in vitro (Figures 1 and S2) tracer analysis. The first difference is that in in vivo tracer analysis, unlike in vitro, all cells can metabolize U-13C6-glucose. Another difference is that after glucose labeling in vivo, it takes approximately 120–180 minutes to purify HSCs to extract metabolites, and processing on ice may result in a gradual progression of metabolic reactions within HSCs. As a result, in vivo tracer analysis may detect an increased influx of labeled carbon derived from U-13C6-glucose into the TCA cycle over an extended period. However, it is difficult to interpret whether this influx of labeled carbon is derived from the direct influx of glycolysis or the re-uptake by HSCs of metabolites that have been metabolized to other metabolites in other cells. Meanwhile, as shown in Figure 4C using the GO-ATeam2 system, ATP production from mitochondria is not upregulated by 5-FU treatment. This suggests that even if the direct influx from glycolysis into the TCA cycle is increased, the rate of ATP production does not exceed that of glycolysis. Despite these technical caveats in interpretation, the results of in vivo and in vitro tracer analyses are considered essential. In particular, we consider the increased labeling of metabolites involved in glycolysis and nucleotide synthesis to be crucial. We have added a discussion of these points, including experimental limitations (page 17-18, lines 530-545).

      1. FIG.4: the experimental design is not clear. Are BMNNCs stained and then put in culture? Is it 6-day culture or BMNNCs are purified at day 6 post 5FU? FIG-4B-C The difference between PBS vs 5FU conditions are the most significant; however, the effect of oligomycin in both conditions is the most dramatic one. From this readout, it seems that HSCs are more dependent on mitochondria for energy production both upon 5FU treatment and in PBS conditions.

      We apologize for the incomplete description of the experimental details. The experiment involved dispensing freshly stained BMMNC with surface antigens into the medium and immediately subjecting them to flow cytometry analysis. For post-5-FU treatment HSCs, mice were administered with 5-FU (day 1), and freshly obtained BMMNCs were analyzed on day 6. The analysis of HSCs and progenitors was performed by gating each fraction within the BMMNC (original Figure S5A). We have added these details to ensure that readers can grasp these aspects more clearly (page S5, lines 155-158).

      As pointed out by the reviewer, we understand that HSCs produce more ATP through OXPHOS. However, ATP production by glycolysis, although limited, is observed under steady-state conditions (post-PBS treatment HSC), and its reliance increases during the proliferation phase (post-5-FU treatment HSC) (original Figures 4B, D). Until now, discussions on energy production in HSCs have focused on either glycolysis or mitochondrial functions. However, with the GO-ATeam2 system, it has become possible for the first time to compare their contributions to ATP production and evaluate compensatory pathways. As a result, it became evident that while OXPHOS is the main source of ATP production, the reliance on glycolysis plastically increases in response to stress. This has led to a better understanding of HSC metabolism. These points are included in the Discussion as well (page 16, lines 479-488).

      1. FIG.6H should be extended with cell cycle analyses. There are no differences between 5FU and ctrl groups. If 5FU induces HSCs cycling and increases glycolysis I would expect higher 2-NBDG uptake in the 5FU group. How do the authors explain this?

      Thank you for your comments. In the original Figure 6H, we found that 2-NBDG uptake correlated with mPFKFB3 levels in both the 5-FU and PBS groups. mPfkfb3 levels remained low in the few HSCs with low 2-NBDG uptake in the 5-FU group.

      In the revised manuscript, to directly relate glucose utilization to the cell cycle, we administered 2-NBDG to mice and fractionated HSCs at the 2-NBDG fluorescence level for cell cycle analysis. The results are shown below (revised Figure S1M). Notably, even in the PBS-treated group, HSCs with high 2-NBDG uptake were more proliferative than those with low 2-NBDG uptake and are comparable to HSCs after 5-FU treatment, although the overall population of HSCs that exited the G0 phase and entered the G1 phase increased after 5-FU treatment. The large differences in glucose utilization per cell cycle observed in both PBS/5-FU-treated groups suggest a direct link between HSC proliferation and glycolysis activation. Descriptions of the above findings have been added to the Results and Discussion ((page 7, lines 208-214, page 20, lines 607-610).

      Author response image 27.

      1. In S7 the experimental design is not clear. What are quiescent vs proliferative conditions? What does it mean "cell number of HSC-derived colony"? Is it a CFU assay? Then you should show colony numbers. When HSCs proliferate, they need more energy thus inhibition of metabolism will impact proliferation. What happens if you inhibit mitochondrial metabolism with oligomycin?

      Regarding the proliferative and quiescence-maintaining conditions, we have previously reported on these 8. In brief, these are culture conditions that maintain HSC activity at a high level while allowing for the proliferation or maintenance of HSCs in quiescence, achieved by culturing under fatty acid-rich, hypoxic conditions with either high or low cytokine concentrations. Analysis was performed after one week of culture, with the HSC number determined by flow cytometry based on the LSK-SLAM marker. While these are mentioned in the Methods section, we have added a description in the main text to highlight these points for the reader (page 7, lines 214-217).

      In vitro experiments with the oligomycin treatment of HSCs showed that OXPHOS inhibition activates the glycolytic system, but does not induce HSC proliferation (original Figure S1K). This suggests that proliferation activates glycolysis and not that activation of the glycolytic system induces proliferation. This similarity and dissimilarity of glycolytic activation upon proliferation and OXPHOS inhibition is discussed in the Discussion (page 16-17, lines 505-515).

      1. In FIG 7 since homing of HSCs is influenced by the cell cycle state, should be important to show if in the genetic model for PFKFB3 in HSCs there's a difference in homing efficiency.

      In response to the reviewer's comments, we knocked out PFKFB3 in HSPCs derived from Ubc-GFP mice, transplanted 200,000 HSPCs into recipients (C57BL/6 mice) post-8.5Gy irradiation, and harvested the bone marrow of recipients after 16 h to compare homing efficiency (revised Figure S10H). Even with the knockout of PFKFB3, no significant difference in homing efficiency was detected compared to the control group (Rosa knockout group). These results suggest that the short-term reduction in chimerism due to PFKFB3 knockout is not due to decreased homing efficiency or cell death by apoptosis (Figure 7K) but a transient delay in cell cycle progression. We have added descriptions regarding these findings in the Results and Discussion sections (page 15, lines 470-471, page 19, lines 576-578).

      Author response image 28.

      1. Yamamoto M, Kim M, Imai H, Itakura Y, Ohtsuki G. Microglia-Triggered Plasticity of Intrinsic Excitability Modulates Psychomotor Behaviors in Acute Cerebellar Inflammation. Cell Rep. 2019;28(11):2923-2938 e2928.

      2. Umemoto T, Johansson A, Ahmad SAI, et al. ATP citrate lyase controls hematopoietic stem cell fate and supports bone marrow regeneration. EMBO J. 2022:e109463.

      3. Seita J, Sahoo D, Rossi DJ, et al. Gene Expression Commons: an open platform for absolute gene expression profiling. PLoS One. 2012;7(7):e40321.

      4. Boyd S, Brookfield JL, Critchlow SE, et al. Structure-Based Design of Potent and Selective Inhibitors of the Metabolic Kinase PFKFB3. J Med Chem. 2015;58(8):3611-3625.

      5. Ito K, Carracedo A, Weiss D, et al. A PML–PPAR-δ pathway for fatty acid oxidation regulates hematopoietic stem cell maintenance. Nat Med. 2012;18(9):1350-1358.

      6. Oburoglu L, Tardito S, Fritz V, et al. Glucose and glutamine metabolism regulate human hematopoietic stem cell lineage specification. Cell Stem Cell. 2014;15(2):169-184.

      7. Gnaiger E, Mendez G, Hand SC. High phosphorylation efficiency and depression of uncoupled respiration in mitochondria under hypoxia. Proc Natl Acad Sci U S A. 2000;97(20):11080-11085.

      8. Kobayashi H, Morikawa T, Okinaga A, et al. Environmental Optimization Enables Maintenance of Quiescent Hematopoietic Stem Cells Ex Vivo. Cell Rep. 2019;28(1):145-158 e149.

      9. Wang YH, Israelsen WJ, Lee D, et al. Cell-state-specific metabolic dependency in hematopoiesis and leukemogenesis. Cell. 2014;158(6):1309-1323.

      10. Jun S, Mahesula S, Mathews TP, et al. The requirement for pyruvate dehydrogenase in leukemogenesis depends on cell lineage. Cell Metab. 2021;33(9):1777-1792 e1778.

      11. Spencer JA, Ferraro F, Roussakis E, et al. Direct measurement of local oxygen concentration in the bone marrow of live animals. Nature. 2014;508(7495):269-273.

      12. Takubo K, Nagamatsu G, Kobayashi CI, et al. Regulation of glycolysis by Pdk functions as a metabolic checkpoint for cell cycle quiescence in hematopoietic stem cells. Cell Stem Cell. 2013;12(1):49-61.

      13. Anso E, Weinberg SE, Diebold LP, et al. The mitochondrial respiratory chain is essential for haematopoietic stem cell function. Nat Cell Biol. 2017;19(6):614-625.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      In this study, Fang H et al. describe a potential pathway, ITGB4-TNFAIP2-IQGAP1-Rac1, that may involve in the drug resistance in triple negative breast cancer (TNBC). Mechanistically, it was demonstrated that TNFAIP2 bind with IQGAP1 and ITGB4 activating Rac1 and the following drug resistance. The present study focused on breast cancer cell lines with supporting data from mouse model and patient breast cancer tissues. The study is interesting. The experiments were well controlled and carefully carried out. The conclusion is supported by strong evidence provided in the manuscript. The authors may want to discuss the link between ITGB4 and Rac1, between IQGAP1 and Rac1, and between TNFAIP2 and Rac1 as compared with the current results obtained. This is important considering some recent publications in this area (Cancer Sci 2021, J Biol Chem 2008, Cancer Res 2023). In addition, some key points need to be addressed in order to support their conclusion in full.

      Thanks for your positive comments.

      1) It is rarely found studies using the term of "DNA damage drug resistance". Do the authors mean "DNA damage and drug resistance" or "DNA damage-related drug resistance" or "DNA damage-induced drug resistance"? It is better to define "DNA damage drug resistance" in the manuscript if it is not a common term in the field.

      We agree with you that the description "DNA damage-related drug resistance" is better so that we revised it uniformly in the manuscript.

      2) For Figure 4A, it is stated the IQGAP1 is identified via IP-MS. However, the MS results are not presented in the Figure or in the supplementary. In Figure 4A, only the IP results with silver staining was presented. Moreover, based on the silver staining here, a bunch of proteins were increased in TNFAIP2 overexpression group compared to the vector group. Especially, there is a much clearer band at 52kDa. The authors didn't explain why they chose IQGAP1 and ITGB4 which are less clear than the protein(s) at 52kDa.

      Supplementary table 1 is our mass spectrometry results. There are two reasons for choosing ITGB4 and IQGAP1. Firstly, we selected the proteins that indeed interact with TNFAIP2 according to our verification experiments. Secondly, we were interested in the mechanism by which TNFAIP2 promoting DNA damage-related drug resistance, and we found that ITGB4 promoted drug resistance, while IQGAP1 activated Rac1.

      3) According to the images in Figure 4C, the efficiency of si-IQGAP1 is limited. The authors could analyze the WB image to confirm the inhibition efficiency of si-IQGAP1.

      We analyzed the WB images and the quantitative results are as follows in Author response image 1. The knockdown efficiency is acceptable.

      Author response image 1.

      4) In Figure 5B, I wonder whether the authors can explain why the IgG could immunoprecipitate similar amount of ITGB4 protein as input group.

      In this experiment, the Input group had relatively less loading amount (5%), while the IgG group had nonspecific binding.

      5) According to the results from Figure 6B, the inhibition efficiency of shITGB4#1 is much higher than shITGB4#2. However, the effects of shITGB4#1 on GTP-Rac1 are similar to or even weaker than those of shITGB4#2 in both HCC1806 and HCC1937. Can this be explained?

      The possible reason is that downregulation of ITGB4 expression to a certain level is sufficient to inhibit the activation of Rac1.

      6) In Figure 6F, there are double bands for ITGB4 while only one band shows in other Figures. Please find a better representative image here.

      ITGB4 has a cleaved band in addition to the main band. These two bands could be separated when we used a low concentration SDS-PAGE gel.

      7) In the manuscript, GAPDH, b-Actin and Tubulin are used in different experiments as internal controls. Is there any specific reason to using different internal controls for different experiments here?

      There is no specific reason using different internal controls. These experiments were conducted by different person. Each individual chose different internal controls based on the protein sizes.

      8) I cannot find Table 1 for the correlation results for TNFAIP2 and ITGB4. I wonder whether Figure 8E is the Table 1 as is mentioned, since it is stated in line 561 that Figure 8E is "the work model of this paper" but actually Figure 8F is. If Figure 8E is the correlation results, I highly recommended the scatter plots graph is used here to present more clear and visualized correlation between TNFAIP2 and ITGB4.

      Figure 8E is indeed the correlation result. In addition, Figure 8E could not be presented as scatter plot graph because the pattern of TNFAIP2 and ITGB4 expression is negative or positive according to the determination of IHC results which was carried out by professional pathologists.

      9) Throughout the whole manuscript, no description of N number was found in figure legends or in Methods for in vitro experiments. N number is important for statistical analysis.

      All our experiments have set up three replicates. We provide this information in figure legends.

      Reviewer #2:

      Breast cancer is the most common malignant tumor in women. One of subtypes in breast cancer is so called triple-negative breast cancer (TNBC), which represents the most difficult subtype to treat and cure in the clinic. Chemotherapy drugs including epirubicin and cisplatin are widely used for TNBC treatment. However, drug resistance remains as a challenge in the clinic. The authors uncovered a molecular pathway involved in chemotherapy drug resistance, and molecular players in this pathway represent as potential drug targets to overcome drug resistance. The experiments are well designed and the conclusions drawn mostly were supported by the data. The findings have potential to be translated into the clinic.

      Thanks for your positive comments.

      1) In Introduction, the statement of "Breast cancer is the most common malignant tumor in women, and the morbidity and mortality rates of female malignant tumors are ranked first in the world" is inaccurate.

      We have revised the description as“Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer death in women”.

      2) In Materials and Methods, "Immunopurification and silver staining" is not correct, which should be replaced with "Immunoprecipitation and silver staining".

      We replaced the description in the manuscript according to your suggestion.

      3) It is unclear Why the authors chose the two TNBC cell lines, HCC1806 and HCC1937, for cell models in this work.

      We chose these two cell lines according to our previous work“KLF5 promotes breast cancer proliferation, migration and invasion in part by upregulating the transcription of TNFAIP2” (doi: 10.1038/onc.2015.263. Epub 2015 Jul 20).

      4) To demonstrate TNFAIP2 and ITGB4 confer TNBC drug resistance in vivo, the knockdown efficiency of animal experiments was not shown.

      The knockdown efficiency of animal experiments was shown below. We added this result into Figure 2-figure supplement 2G and Figure 5-figure supplement 2N.

      5) I would strongly suggest the authors seek help from a language editing service to improve the manuscript.

      We improved the manuscript by using a professional English language editing service and we have carefully revised the manuscript.

      Reviewer #3:

      In this manuscript, Fang and colleagues found that IQGAP1 interacts with TNFAIP2, which activates Rac1 to promote drug resistance in TNBC. Furthermore, they found that ITGB4 could interact with TNFAIP2 to promote TNBC drug resistance via the TNFAIP2/IQGAP1/Rac1 axis by promoting DNA damage repair.

      This work has good innovation and high potential clinical significance. However, there are several unsolved concerns that have to be addressed.

      Thanks for your positive comments.

      1) In the manuscript, there are four drugs used for in vitro cell experiments, why is olaparib (AZD) not used for in vivo animal experiments?

      There are two reasons why we did not choose AZD. First,the killing effect of AZD is not as strong as that of BMN. Second, AZD is more expensive than BMN. We finally chose BMN for animal experiments.

      2) In Figure 4B, why the immunoprecipitation experiments is done in HCC1806 cell line?

      In our previous study “KLF5 promotes breast cancer proliferation, migration and invasion in part by upregulating the transcription of TNFAIP2” (doi: 10.1038/onc.2015.263. Epub 2015 Jul 20), we found that TNFAIP2 knockdown could obviously inhibit the activation of Rac1 in HCC1806 when compared to the result in HCC1937. So, we used HCC1806 cell line to perform the IP-Mass assay.

      3) There should be data showing the knockdown effect of TNFAIP2 and ITGB4 in animal experiments.

      We addressed the same question above (Reviewer #2, Question#4).

      4) When screening the interaction regions between ITGB4 and TNFAIP2, why the TNFAIP2 protein truncation strategy is to delete the N-terminus?

      In fact, we also deleted the C-terminus, but the deletion of C-terminus of TNFAIP2 did not affect the interaction.

      5) In the manuscript, "input" should be changed to "Input".

      We corrected it.

      6) There should be a space between "Figure" and numbers.

      We add a space between "Figure" and numbers.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Tobón and Moser reveal a remarkable amount of presynaptic diversity in the fundamental Ca dependent exocytosis of synaptic vesicles at the afferent fiber bouton synapse onto the pilar or mediolar sides of single inner hair cells of mice. These are landmark findings with profound implications for understanding acoustic signal encoding and presynaptic mechanisms of synaptic diversity at inner hair cell ribbon synapses. The paper will have an immediate and long-lasting impact in the field of auditory neuroscience.

      Main findings: 1) Synaptic delays and jitter of masker responses are significantly shorter (synaptic delay: 1.19 ms) at high SR fibers (pilar) than at low SR fibers (mediolar; 2.57 ms). 2) Masked evoked EPSC are significantly larger in high SR than in low SR. 3) Quantal content and RRP size are 14 vesicles in both high and low SR fibers. 4) Depression is faster in high SR synapses suggesting they have a higher release probability and tighter Ca nanodomain coupling to docked vesicles. 5) Recovery of master-EPSCs from depletion is similar for high and low SR synapses, although there is a slightly faster rate for low SR synapses that have bigger synaptic ribbons, which is very interesting. 6) High SR synapses had larger and more compact (monophasic) sEPSCs, well suited to trigger rapidly and faithfully spikes. 7) High SR synapses exhibit lower voltage (~sound pressure in vivo) dependent thresholds of exocytosis.

      Strengths:

      Great care was taken to use physiological external pH buffers and physiological external Ca concentrations. Paired recordings were also performed at higher temperatures with IHCs at physiological resting membrane potentials and in more mature animals than previously done for paired recordings. This is extremely challenging because it becomes increasingly difficult to visualize bouton terminals when myelination becomes more prominent in the cochlear afferents.

      In addition, perforated patch recordings were used in the IHC to preserve its intracellular milieu intact and thus extend the viability of the IHCs. The experiments are tour-de-force and reveal several novel aspects of IHC ribbon synapses. The data set is rich and extensive. The analysis is detailed and compelling.

      We would like to thank the reviewer for the appreciation of our work and the comments that helped us to improve our manuscript. We detail our responses to the comments below.

      Weaknesses:

      (1) Materials and Methods: Please provide whole-cell Rs (series resistance ) and Cm (membrane capacitance) average +/- S.E.M. (or SD) values for IHC and afferent fiber bouton recordings. The Cm values for afferents have been estimated to be about 0.1 pF (Glowatzki and Fuchs, 2002) and it would be interesting to know if there are differences in these numbers for high and low SR afferents. Is it possible to estimate Cm from the capacitative transient time constant? Minimal electronic filtering would be required for that to work, so I realize the authors may not have this data and I also realize that the long cable of the afferents do not allow accurate Cm measurements, but some first order estimate would be very interesting to report, if possible.

      In response to the reviewer’s comment, we now added the estimates of series resistance and membrane capacitance for IHC and bouton recordings in Material and Methods and in the Figure 1 – figure supplement 1. Our estimate for bouton Cm is on average 1.7 ± 0.09 pF, a value that compares well to the literature. For example, Glowatzki and Fuchs (2002) provided estimates ranging 0.5-2 pF for recordings from afferent inner hair cell synapses in rats that showed a capacitance transient. In own prior work on afferent inner hair cell synapses of pre-hearing mice, we found estimates of 2.6 ± 0.5 pF (Chapochnikov et al., 2014) and 1.9 ± 0.2 pF (Takago et al., 2019). Keen and Hudspeth (2006) reported capacitances of 1–4 pF for afferent terminals in the bullfrog amphibian papilla. There was no difference in bouton Cm between high SR (1.78 ± 0.19 pF) and low SR synapses (1.68 ± 0.11 pF; p = 0.6575, unpaired t test).

      (2) Page 20, 26 and Figure 4: With regard to synaptic delays at auditory hair cell synapses: please see extensive studies done in Figure 11 of Chen and von Gersdorff (JNeurosci., 2019); this showed that synaptic delays are 1.26 ms in adult bullfrog auditory hair cells at 31oC, which is very similar to the High SR fibers (1.19 ms; Fig.4B and page 20). During ongoing depolarizations (e.g. during a sustained sine wave) the synaptic delay can be reduced to just 0.72 ms for probe EPSCs, which is a more usual number for mature fast synapses. This paper should, thus, be cited and briefly discussed in the Discussion. So a significant shortening of delay occurs for the probe response and this is also observed in young rat IHC synapses (see Goutman and Glowatzki, 2011).

      We thank the reviewer for this comment. We have analysed the synaptic delay of the probe response and included it in Figure 4 – figure supplement 1. Contrary to the findings from Goutman and Glowatzki (2011) and Chen and von Gersdorff (2019), we did not observe a shortening of the synaptic delay for the probe response compared to the masker response. This difference might arise from the duration of the masker stimulus and/or the IHC holding potential. Synaptic facilitation in hair cells seems to occur only when the RRP is not depleted by the first stimulus (Cho et al., 2011). Our 100 ms masker depolarization from a holding potential of -58 mV effectively depleted the synapse RRP (Figure 4D), while both studies mentioned above used relatively short depolarizations (2 in rat and 20 ms in bullfrog) from a holding potential around -90 mV, which most likely didn’t deplete the RRP. Indeed, when using partially RRP depleting stimuli of 10 ms, Goutman (2011) observed longer synaptic latencies and smaller responses to the second stimulus. We have included this discussion in the last paragraph of the results section.

      Additionally, we would also like to note that we referred to the important work on frog hair cell synapses in the manuscript, yet aimed to focus on relating synaptic heterogeneity of mammalian inner hair cell synapses to the functional diversity of type I spiral ganglion neurons that unlike the frog afferents show little branching of their peripheral neurites (in only ~15% of the neurons). We think it will be very interesting to study the aspect of presynaptic heterogeneity in the bullfrog amphibian papilla, but assume that the converging input of several active zones onto a single afferent might provide a different encoding scheme than in the mammalian cochlea.

      (3) Gaussian-like (and/or multi-peak) EPSC amplitude distributions were obtained in more mature rat IHCs by Grant et al. (see their Figure 4G; JNeurosci. 2010; postnatal day 19-21). The putative single quanta peak was at 50 pA and the main peak was at 375 pA. The large mean suggests a low CV (probably < 0.4). However, Fig. 2F shows a mean of about 100 pA and CV = 0.7 for spontaneous EPSCs. This major difference deserves some more discussion. I suppose that one possible explanation may be that the current paper holds the IHC membrane potential fixed at -58 mV, whereas Grant et al. (2010) did not control the IHC membrane potential and spontaneous fluctuations in the Vm may have depolarized the IHC, thus producing larger evoked EPSCs that are triggered by Ca channel openings. Some discussion that compares these differences and possible explanations would be quite useful for the readers.

      We understand the reviewer’s concern. We have now included the amplitude distribution of sEPSCs recorded from 12 boutons without patch-clamping the IHC (Figure 2–figure supplement 1, panel A). The rest of the recording conditions (i.e., artificial perilymph-like solution, physiological temperature and age) were identical to the conditions used for the paired recordings. Both the range of spontaneous rate (0 up to 16.33 sEPSC/s) and the amplitude distribution (peak at -40 pA and CV of 0.66) were comparable to the values we obtained when clamping the IHC resting potential at -58 mV. In addition, for two of our pairs, we established the bouton recording first, measured the spontaneous release, then established the perforated patch-clamp of the IHC and measured the spontaneous release again with IHC held at -58 mV. For pair #l300321_1, the SR before clamping the IHC was 0.0125 sEPSC/s, with a maximal AmpsEPSC of -110 pA (avg. -52 pA). The SR while holding the IHC at -58 mV was 0.36 sEPSCs/s, with a maximal AmpsEPSC of -140 pA (avg. -46 pA). For pair #l200522_2, the SR changed from 0.07 sEPSC/s to 0. The maximal AmpsEPSC before clamping the IHC was -70 pA (avg. -31 pA). Overall, our data recorded without controlling the IHC argues against the resting potential of -58 mV as a major source of differences in EPSC rate and amplitudes compared to previous studies.

      Nonetheless, it is important to note that the experimental conditions used in our study differ from previous reports in several aspects. Our extracellular solution contains the physiological pH buffer bicarbonate instead of the fast buffer HEPES, as well as TEA and Cs+ for proper isolation of the Ca2+ currents. Both pH and potassium channel blockers can alter the excitability of the cell and, consequently, the spontaneous and evoked release. For instance, despite maintaining a similar extracellular pH (7.3 to 7.4), the choice of bicarbonate or HEPES for the extracellular solution can influence differently the regulation of the intracellular pH of the cell (Michl et al., 2019). Indeed, the activity of ion channels and receptors (e.g., AMPAR), and the resting potential can change depending on the extracellular buffer used (Hare and Owen, 1998, Vincent et al., 2019, Cho and von Gersdorff, 2014; and review Sinning and Hübner, 2013). Additionally, the animal model and the age range could be a source of difference. In rats, the EPSC amplitude distribution seems to change with maturation but not with K+ stimulation (Grant et al., 2010) or voltage depolarizations (Goutman and Glowatzki, 2007). This however does not seem to be the case for afferent boutons recorded from mice. In resting conditions (i.e. 5.8 mM extracellular K+), average EPSC amplitudes are around -100 to -150 pA for both prehearing (Chapochnikov et al., 2014) and hearing mice (Niwa et al., 2021 and the present study). Upon stimulation (40 mM K+ or voltage depolarizations), the mean EPSC amplitude does not change in prehearing mice (Jing et al., 2013; Takaba et al., 2019), but it significantly increases in hearing mice (Niwa et al., 2021 and the present study). In p20 and p30 mice, the mean EPSC amplitude was predominantly below -100 pA at rest and only increased above -100 pA after stimulation with 40 mM K+ (Niwa et al., 2021). Similarly, our reported avg. AmpsEPSC is below -150 pA, while the evoked EPSCs reached average amplitudes above -200 pA (Figure 1–figure supplement 1, panel F and Figure 4 – figure supplement 1, panel F).

      We have included the aforementioned points in the discussion under the section "Diversity of spontaneous release and their topographical segregation”.

      Reviewer #2 (Public Review):

      Summary:

      The study by Jaime-Tobon & Moser is a truly major effort to bridge the gap between classical observations on how auditory neurons respond to sounds and the synaptic basis of these phenomena. The so-called spiral ganglion neurons (SGNs) are the primary auditory neurons connecting the brain with hair cells in the cochlea. They all respond to sounds increasing their firing rates, but also present multiple heterogeneities. For instance, some present a low threshold to sound intensity, whereas others have high threshold. This property inversely correlates with the spontaneous rate, i.e., the rate at which each neuron fires in the absence of any acoustic input. These characteristics, along with others, have been studied by many reports over the years. However, the mechanisms that allow the hair cells-SGN synapses to drive these behaviors are not fully understood.

      Strengths:

      The level of experimental complexity described in this manuscript is unparalleled, producing data that is hardly found elsewhere. The authors provide strong proof for heterogeneity in transmitter release thresholds at individual synapses and they do so in extremely complex experimental settings. In addition, the authors found other specific differences such as in synaptic latency and max EPSCs. A reasonable effort is put into bridging these observations with those extensively reported in in vivo SGNs recordings. Similarities are many and differences are not particularly worrying as experimental conditions cannot be perfectly matched, despite the authors' efforts in minimizing them.

      We would like to thank the reviewer for the appreciation of our work and the comments that helped us to improve our manuscript. We detail our responses to the comments below.

      Weaknesses:

      Some concern surges in relation to mismatches with previous reports of IHC-SGN synapses function. EPSCs at these synapses present a peculiar distribution of amplitudes, shapes, and rates. These characteristics are well-established and some do not seem to be paralleled in this study. Here, amplitude distributions are drastically shifted to smaller values, and rates of events are very low, all compared with previous evidence. The reasons for these discrepancies are unclear. The rate at which spontaneous EPSCs appear is an especially sensitive matter. A great part of the conclusions relies on the definition of which of the SGNs (or should say synapses) belong to the low end and which to the high end in the spectrum of spontaneous rates. The data presented by the authors seem a bit off and the criteria used to classify recordings are not well justified. The authors should clarify the origin of these differences since they do not seem to come from obvious reasons such as animal ages, recording techniques, mouse strain, or even species.

      We understand the reviewer’s concern. We have now included the amplitude distribution of sEPSCs recorded from 12 boutons without patch-clamping the IHC (Figure 2–figure supplement 1, panel A). The rest of the recording conditions (i.e., artificial perilymph-like solution, physiological temperature and age) were identical to the conditions used for the paired recordings. Both the range of spontaneous rate (0 up to 16.33 sEPSC/s) and the amplitude distribution (peak at -40 pA and CV of 0.66) were comparable to the values we obtained when clamping the IHC resting potential at -58 mV. In addition, for two of our pairs, we established the bouton recording first, measured the spontaneous release, then established the perforated patch-clamp of the IHC and measured the spontaneous release again with IHC held at -58 mV. For pair #l300321_1, the SR before clamping the IHC was 0.0125 sEPSC/s, with a maximal AmpsEPSC of -110 pA (avg. -52 pA). The SR while holding the IHC at -58 mV was 0.36 sEPSCs/s, with a maximal AmpsEPSC of -140 pA (avg. -46 pA). For pair #l200522_2, the SR changed from 0.07 sEPSC/s to 0. The maximal AmpsEPSC before clamping the IHC was -70 pA (avg. -31 pA). Overall, our data recorded without controlling the IHC argues against the resting potential of -58 mV as a major source of differences in EPSC rate and amplitudes compared to previous studies.

      Additionally, as noted on the section “Diversity of spontaneous release and their topographical segregation”, our SR values also agree with the range of 0.1 – 16.42 spikes/s reported by Wu et al., (2016) using loose patch recordings from p15-p17 rats. 90% of the paired recordings (and 60% of the bouton recordings) of our dataset were obtained from mice between p14-p17, where spontaneous activity is still low compared to older age groups (p19-p21: 0 – 44.22 spikes/s; p29p32: 0.11 – 54.9 spikes/s Wu et al., 2016; p28: 0 – 47.94 spikes/s, Siebald at al., 2023). There are two additional aspects to consider: i) about 40% of the SGN spikes seem to be generated intrinsically (not activated by an EPSP, ergo an EPSC) at p15-p18 (Wu et al., 2016); and ii) the presence of a spike or EPSC is the sole determinant of a successful recording when the IHC is not stimulated (either by K+ or voltage), thus, these type of experiments undersample fibers with low SR.

      We have included the aforementioned points in the discussion under the section "Diversity of spontaneous release and their topographical segregation”.

      Reviewer #3 (Public Review):

      Summary:

      "Bridging the gap between presynaptic hair cell function and neural sound encoding" by Jaime Tobon and Moser uses patch-clamp electrophysiology in cochlear preparations to probe the pre- and post-synaptic specializations that give rise to the diverse activity of spiral ganglion afferent neurons (SGN). The experiments are quite an achievement! They use paired recordings from pre-synaptic cochlear inner hair cells (IHC) that allow precise control of voltage and therefore calcium influx, with post-synaptic recordings from type I SGN boutons directly opposed to the IHC for both presynaptic control of membrane voltage and post-synaptic measurement of synaptic function with great temporal resolution.

      Strengths

      Any of these techniques by themselves are challenging, but the authors do them in pairs, at physiological temperatures, and in hearing animals, all of which combined make these experiments a real tour de force. The data is carefully analyzed and presented, and the results are convincing. In particular, the authors demonstrate that post-synaptic features that contribute to the spontaneous rate (SR) of predominantly monophasic post-synaptic currents (PSCs), shorter EPSC latency, and higher PSC rates are directly paired with pre-synaptic features such as a lower IHC voltage activation and tighter calcium channel coupling for release to give a higher probability of release and subsequent increase in synaptic depression. Importantly, IHCs paired with Low and High SR afferent fibers had the same total calcium currents, indicating that the same IHC can connect to both low and high SR fibers. These fibers also followed expected organizational patterns, with high SR fibers primarily contacting the pillar IHC face and low SR fibers primarily contacting the modiolar face. The authors also use in vivo-like stimulation paradigms to show different RRP and release dynamics that are similar to results from SGN in vivo recordings. Overall, this work systematically examines many features giving rise to specializations and diversity of SGN neurons.

      We would like to thank the reviewer for the appreciation of our work and the comments that helped us to improve our manuscript. We detail our responses to the comments below.

      Weaknesses / Comments / edits:

      (1) The careful analysis of calcium coupling and EPSC metrics is especially nice. Can the authors speculate as to why different synapses (likely in the same IHC) would have different calcium cooperativity?

      The finding of different apparent Ca2+ cooperativities among IHC synapses is intriguing. Paired pre- and postsynaptic patch-clamp recordings (this work and (Jaime Tobón and Moser, 2023)) and single synapse imaging of presynaptic Ca2+ signals and glutamate release (Özçete and Moser, 2021) jointly support this notion. Both methodologies complement each other. Imaging allows to assess the presynaptic Ca2+ of the specific synapse, while in paired recordings release is related to the whole cell Ca2+ influx. Paired recordings, on the other hand, provide the sensitivity and temporal resolution to assess the initial release rate with short stimuli (2 to 10 ms), which avoids an impact of RRP depletion and ongoing SV replenishment that needs to be considered for the longer stimuli used in imaging (50 ms). Both approaches agree on the finding of tighter coupling of Ca2+ channels and release sites (i.e., lower apparent Ca2+ cooperativity during depolarization within the range of receptor potentials) at pillar synapses. Moreover, the present study took advantage of recording individual release events [which was not achieved by imaging] and further supported the hypothesis that high SR SGNs receive input from active zones with tighter coupling than low SR SGNs. However, our two non-overlapping data sets for paired patch-clamp recordings (this work and (Jaime Tobón and Moser, 2023)) found a narrower range of apparent Ca2+ cooperativities compared to results from single synapse imaging (Özçete and Moser, 2021). This might reflect the technical differences described above. Future studies, potentially combining paired patch-clamp recordings with imaging of presynaptic Ca2+ signals will be needed to scrutinize this aspect.

      We think that the different Ca2+ cooperativities reflect subtle differences in the topography of presynaptic Ca2+ channels and vesicular release sites at the specific IHC active zones. The work of Özçete and Moser (2021) indicated that indeed, apparent Ca2+ cooperativities differ among active zones even within the same inner hair cell. Synaptic heterogeneity within one individual cell can expand its coding capacity. In the case of IHCs, differences in the Ca2+ dependence of synaptic release, in addition to the heterogeneous voltage dependence, appears to diversify the response properties (i.e., synaptic vesicle release probability) of individual synapses to the same stimulus. This is particularly important for sound intensity and temporal coding.

      We have included the aforementioned points in the discussion under the section "Candidate mechanisms distinguishing evoked release at low and high SR synapses”.

      (2) On the bottom of page 6 it would be helpful to mention earlier how many pillar vs modiolar fibers were recorded from, otherwise the skewness of SRs (figure 2H could be thought to be due to predominantly recordings from modiolar fibers. As is, it reads a bit like a cliff-hanger.

      Done!

      (3) The contrasts for some of the data could be used to point out that while significant differences occur between low and high SR fibers, some of these differences are no longer apparent when comparing modiolar vs pillar fibers (eg by contrasting Figure 2C and 2K). This can indicate that indeed there are differences between the fiber activity, but that the activity likely exists in a gradient across the hair cell faces. Pointing this out at the top of page 10 (end of the first paragraph) would be helpful, it would make the seemingly contradictory voltage dependence data easier to understand on first read (voltage-dependence of release is significantly different between different SR fibers (figure 3) but is not significantly different between fibers on different HC faces (figure S3).

      Done!

      (4) It should be acknowledged that although the use of post-hearing animals here (P14-23) ensures that SGN have begun to develop more mature activity patterns (Grant et al 2010), the features of the synapses and SGN activity may not be completely mature (Wu et al 2016 PMID: 27733610). Could this explain some of the 'challenges' (authors' section title) detailed on page 28, first full paragraph?

      Done!

      (5) In the discussion on page 24, the authors compare their recorded SR of EPSCs to measure values in vivo which are higher. Could this indicate that in vivo, the resting membrane potential of IHCs is more depolarized than is currently used for in vitro cochlear experiments?

      That is indeed one possible explanation among others. We have expanded the discussion about the factors that could affect the SR in ex vivo experiments.

      (6) The results showing lower calcium cooperativity of high SR fibers are powerful, but do the authors have an explanation for why the calcium cooperativity of < 2 is different from that (m = 3-4) observed in other manuscripts?

      We assume this question to potentially result from a misunderstanding. Using membrane capacitance measurements and Ca2+ uncaging, Beutner et al. (2001) reported a high intrinsic Ca2+ cooperativity of inner hair cell exocytosis (m = 4-5). Based on this data, it has been proposed that the binding of 4-5 Ca2+ ions is required to trigger the fusion of a synaptic vesicle in IHCs. However, given the shortcoming of Ca2+ uncaging, we and others aimed to further study this aspect using alternative methods. By varying the current of single Ca2+ channels in apical IHCs of hearing mice, several studies reported a high apparent Ca2+ cooperativity (m = 3-5) that is thought to reflect the high intrinsic cooperativity (Brandt et al., 2005; Wong et al., 2014; Özçete and Moser, 2021; Jaime Tobón and Moser, 2023).

      On the other hand, the apparent Ca2+ cooperativity observed upon changing the number of open Ca2+ channels would also reflect the active zone topography (i.e., number and distance of Ca2+ channels to the vesicular release site). In the present study, we used different depolarizations within the range of receptor potentials and found a low apparent Ca2+ cooperativity (m < 2) in 93% of the studied synapses. Other studies in apical IHCs from hearing mice used similar and alternative methods to change the number of open Ca2+ channels and also estimated an apparent cooperativity of < 2 (Brandt et al., 2005; Johnson et al., 2005; Johnson et al., 2007; Wong et al., 2014; Özçete and Moser, 2021; Jaime Tobón and Moser, 2023). The fact that these estimates are smaller than those seen upon changes in single Ca2+ current has been taken to indicate that SV release is governed by one or few Ca2+ channels in nanometer proximity (Ca2+ nanodomain-like control of SV exocytosis), building on classical synapse work (Augustine et al., 1991). 

      In contrast, comparable recordings from mouse IHCs before the onset of hearing (Wong et al., 2014) revealed more similar apparent Ca2+ cooperativities (m ~3) for both changes in the number of open Ca2+ channels and changes in single Ca2+ channel current. This suggests that IHCs before the onset of hearing employ a Ca2+ microdomain-like control of SV exocytosis in which release is governed by the combined activity of several Ca2+ channels in >100 nm distance to the release site. A Ca2+ microdomain-like control of SV exocytosis was also reported for basocochlear IHCs (Johnson et al., 2017).

      Recommendations for the authors:

      As explained in the public reviews of Reviewers 1 and 2, some mismatches between the data presented here and previous reports from the literature have been identified. It is recommended that you discuss those mismatches, perhaps in relation to the choice of patchclamping the hair cells at -58mV.

      We have addressed this point thoroughly in the revised MS. Please see our response to the public review.

      Reviewer #1 (Recommendations For The Authors):

      Minor suggestions and corrections:

      (1) Figures 3 and 4 show beautiful data with paired recordings. Figure 3 shows 10 ms pulses, whereas Fig. 4 shows 100 ms depolarizing pulses. The example in Fig. 3A shows asynchronous release after Ca channel closure, whereas Fig. 4 does not show this so prominently. Was there quite a bit of variability in the asynchronous release from different cell pairs, or was this correlated with pulse duration?

      The asynchronous release is also present after 100 ms depolarizing pulses (please see the updated panel A of Figure 4). However, we have not analysed asynchronous release and think that this would be beyond the scope of the current MS. For clarity, we have added dashed lines in the EPSC traces of Figs. 3 and 4 to indicate the on and off-set of the depolarization.

      (2) Differences in apex and basal IHC ribbon synapse nanodomain to microdomain Ca channel coupling to exocytosis-sensor have been reported also for gerbil IHCs (see Johnson et al., JNeurosci., 2017). This may be worth mentioning since it is another indication of major synaptic diversity in the mammalian cochlea, this time from low to frequency-located IHCs.

      Done

      (3) Page 22: change "hight SR" to "high SR".

      Done

      (4) Page 27: change "addess" to "addressed".

      Done

      Reviewer #2 (Recommendations For The Authors):

      Major points:

      (1) As indicated in methods, recording stretches of 5-10 seconds were used to determine the SR of a given SGN. This seems too short for a reasonable estimate of the SR in these neurons. Also, the reported SRs for these mature mice are not only much lower than those measured in in-vivo SGN extracellular recordings but also compared to those reported in ex-vivo rat recordings. Why this discrepancy? The authors decided to estimate SR by voltage-clamping IHCs at a fixed value of - 58 mV, which they take from Johnson, 2015. I wonder if it is not more reasonable to use a range of IHC holdings and measure SR at those, instead of using a single one. It is hard to visualize a very strong argument for using strictly -58 mV. In addition, mapping out a range of holding potentials could provide additional information on IHCs resting membrane potential in physiological conditions.

      Related to this point, considering that SR values found in the ex-vivo preparation are much lower than those described in in-vivo situations, is it fair to use the same 1 sp/s criteria, as in Taberner & Liberman, to segregate low and high? Shouldn't this value be adjusted to the overall lower SR? This criterion is naturally critical for the consequent evaluation of other SGN properties.

      Finally, on this same problem of IHC Vh, does -58 mV estimate include the 19 mV liquid junction potential? How does it compare with the activation threshold of calcium influx at modiolar vs pillar synapses (see imaging studies)?

      We had proactively discussed the challenges of relating ex vivo and in vivo data in the preprint provided for review. While we consider the outcome of our study helpful for better understanding the relation of afferent synaptic heterogeneity and diverse firing properties of SGNs, we do not claim that the assumptions based on literature (such as on the physiological resting potential) represent ground truth.

      When carefully revising the MS, we have expanded on the discussion to address the points raised here, particularly regarding the lower SR and sEPSC amplitudes. As this and the other reviewer commented in the public review, these experiments were hard to achieve and we consider repeating them with a range of IHC holding potentials (then not only for spontaneous rate of transmission, but also for in depth characterization of evoked release) to be beyond the scope of the present study.

      We do appreciate the suggestion to adjust the distinction between low and high SR given the overall lower rates. However, we would like to refrain from it, as i) we consider it quite arbitrary to define another criterium and ii) we would like to avoid any apparent cherry-picking bias.

      Finally, yes, of course, the -58 mV represent the liquid junction potential corrected holding potential. Our average IHC whole-cell Vhalf ICa (-38.86 mV for high SR and -37.60 mV for low SR) compares well with previous reports of average whole-cell Vhalf ICa (-35.44 mV) and average synaptic Vhalf Rhod-FF (-41.15 mV) (Özçete and Moser, 2021). Additionally, our Vhalf QEPSC distribution (ranging from -53.97 to -31.72 mV) also compares well with the Vhalf iGluSnFR distribution (ranging from -45.25 to -29.86 mV) obtained by imaging of synaptic glutamate release (Özçete and Moser, 2021).

      2) EPSCs amplitude distributions in Figure 2 seem very different from those reported before by Grant et al., 2010 and Niwa et al., 2021 (even Chapochnikov et al., 2014; although not sure if the animal ages match). The average amplitudes of EPSCs reported here, for either pillar or modiolar SGNs, seem way smaller than those reported previously. The authors should provide a convincing explanation for this critical deviation from the consensual results.

      Please refer to our response to the public review (point #3).

      3) Rise time analysis in Fig. 2 supp 1. The actual values seem too long, again, compared to reported values. Also, what would these differences between modiolar and pillar represent?

      Previous reports on mouse, rat, turtle and bullfrog focused mainly on the rise times (or time to peak) of monophasic EPSCs: about 0.39 ms (p8-p11 mouse; Chapochnikov et al., 2014, Takago et al., 2019), 0.33-0.58 ms (p7-p14 rat; Yi at al., 2010, Grant et al., 2010, Glowatzki and Fuchs, 2002), 0.17-0.29 ms (p15-p21 rat; Chapochnikov et al., 2014, Huang and Moser, 2018, Grant et al., 2010), 0.1-0.2 ms (turtle auditory papilla; Schnee et al., 2013) and 0.15-0.2 ms (bullfrog 31ºC and 22ºC; Li et al., 2009, Chen and von Gersdorff, 2019). Regarding multiphasic EPSCs, some studies have reported rise times (or times to peak) of about 1.5 ms (p8-p11 mouse; Takago et al., 2019), 1.1 ms (p8-p11 rat; Grant et al., 2010) and 0.6-0.8 ms (p15-p21 rats; Huang and Moser, 2018, Chapochnikov et al., 2014, Grant et al., 2010). When we factor in the waveform of the sEPSCs, our rise times are comparable to the literature:

      Author response table 1.

      Thus, IHC synapses with higher SR and predominantly located at the pillar side appear to have sEPSCs with faster rise times regardless of their waveform. This might be a consequence of the fusion kinetics of the synaptic vesicles which are tightly influenced by the Ca2+ influx (Huang and Moser, 2018). Additionally, differences in the composition and density of the postsynaptic AMPA receptors could play a role in the rise time of the EPSC (Rubio et al., 2017). 

      4) One of the most impressive observations of the in-vivo SGN physiology is the difference in sound threshold among specific fibers. This can vary over tens of dB of sound pressure levels.

      The representation of this phenomenon when using an ex-vivo preparation is not obvious. Overall, it has been reported that IHC Vm is a good proxy for stimulus intensity. Consequently, the authors reported an 'IHC Vm threshold' at the start of SGN synaptic activity for each recording. This can be found in Figure 3 Eii, where values vary between -65 to -30 mV. This is already an important finding. However, the representative traces on panel A only diverge by 5 mV. It would be very interesting to the reader to have represented in the figure recordings that can better illustrate this wide range of values.

      We agree with the reviewer regarding the impressive difference in the sound thresholds recorded in vivo. To illustrate better illustrate our findings, we have chosen a different representative trace for the high SR synapse.

      5) On the masker-probe experiments it would be interesting to look at the synaptic delay of the probe pulses. Are they different between high and low SR synapses?

      We have now included the results of the synaptic delay of the probe response (Figure 4– supplementary figure 1). Despite not being statistically significant, the eEPSC probe latency of high SR is on average faster than low SR.

      Reviewer #3 (Recommendations For The Authors):

      (1) The terms monophasic and compact are used interchangeably. This is fine, but perhaps compact could be defined earlier, otherwise, readers may think that 'compact' means 'short' (as is sometimes euphemistically used to describe short people), which then makes phrasing such as the figure legend for figure 2 a bit confusing. This could be included at first use in a figure as well, in figure 1B where the two types of EPSCs are first shown.

      Done, now explained and preferentially used monophasic.

      (2) Check for mention of figure panels in the results text - for example, there is no mention in the results text of figure 2A, 2I,

      Done

      (3) The locations of some of the statistics are inconsistent. This is fine if the authors have a reason for including the stats where they did, but in some cases, the stats are duplicated (for example figure 2J, 2K, 2L, the stats are in both the figure legend and the results text, then check throughout).

      Done

      (4) The color coding in figure 4 is confusing in panel A - does orange still mean a high SR fiber here? The legend indicates that orange is for EPSCs, but does not specify charge. It could be helpful to show both a high and low SR response, both for EPSCs and for charge. 

      Thanks for pointing us to this aspect: we have carefully revised the figure and figure legend for clarity. We also included an exemplary response of a low SR synapse in the figure.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Recommendations For The Authors:

      Reviewer #1:

      ●      It might help the reader if you make it explicit that mDES allows you to create an approximate amalgam of different kinds of experiences by assuming that, across individuals, there is a general consensus of experiences at particular points in the movie. Whether this assumption is an accurate reflection of the way in which each individual's brain is an important, testable prediction that could be discussed/examined in different projects. For instance, in other projects there are clear idiosyncratic responses to the same naturalistic stimuli: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8064646/.

      Thank you, this is an excellent point. We have included this article in our revision and expanded on the introduction to emphasize how this study relates to our work. Additionally, we have included an additional figure that helps illustrate how mDES can be used to evaluate the idiosyncrasy for each respective thought component to visually display the variance across moments in the film:

      Page 6-7 [137-148] In our study, we used multi-dimensional experience sampling (mDES) to describe ongoing thought patterns during the movie-watching experience [8]. mDES is an experience sampling method that identifies different features of thought by probing participants about multiple dimensions of their experiences. mDES can provide a description of a person’s thoughts, generating reliable thought patterns across laboratory cognitive tasks [22, 32, 33] and in daily life [34, 35], and is sensitive to accompanying changes in brain activity [24, 36]. Studies that use mDES to describe experience ask participants to provide experiential reports by answering a set of questions about different features of their thought on a continuous scale from 1 (Not at all) to 10 (Completely) [24, 32-41]. Each question describes a different feature of experience such as if their thoughts are oriented in the future or the past, about oneself or other people, deliberate or intrusive in nature, and more (See methods for a full list of questions used in the current study).

      ●      A cartoon describing the mDES technique could be helpful for uninitiated readers.

      Thank you for your suggestion, we have added an additional figure (Figure 3) that illustrates the process of mDES in the laboratory during this experiment, clarifying that participants answer mDES items using a slider to indicate their score (rather than expressing it verbally).

      ●      Did the authors check for any measures of reliability across mDES estimates other than split-half reliability? For instance, the authors could demonstrate construct validity by showing that engagement with certain features of the thought-sampling space aligned with specific points in the movies. If so, the start of the Results section would be a great place to demonstrate the reliability of the approach. For instance, did any two participants sample the same 15-second window of time in a particular stimulus? If so, you could compare their experience samples to determine whether the method was extensible across subjects.

      This is a great point, thank you very much for highlighting this. We have eight individuals at each time point in our analysis, which is probably not enough to calculate meaningful reliability measures. However, we have added a time series analysis of experience in each clip to our revision (Figure 3). In these time plots, it is possible to see clear moments in the film in which scores do not straddle 0 (using 95% CI), and often, these persist across successive moments (Figure 3; see time-series plot four for the clearest example).  When the confidence intervals of a sampling epoch do not overlap with zero, this suggests a high degree of agreement in thought content across participants. At the same time, our analysis shows that individual differences do exist since the relative presence of each component for each participant was linked to objective measures of movie watching (in this case, comprehension). In this revision we have specifically addressed this question by conducting ANOVAs to determine how scores on each component across the clip (See also supplementary table 11). This additional analysis shows that mDES effectively captures shared aspects of movie-watching and is also sensitive to individual variation (since it can describe individual differences).

      Page 15 [304-323]: Next, we examined how each pattern of thought changes across each movie clip. For this analysis, we conducted separate ANOVA for each film clip for the four components (see Table 1 and Figure 3). Clear dynamic changes were observed in several components for different films. We analyzed these data using an Analysis of Variance (ANOVA) in which the time in each clip were explanatory variables of interest. This identified significant change in “Episodic Social Cognition” scores across Little Miss Sunshine, F(1, 712) = 10.80, p = .001, , η2 = .03, and Citizenfour, F(1, 712) = 5.23, p = .023, , η2 = .02. There were also significant change in “Verbal Detail” scores across Little Miss Sunshine, F(1, 712) = 31.79, p <.001, η2 = .09. Lastly, there were significant changes in “Sensory Engagement” scores for both Citizenfour, F(1, 712) = 6.22, p = .013, η2 = .02, and 500 Days of Summer, F(1, 706) = 80.41, p <.001, η2 = .18. These time series are plotted in Figure 3 and highlight how mDES can capture the dynamics of different types of experience across the three movie clips. Moreover, in several of these time series plots, it is clear that thought patterns reported extend beyond adjacent time periods (e.g. scores above zero between time periods 150 to 400 for Sensory Engagement in 500 days of Summer and for time periods between 175 and 225 for Verbal Detail in Little Miss Sunshine). It is important to note that no participant completed experience sampling reports during adjacent sampling points (see Supplementary Figure 7), so the length of these intervals indicates agreement in how specific scenes within a film were experienced and conserved across different individuals. Notably, the component with the least evidence for temporal dynamics was “Intrusive Distraction.”

      ●      P10: "Generation of the thought-space" - how stable are these word clouds to individual subjects? If there are subject-specific differences, are there ways to account for this with some form of normalization?

      Thank you for bringing up this point. Our current goal was to show how the average experience of one group of participants relates to the brain activity of a second group. In this regard it is important to seek the patterns of similarity across individuals in how they experience the film. However, as is normal in our studies using mDES, we can also use the variation from the mean to predict other cognitive measures and, in this way, account for the variability that individuals have in their movie-watching experience. In other words, the word clouds reflect the mean of a particular dimension, so when an individual score is close to 0, their thought content does not align with this dimension -- however, deviating scores, positive or negative, indicating that this dimension provides meaningful information about the individual's experience. Evidence of the meaningful nature of this variation can be seen in the links between the reported thoughts and the individuals’ comprehension (e.g. individuals whose thoughts do not contain strong evidence of “Intrusive Distraction”, or in other words, a negative score, tended to do better on comprehension tests of information in the movies they watched).

      ●      P11: "Variation in thought patterns" - can the authors use a null model here to demonstrate that the associations they've observed would occur above chance levels (e.g., for a comparison of time series with similar temporal autocorrelation but non-preserved semantic structure)? Further, were there any pre-defined hypotheses over whether any of the three different movies would engage any of the 4 observed dimensions?

      This is a great point. We chose to sample from three distinctly different films to help us understand if mDES was sensitive to different semantic and affective features of films. Our analysis, therefore, shows that at a broad level, mDES is able to discriminate between films, highlighting its broad sensitivity to variation in semantic or affective content. Armed with this knowledge, researchers in the future could derive mechanistic insights into how the semantic features may influence the mDES data. For example, future studies could ask participants to watch movies in a scrambled order to understand how varying the structure of semantics or information breaks the mapping between brains and ongoing experience. In this revision we have amended the text to reflect this possibility:

      Page 34 [674-679]. Our analysis shows that mDES is able to discriminate between films, highlighting its broad sensitivity to variation in semantic or affective content. Armed with this knowledge, we propose that in the future, researchers could derive mechanistic insights into how the semantic features may influence the mDES data. For example, it may be possible to ask participants to watch movies in a scrambled order to understand how the structure of semantic or information influences the mapping between brains and ongoing experience as measured by mDES.

      ●      P14: "Brain - Thought Mappings: Voxel-space Analysis" - this is a cool analysis, and a nice validation of the authors' approach. I would personally love to see some form of reliability analysis on these approaches - e.g., do the same locations in the cerebral cortex align with the four features in all three movies? Across subjects?

      This is another great point, and we thank you for your enthusiasm. The data we have has only sampled mDES during a relatively short period of brain activity which we suspect would make an individual-by-individual analysis underpowered. In the future, however, it may be possible to adopt a precision mapping approach in which we sample mDES during longer periods of movie watching and identify how group-level mappings of experience relate to brain activity within a single subject. To reflect this possibility, we have amended the text in this revision in the following way:

      Page 34-35 [672-687]: In addition, our study is correlational in nature, and in the future, it could be useful to generate a more mechanistic understanding of how brain activity maps onto the participants' experience. Our analysis shows that mDES is able to discriminate between films, highlighting its broad sensitivity to variation in semantic or affective content. Armed with this knowledge, we propose that in the future, researchers could derive mechanistic insights into how the semantic features may influence the mDES data. For example, it may be possible to ask participants to watch movies in a scrambled order to understand how the structure of semantic or information influences the mapping between brains and ongoing experience as measured by mDES. Finally, our study focused on mapping group-level patterns of experience onto group-level descriptions of brain activity. In the future, it may be possible to adopt a “precision-mapping” approach by measuring longer periods of experience using mDES and determining how the neural correlates of experience vary across individuals who watched the same movies while brain activity was collected [1]. In the future, we anticipate that the ease with which our method can be applied to different groups of individuals and different types of media will make it possible to build a more comprehensive and culturally inclusive understanding of the links between brain activity and movie-watching experience

      Reviewer #2:

      (1) The three-dimensional scatter plot in Figure 2 does not represent "Intrusive Distraction." Would it make sense to color-code dots by this important dimension?

      Thank you for this suggestion. Although it could be possible to indicate the location of each film in all four dimensions, we were worried that this would make the already complex 3-D space confusing to a naive reader. In this case, we prefer to provide this information in the form of bar graphs, as we did in the previous submission.

      (2) The coloring of neural activation patterns in Figure 3 is not distinct enough between the different dimensions of thought. Please reconsider color intensities or coding. The same applies to the left panel in Figure 4.

      Thanks for this comment; we found it quite difficult to find a colour mapping that allows us to show the distinction between four states in a simple manner, yet we believe it is valuable to show all of the results on a similar brain. Nonetheless, to provide a more fine-grained viewing of our results in this revision we have provided a supplementary figure (Supplementary Figure 6) that shows each of the observed patterns of activity in isolation.

      (3) The new method (mDES) is mentioned too often without explanation, making it hard to follow without referring to the methods section. It would be helpful to state prominently that participants rated their thoughts on different dimensions instead of verbalizing them.

      Thank you for this point, we have adjusted the Introduction to clarify and expand on the mDES method. We have also included an example of the mDES method in an additional figure that we have now included to visually express how participants respond to mDES probes (Figure 3).

      Page 6-7 [136-148]: In our study, we used multi-dimensional experience sampling (mDES) to describe ongoing thought patterns during the movie-watching experience [2]. mDES is an experience sampling method that identifies different features of thought by probing participants about multiple dimensions of their experiences. mDES can provide a description of a person’s thoughts, generating reliable thought patterns across laboratory cognitive tasks [3-5] and in daily life [6, 7], and is sensitive to accompanying changes in brain activity when reports are gained during scanning [8, 9]. Studies that use mDES to describe experience ask participants to provide experiential reports by answering a set of questions about different features of their thought on a continuous scale from 1 (Not at all) to 10 (Completely) [3, 5-14]. Each question describes a different feature of experience, such as if their thoughts are oriented in the future or the past, about oneself or other people, deliberate or intrusive in nature, and more (See Methods for a full list of questions used in the current study).

      Author response image 1.

      (4) Reporting of single-movie thought patterns seems quite extensive. Could this be condensed in the main text?

      Thank you for this point, upon re-visiting the manuscript, we have adjusted the text to be more concise.

      Reviewer #3:

      ●      This is a very elegant experiment and seems like a very promising approach. The text is currently hard to read.

      Thank you for this point, we have since revisited the text and adjusted the manuscript to be more concise and add more clarity.

      ●      The introduction (+ analysis goals) fails to explain the basic aspects of the analysis and dataset. It is not clear how many participants and datapoints were used to establish the group-level thought patterns, nor is it entirely clear that the fMRI data is a separate existing dataset. Some terms are introduced and highlighted and never revisited (e.g decoupled states and the role of the DMN).

      Thank you for this critique, we have since adjusted the introduction to clearly explain the difference between Sample 1 and Sample 2 and further clarify that the fMRI data is an entirely separate, independent sample compared to the laboratory mDES sample:

      Page 7-8 [158-174]: Thus, to overcome this obstacle, we developed a novel methodological approach using two independent sample participants. In the current study, one set of 120 participants was probed with mDES five times across the three ten-minute movie clips (11 minutes total, no sampling in the first minute). We used a jittered sampling technique where probes were delivered at different intervals across the film for different people depending on the condition they were assigned. Probe orders were also counterbalanced to minimize the systematic impact of prior and later probes at any given sampling moment. We used these data to construct a precise description of the dynamics of experience for every 15 seconds of three ten-minute movie clips. These data were then combined with fMRI data from a different sample of 44 participants who had already watched these clips without experience sampling [15]. By combining data from two different groups of participants, our method allows us to describe the time series of different experiential states (as defined by mDES) and relate these to the time series of brain activity in another set of participants who watched the same films with no interruptions. In this way, our study set out to explicitly understand how the patterns of thoughts that dominate different moments in a film in one group of participants relate to the brain activity at these time points in a second set of participants and, therefore, better understand the contribution of different neural systems to the movie-watching experience.

      Page 8-9 [177-188] The goal of our study, therefore, was to understand the association between patterns of brain activity over time during movie clips in one group of participants and the patterns of thought that participants reported at the corresponding moment in a different set of participants (see Figure 1). This can be conceptualized as identifying the mapping between two multi-dimensional spaces, one reflecting the time series of brain activity and the other describing the time series of ongoing experience (see Figure 1 right-hand panel). In our study, we selected three 11-minute clips from movies (Citizenfour, Little Miss Sunshine and 500 Days of Summer) for which recordings of brain data in fMRI already existed (n = 44) [15] (Figure 1, Sample 1). A second set of participants (n = 120) viewed the same movie clips, providing intermittent reports on their thought patterns using mDES (Figure 1, Sample 2). Our goal was to understand the mapping between the patterns of brain activity at each moment of the film and the reports of ongoing thought recorded at the same point in the movies.

      ●      It is unclear what the utility of the method is - is it meant to be done in fMRI studies on the same participants? Or is the idea to use one sample to model another?

      Great point, thank you for highlighting this important question. This paper aimed to interrogate the relationship between experience and neural states while preserving the novelty of movie-watching. Although it could be done in the same sample, it may be difficult to collect frequent reports of experience without interrupting the dynamics of the brain. However, in the future it could be possible to collect mDES and brain activity in the same individuals while they watched movies. For example, our prior studies (e.g. [9]) where we combined mDES with openly-available brain data activity during tasks. In the future, this online method could also be applied during movie watching to identify direct mapping between brain activity and films. However, this online approach would make it very expensive to produce the time series of experience across each clip given that it would require a large number of participants (e.g. 200 as we used in our current study). The following has been included in our manuscript:

      Page 7 [149-159] One challenge that arises when attempting to map the dynamics of thought onto brain activity during movie watching is accounting for the inherently disruptive nature of experience sampling: to measure experience with sufficient frequency to map the dynamics of thoughts during movies would disrupt the natural dynamics of the brain and would also alter the viewer’s experience (for example, by pausing the film at a moment of suspense). Therefore, if we periodically interrupt viewers to acquire a description of their thoughts while recording brain activity, this could impact capturing important dynamic features of the brain. On the other hand, if we measured fMRI activity continuously over movie-watching (as is usually the case), we would lack the capacity to directly relate brain signals to the corresponding experiential states. Thus, to overcome this obstacle, we developed a novel methodological approach using two independent sample participants

      ●      The conclusions currently read as somewhat trivial (e.g "Our study, therefore, establishes both sensory and association cortex as core features of the movie-watching experience", "Our study supports the hypothesis that perceptual coupling between the brain and external input is a core feature of how we make sense of events in movies").

      Thank you for this comment. In this revision we have attempted to extend the theoretical significance of our work in the discussion (for example, in contrasting the links between Intrusive distraction and the other components). To this end we have amended the text in this revision by including the following sections:

      Page 33-35 [654-687]: Importantly, our study provides a novel method for answering these questions and others regarding the brain basis of experiences during films that can be applied simply and cost-effectively. As we have shown mDES can be combined with existing brain activity allowing information about both brain activity and experience to be determined at a relatively low cost.  For example, the cost-effective nature of our paradigm makes it an ideal way to explore the relationship between cognition and neural activity during movie-watching during different genres of film. In neuroimaging, conclusions are often made using one film in naturalistic paradigm studies [16]. Although the current study only used three movie clips, restraining our ability to form strong conclusions regarding how different patterns of thought relate to specific genres of film, in the future, it will be possible to map cognition across a more extensive set of movies and discern whether there are specific types of experience that different genres of films engage. One of the major strengths of our approach, therefore, is the ability to map thoughts across groups of participants across a wide range of movies at a relatively low cost.

      Nonetheless, this paradigm is not without limitations. This is the first study, as far as we know, that attempts to compare experiential reports in one sample of participants with brain activity in a second set of participants, and while the utility of this method enables us to understand the relationship between thought and brain activity during movies, it will be important to extend our analysis to mDES data during movie watching while brain activity is recorded. In addition, our study is correlational in nature, and in the future, it could be useful to generate a more mechanistic understanding of how brain activity maps onto the participants experience. Our analysis shows that mDES is able to discriminate between films, highlighting its broad sensitivity to variation in semantic or affective content. Armed with this knowledge, we propose that in the future, researchers could derive mechanistic insights into how the semantic features may influence the mDES data. For example, it may be possible to ask participants to watch movies in a scrambled order to understand how the structure of semantic or information influences the mapping between brains and ongoing experience as measured by mDES. Finally, our study focused on mapping group-level patterns of experience onto group-level descriptions of brain activity. In the future it may be possible to adopt a “precision-mapping” approach by measuring longer periods of experience using mDES and determining how the neural correlates of experience vary across individuals who watched the same movies while brain activity was collected [1]. In the future, we anticipate that the ease with which our method can be applied to different groups of individuals and different types of media will make it possible to build a more comprehensive and culturally inclusive understanding of the links between brain activity and movie-watching experience

      ●      The beginning of the discussion is very clear and explains the study very well. Some of it could be brought up in the intro/analysis goal sections.

      Thank you for this comment, this is an excellent idea. We have revisited the introduction and analysis goals section to mirror this clarity across the manuscript.

      ●      The different components are very interesting, and not entirely clear. Some examples in the text could help. Especially regarding your thought that verbal components would refer to a "decoupled" mental verbal analysis participants might be performing in their thoughts.

      Thank you for this point. We would prefer not to elaborate on this point since, at present, it would simply be conjecture based on our correlational design. However, we have included a section in the discussion which explains how, in principle, we would draw more mechanistic conclusions (for example, by shuffling the order of scenes in a movie as suggested by another reviewer). In the current revision, we have amended the text in the following way:

      Page 34 [674-679]: Our analysis shows that mDES is able to discriminate between films, highlighting its broad sensitivity to variation in semantic or affective content. Armed with this knowledge, we propose that in the future, researchers could derive mechanistic insights into how the semantic features may influence the mDES data. For example, it may be possible to ask participants to watch movies in a scrambled order to understand how the structure of semantic or information influences the mapping between brains and ongoing experience as measured by mDES

      ●      The reference to using neurosynth as performing a meta-analysis seems a little stretched.

      We have adjusted the manuscript to remove ‘meta-analysis’ when referring to the analysis computed with neurosynth. Thank you for bringing this to our attention.

      ●      State-space is defined as brain-space in the methods.

      Thank you, we have since updated this.

      ●      It could be useful to remind the reader what thought and brain spaces are at the top of the state-space results section.

      This is an excellent point, and it has since been updated to remind the reader of thought- and brain-space. Thank you for this comment.

      Page 24 [458-467]: Our next analysis used a “state-space” approach to determine how brain activity at each moment in the film predicted the patterns of thoughts reported at these moments (for prior examples in the domain of tasks, see [12, 17], See Methods). In this analysis, we used the coordinates of the group average of each TR in the “brain-space” and the coordinates of each experience sampling moment in the “thought-space.”. To clarify, the location of a moment in a film in “brain-space” is calculated by projecting the grand mean of brain activity for each volume of each film against the first five dimensions of brain activity from a decomposition of the Human Connectome Project (HCP) resting state data, referred to as Gradients 1-5. “Thought-space” is the decomposition of mDES items to create thought pattern components, referred to as “Episodic Knowledge”, “Intrusive Distraction”, “Verbal Detail” and “Sensory Engagement.”

      ●      DF missing from the t-test for episodic knowledge/grad 4.

      Thank you for catching this, the degrees of freedom has since been included in this revision.

      Page 24 [474-476]: First, we found a significant main effect of Gradient 4 (DAN to Visual), which predicted the similarity of answers to the “Episodic Knowledge” component, t(2046) = 2.17, p = .013, η2 = .01.

      Public Reviews:

      Reviewer #1:

      ●      The lack of direct interrogation of individual differences/reliability of the mDES scores warrants some pause.

      Our study's goal was to understand how group-level patterns of thought in one group of participants relate to brain activity in a different group of participants. To this end, we decomposed trial-level mDES data to show dimensions that are common across individuals, which demonstrated excellent split-half reliability. Then we used these data in two complementary ways. First, we established that these ratings reliably distinguished between the different films (showing that our approach is sensitive to manipulations of semantic and affective features in a film) and that these group-level patterns were also able to predict patterns of brain activity in a different group of participants (suggesting that mDES dimensions are also sensitive to broad differences in how brain activity emerges during movie watching). Second, we established that variation across individuals in their mDES scores predicted their comprehension of information from the films. This establishes that when applied to movie-watching, mDES is sensitive to individual differences in the movie-watching experience (as determined by an individual's comprehension). Given the success of this study and the relative ease with which mDES can be performed, it will be possible in the future to conduct mDES studies that hone in on the common and distinct features of the movie-watching experience.

      Reviewer #2:

      (1) The dimensions of thought seem to distinguish between sensory and executive processing states. However, it is unclear if this effect primarily pertains to thinking. I could imagine highly intrusive distractions in movie segments to correlate with stagnating plot development, little change in scenery, or incomprehensible events. Put differently, it may primarily be the properties of the movies that evoke different processing modes, but these properties are not accounted for. For example, I'm wondering whether a simple measure of engagement with stimulus materials could explain the effects just as much. How can the effects of thinking be distinguished from the perceptual and semantic properties of the movie, as well as attentional effects? Is the measure used here capturing thought processes beyond what other factors could explain?

      Our study used mDES to identify four distinct components of experience, each of which had distinct behavioural and neural correlates and relationships to comprehension. Together this makes it unlikely that a single measure of engagement would be able to capture the range of effects we observed in our study. For example, “Intrusive Distraction” was associated with regions of association cortex, while the other three components highlighted regions of sensory cortex. Behaviorally, we found that some components had a common effect on comprehension (e.g. “Intrusive distraction” was related to worse comprehension across all films), while others were linked to clear benefits to comprehension in specific films (e.g. “Episodic Knowledge” was associated with better comprehension in only one of the films). Given the complex nature of these effects, it would be difficult for a single metric of engagement to explain this pattern of results, and even if it did, this could be misleading because our analysis implies that they are better explained by a model of movie-watching experience in which there are several relatively orthogonal dimensions upon which our experience can vary.

      At the same time, we also found that films vary in the general types of experience they can engender. For example, Citizenfour was high on “Intrusive Distraction” and participants performed relatively low on comprehension. This shows that manipulations of the semantic and affective content of films also have implications for the movie-watching experience. This pattern is consistent with laboratory studies that applied mDES during tasks and found that different tasks evoke different types of experience (for example, patterns of ‘intrusive’ thoughts were common in movie clips that were suspenseful, [18]). At the same time, in the same study, patterns of intrusive thought across the tasks were also associated with trait levels of dysphoria reported by participants. Other studies using mDES in daily life have shown that the data can be described by multiple dimensions and that each of these types of thought is more prevalent in certain activities than others ([19]). For example, in daily life, patterns of ‘intrusive distraction’ thoughts were more prevalent when individuals were engaged in activities that were relatively unengaging (such as resting). Collectively, therefore, studies using mDES suggest that is likely that human thought is multidimensional in nature and that these dimensions vary in a complex way in terms of (a) the contexts that promote them, and (b) how they are impacted by features of the individual (whether they be traits like anxiety or depression or memory for information in a film).

      (2) I'm skeptical about taking human thought ratings at face value. Intrusive distraction might imply disengagement from stimulus materials, but it could also be an intended effect of the movie to trigger higher-level, abstract thinking. Can a label like intrusive distraction be misleading without considering the actual thought and movie content?

      Our method uses a data-driven approach to identify the dimensions that best describe the range of answers that our participants provided to describe their experience. We use these dimensions to understand how these patterns of thought emerge in different contexts and how they vary across individuals (in this case, in different movies, but in other studies, laboratory tasks [3, 8, 9, 12, 20-22] or activities in daily life[6, 7]). These context relationships help constrain interpretations of what the components mean. For example, “Intrusive Distraction” scores were highest in the film with the most real-world significance for the participants (Citizenfour) and were associated with worse comprehension. In daily life, however, patterns of “Intrusive Distraction” thoughts tend to occur when activities engage in non-demanding activities, like resting. Psychological perspectives on thoughts that arise spontaneously occur in this manner since there is evidence that they occur in non-demanding tasks with no semantic content (when there is almost no external stimulus to explain the occurrence of the experience, see [23]), however, other studies have shown that specific cues in the environment can also cue the experience (see [23]). Consistent with this perspective, and our current data, patterns of ‘Intrusive Distraction’ thought are likely to arise for multiple reasons, some of which are more intrinsic in nature (the general association with poor comprehension across all films) and others which are extrinsic in nature (the elevation of intrusive distraction in Citizenfour).

      It is also important to note that our data-driven approach also found patterns of experience that provide more information about the content of their experience, for example, the dimension of “Episodic Knowledge” is characterized by thoughts based on prior knowledge, involving the past, and concerning oneself, and was most prevalent in the romance film (500 Days of Summer). Likewise, “Sensory Engagement” was associated with experiences related to sensory input and positive emotionality and occurred more during the romance movie (500 Days of Summer) than in the documentary (Citizenfour) and was linked to increased brain activity across the sensory systems. This shows that mDES can also provide information about the content of that experience, and discriminate between different sources of experience. In the future, it will be possible to improve the level of detail regarding the content of experiences by changing the questions used to interrogate experience.     

      (3) A jittered sampling approach is used to acquire thought ratings every 15 seconds. Are ratings for the same time point averaged across participants? If so, how consistent are ratings among participants? High consistency would suggest thoughts are mainly stimulus-evoked. Low consistency would question the validity of applying ratings from one (group of) participant(s) to brain-related analyses of another participant.

      In this experiment, we sampled experience every 15 seconds in each clip, and in each sampling epoch, we gained mDES responses from eight participants. Furthermore, no participant was sampled at an adjacent time point, as our approach jittered probes approximately 2 minutes apart (See Supplementary Figure 7). To illustrate the consistency of mDES data, we have included an additional figure (Figure 3) highlighting how experience varies over time in each clip. It is evident from these plots that there are distinct moments in which group-averaged reported thoughts across participants are stable and that these can extend across adjacent sampling points (i.e. when the confidence intervals of the score at a timepoint do not overlap with zero). Therefore, in some cases, adjacent sampling points, consisting of different sets of eight participants, describe their experiences as having similar positions on the same mDES dimension. This suggests that there is agreement among individuals regarding how they experienced a specific moment in a film, and in some cases, this agreement was apparent in successive sets of eight participants. Together, our findings indicate a conservation of agreement across participants that spans multiple moments in a film. A clear example of agreement on experience across multiple sets of 10 participants can be seen between 150-400 seconds in the clip from 500 Days of Summer for the dimension of “Sensory Engagement” (time series plot 4 in Figure 3).

      (4) Using three different movies to conclude that different genres evoke different thought patterns (e.g., line 277) seems like an overinterpretation with only one instance per genre.

      We found that mDES was able to distinguish between each film on at least one dimension of experience. In other words, information encoded in the mDES dimensions was sensitive to variation in semantic and affective experiences in the different movie clips. This provides evidence that is necessary but not sufficient to conclude that we can distinguish different genres of films (i.e. if we could not distinguish between films, then we would not be able to distinguish genres). However, it is correct that to begin answering the broader question about experiences in different genres then it would be necessary to map cognition across a larger set of movies, ideally with multiple examples of each genre.

      (5) I see no indication that results were cross-validated, and no effect sizes are reported, leaving the robustness and strength of effects unknown.

      Thank you for drawing this to our attention. We have re-run the LMMs and ANOVA models to include partial eta-squared values to clarify the strength of the effects in each of our reported outcomes.

      Reviewer #3:

      ●      What are the considerations for treating high-order thought patterns that occur during film viewing as stable enough to be used across participants? What would be the limitations of this method? (Do all people reading this paper think comparable thoughts reading through the sections?)

      It is likely, based on our study, that films can evoke both stereotyped thought patterns (i.e. thoughts that many people will share) and others that are individualistic. It is clear that, in principle, mDES is capable of capturing empirical information on both stereotypical thoughts and idiosyncratic thoughts. For example, clear differences in experiences across films and, in particular, during specific periods within a film, show that movie-watching can evoke broadly similar thought patterns in different groups of participants (see Figure 3 right-hand panel). On the other hand, the association between comprehension and the different mDES components indicate that certain individuals respond to the same film clip in different ways and that these differences are rooted in objective information (i.e. their memory of an event in a film clip). A clear example of these more idiosyncratic features of movie watching experience can be seen in the association between “Episodic Knowledge” and comprehension. We found that “Episodic Knowledge” was generally high in the romance clip from 500 Days of Summer but was especially high for individuals who performed the best, indicating they remembered the most information. Thus good comprehends responded to the 500 Days of Summer clip with responses that had more evidence of “Episodic Knowledge” In the future, since the mDES approach can account for both stereotyped and idiosyncratic features of experience, it will be an important tool in understanding the common and distinct features that movie watching experiences can have, especially given the cost effective manner with which these studies can be run.   

      ●      How does this approach differ from collaborative filtering, (for example as presented in Chang et al., 2021)?

      Our study is very similar to the notion of collaborative filtering since we can use an approach that is similar to crowd-sourcing as a tool for understanding brain activity. One of its strengths is its generalizability since it is also a method that can be used to understand cognition because it is not limited to movie-watching. We can use the same mDES method to sample cognition in multiple situations in daily life ([6, 19]), while performing tasks in the behavioural lab [18, 24], and while brain activity is being acquired [8, 25, 26]. In principle, therefore, we can use mDES to understand cognition in different contexts in a common analytic space (see [27] for an example of how this could work)

      Page 5 [106-110]: In our study, we acquired experiential data in one group of participants while watching a movie clip and used these data to understand brain activity recorded in a second set of participants who watched the same clip and for whom no experiential data was recorded. This approach is similar to what is known as “collaborative filtering” [28].

      ●      In conclusion, this study tackles a highly interesting subject and does it creatively and expertly. It fails to discuss and establish the utility and appropriateness of its proposed method.

      Thank you very much for your feedback and critique. In our revision and our responses to these questions, we provided more information about the method's robustness utility and application to understanding cognition.

      References

      (1) Gordon, E.M., et al., Precision Functional Mapping of Individual Human Brains. Neuron, 2017. 95(4): p. 791-807.e7.

      (2) Smallwood, J., et al., The neural correlates of ongoing conscious thought. Iscience, 2021. 24(3).

      (3) Konu, D., et al., Exploring patterns of ongoing thought under naturalistic and conventional task-based conditions. Consciousness and Cognition, 2021. 93.

      (4) Smallwood, J., et al., The default mode network in cognition: a topographical perspective. Nature Reviews Neuroscience, 2021. 22(8): p. 503-513.

      (5) Turnbull, A., et al., Age-related changes in ongoing thought relate to external context and individual cognition. Consciousness and Cognition, 2021. 96: p. 103226.

      (6) McKeown, B., et al., The impact of social isolation and changes in work patterns on ongoing thought during the first COVID-19 lockdown in the United Kingdom. Proceedings of the National Academy of Sciences, 2021. 118(40): p. e2102565118.

      (7) Mulholland, B., et al., Patterns of ongoing thought in the real world. Consciousness and Cognition, 2023. 114: p. 103530.

      (8) Konu, D., et al., A role for the ventromedial prefrontal cortex in self-generated episodic social cognition. NeuroImage, 2020. 218: p. 116977.

      (9) Turnbull, A., et al., Left dorsolateral prefrontal cortex supports context-dependent prioritisation of off-task thought. Nature Communications, 2019. 10.

      (10) Ho, N.S.P., et al., Facing up to the wandering mind: Patterns of off-task laboratory thought are associated with stronger neural recruitment of right fusiform cortex while processing facial stimuli. NeuroImage, 2020. 214: p. 116765.

      (11) Karapanagiotidis, T., et al., Tracking thoughts: Exploring the neural architecture of mental time travel during mind-wandering. NeuroImage, 2017. 147: p. 272-281.

      (12) McKeown, B., et al., Experience sampling reveals the role that covert goal states play in task-relevant behavior. Scientific Reports, 2023. 13(1): p. 21710.

      (13) Vatansever, D., et al., Distinct patterns of thought mediate the link between brain functional connectomes and well-being. Network Neuroscience, 2020. 4(3): p. 637-657.

      (14) Wang, H.-T., et al., Dimensions of Experience: Exploring the Heterogeneity of the Wandering Mind. Psychological Science, 2017. 29(1): p. 56-71.

      (15) Aliko, S., et al., A naturalistic neuroimaging database for understanding the brain using ecological stimuli. Scientific Data, 2020. 7(1).

      (16) Yang, E., et al., The default network dominates neural responses to evolving movie stories. Nature Communications, 2023. 14(1): p. 4197.

      (17) Turnbull, A., et al., Reductions in task positive neural systems occur with the passage of time and are associated with changes in ongoing thought. Scientific Reports, 2020. 10(1): p. 9912.

      (18) Konu, D., et al., Exploring patterns of ongoing thought under naturalistic and conventional task-based conditions. Consciousness and cognition, 2021. 93: p. 103139.

      (19) Mulholland, B., et al., Patterns of ongoing thought in the real world. Consciousness and cognition, 2023. 114: p. 103530.

      (20) Christoff, K., et al., Experience sampling during fMRI reveals default network and executive system contributions to mind wandering. Proc Natl Acad Sci U S A, 2009. 106(21): p. 8719-24.

      (21) Zhang, M., et al., Perceptual coupling and decoupling of the default mode network during mind-wandering and reading. eLife, 2022. 11: p. e74011.

      (22) Zhang, M.C., et al., Distinct individual differences in default mode network connectivity relate to off-task thought and text memory during reading. Scientific Reports, 2019. 9.

      (23) Smallwood, J. and J.W. Schooler, The science of mind wandering: Empirically navigating the stream of consciousness. Annual review of psychology, 2015. 66(1): p. 487-518.

      (24) Turnbull, A., et al., The ebb and flow of attention: Between-subject variation in intrinsic connectivity and cognition associated with the dynamics of ongoing experience. Neuroimage, 2019. 185: p. 286-299.

      (25) Turnbull, A., et al., Left dorsolateral prefrontal cortex supports context-dependent prioritisation of off-task thought. Nature communications, 2019. 10(1): p. 3816.

      (26) Mckeown, B., et al., Experience sampling reveals the role that covert goal states play in task-relevant behavior. Scientific reports, 2023. 13(1): p. 21710.

      (27) Chitiz, L., et al., Mapping cognition across lab and daily life using experience-sampling. 2023.

      (28) Chang, L.J., et al., Endogenous variation in ventromedial prefrontal cortex state dynamics during naturalistic viewing reflects affective experience. Science Advances, 2021. 7(17): p. eabf7129.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      This is an interesting study investigating the mechanisms underlying membrane targeting of the NLRP3 inflammasome and reporting a key role for the palmitoylation-depalmitoylation cycle of cys130 in NRLP3. The authors identify ZDHHC3 and APT2 as the specific ZDHHC and APT/ABHD enzymes that are responsible for the s-acylation and de-acylation of NLRP3, respectively. They show that the levels of ZDHHC3 and APT2, both localized at the Golgi, control the level of palmitoylation of NLRP3. The S-acylation-mediated membrane targeting of NLRP3 cooperates with polybasic domain (PBD)-mediated PI4P-binding to target NLRP3 to the TGN under steady-state conditions and to the disassembled TGN induced by the NLRP3 activator nigericin.

      However, the study has several weaknesses in its current form as outlined below.

      (1) The novelty of the findings concerning cys130 palmitoylation in NLRP3 is unfortunately compromised by recent reports on the acylation of different cysteines in NLRP3 (PMID: 38092000), including palmitoylation of the very same cys130 in NLRP3 (Yu et al https://doi.org/10.1101/2023.11.07.566005), which was shown to be relevant for NLRP3 activation in cell and animal models. What remains novel and intriguing is the finding that NLRP3 activators induce an imbalance in the acylation-deacylation cycle by segregating NLRP3 in late Golgi/endosomes from de-acylating enzymes confined in the Golgi. The interesting hypothesis put forward by the authors is that the increased palmitoylation of cys130 would finally contribute to the activation of NLRP3. However, the authors should clarify the trafficking pathway of acylated-NLRP3. This pathway should, in principle, coincide with that of TGN46 which constitutively recycles from the TGN to the plasma membrane and is trapped in endosomes upon treatment with nigericin. 

      We think the data presented in our manuscript are consistent with the majority of S-acylated NLRP3 remaining on the Golgi via S-acylation in both untreated and nigericin treated cells. We have performed an experiment with BrefeldinA (BFA), a fungal metabolite that disassembles the Golgi without causing dissolution of early endosomes, that further supports the conclusion that NLRP3 predominantly resides on Golgi membranes pre and post activation. Treatment of cells with BFA prevents recruitment of NLRP3 to the Golgi in untreated cells and blocks the accumulation of NLRP3 on the structures seen in the perinuclear area after nigericin treatment (see new Supplementary Figure 4A-D). We do see some overlap of NLRP3 signal with TGN46 in the perinuclear area after nigericin treatment (see new Supplementary Figure 2E), however this likely represents TGN46 at the Golgi rather than endosomes given that the NLRP3 signal in this area is BFA sensitive.  As with 2-BP and GFP-NLRP3C130S, GFP-NLRP3 spots also form in BFA / nigericin co-treated cells but not with untagged NLRP3. These spots also do not show any co-localisation with EEA1, suggesting that under these conditions, endosomes don’t appear to represent a secondary site of NLRP3 recruitment in the absence of an intact Golgi. However, we cannot completely rule out that some NLRP3 may recruited to endosomes at some point during its activation.

      (2) To affect the S-acylation, the authors used 16 hrs treatment with 2-bromopalmitate (2BP). In Figure 1f, it is quite clear that NLRP3 in 2-BP treated cells completely redistributed in spots dispersed throughout the cells upon nigericin treatment. What is the Golgi like in those cells? In other words, does 2-BP alter/affect Golgi morphology? What about PI4P levels after 2-BP treatment? These are important missing pieces of data since both the localization of many proteins and the activity of one key PI4K in the Golgi (i.e. PI4KIIalpha) are regulated by palmitoylation.

      We thank the reviewer for highlighting this point and agree that it is possible the observed loss of NLRP3 from the Golgi might be due to an adverse effect of 2-BP on Golgi morphology or PI4P levels. We have tested the effect of 2-BP on the Golgi markers GM130, p230 and TGN46. 2BP has marginal effects on Golgi morphology with cis, trans and TGN markers all present at similar levels to untreated control cells (Supplementary Figure 2B-D). We also tested the effect of 2-BP on PI4P levels using mCherry-P4M, a PI4P biosensor. Surprisingly, as noted by the reviewer, despite recruitment of PI4K2A being dependent on S-acylation, PI4P was still present on the Golgi after 2-BP treatment, suggesting that a reduction in Golgi PI4P levels does not underly loss of NLRP3 from the Golgi (Supplementary Figure 2A). The pool of PI4P still present on the Golgi following 2-BP treatment is likely generated by other PI4K enzymes that localise to the Golgi independently of S-acylation, such as PI4KIIIB. We have included this data in our manuscript as part of a new Supplementary Figure 2. 

      (3) The authors argue that the spots observed with NLRP-GFP result from non-specific effects mediated by the addition of the GFP tag to the NLRP3 protein. However, puncta are visible upon nigericin treatment, as a hallmark of endosomal activation. How do the authors reconcile these data? Along the same lines, the NLRP3-C130S mutant behaves similarly to wt NLRP3 upon 2-BP treatment (Figure 1h). Are those NLRP3-C130S puncta positive for endosomal markers? Are they still positive for TGN46? Are they positive for PI4P?

      This is a fair point given the literature showing overlap of NLRP3 puncta formed in response to nigericin with endosomal markers and the similarity of the structures we see in terms of size and distribution to endosomes after 2BP + nigericin treatment. We have tested whether these puncta overlap with EEA1, TGN46 or PI4P (Supplementary Figure 2A, E-G). The vast majority of spots formed by GFP-NLRP3 co-treated with 2-BP and nigericin do not co-localise with EEA1, TGN46 or PI4P. This is consistent with these spots potentially being an artifact, although it has recently been shown that human NLRP3 unable to bind to the Golgi can still respond to nigericin (Mateo-Tórtola et al., 2023). These puncta might represent a conformational change cytosolic NLRP3 undergoes in response to stimulation, although our results suggest that this doesn’t appear to happen on endosomes.

      (4) The authors expressed the minimal NLRP3 region to identify the domain required for NLRP3 Golgi localization. These experiments were performed in control cells. It might be informative to perform the same experiments upon nigericin treatment to investigate the ability of NLRP3 to recognize activating signals. It has been reported that PI4P increases on Golgi and endosomes upon NG treatment. Hence, all the differences between the domains may be lost or preserved. In parallel, also the timing of such recruitment upon nigericin treatment (early or late event) may be informative for the dynamics of the process and of the contribution of the single protein domains.

      This is an interesting point which we thank the reviewer for highlighting. However, we think that each domain on its own is not capable of responding to nigericin as shown by the effect of mutations in helix115-125 or the PB region in the full-length NLRP3 protein. NLRP3HF, which still contains a functional PB region, isn’t capable of responding to nigericin in the same way as wild type NLRP3 (Supplementary Figure 6C-D). Similarly, mutations in the PB region of full length NLRP3 that leave helix115-125 intact show that helix115-125 is not sufficient to allow enhanced recruitment of NLRP3 to Golgi membranes after nigericin treatment (Supplementary Figure 9A). We speculate that helix115-125, the PB region and the LRR domain all need to be present to provide maximum affinity of NLRP3 for the Golgi prior to encounter with and S-acylation by ZDHHC3/7. Mutation or loss of any one of the PB region, helix115-125 or the LRR lowers NLRP3 membrane affinity, which is reflected by reduced levels of NLRP3 captured on the Golgi by S-acylation at steady state and in response to nigericin. 

      (5) As noted above for the chemical inhibitors (1) the authors should check the impact of altering the balance between acyl transferase and de-acylases on the Golgi organization and PI4P levels. What is the effect of overexpressing PATs on Golgi functions?

      We have checked the effect of APT2 overexpression on Golgi morphology and can show that it has no noticeable effect, ruling out an impact of APT on Golgi integrity as the reason for loss of NLRP3 from the Golgi in the presence of overexpressed APT2. We have included these images as Supplementary Figure 11H-J. 

      It is plausible that the effects of ZDHHC3 or ZDHHC7 on enhanced recruitment of NLRP3 to the Golgi may be via an effect on PI4P levels since, as mentioned above, both enzymes are involved in recruitment of PI4K2A to the Golgi and have previously been shown to enhance levels of PI4K2A and PI4P on the Golgi when overexpressed (Kutchukian et al., 2021). However, NLRP3 mutants with most of the charge removed from the PB region, which are presumably unable to interact with PI4P or other negatively charged lipids, are still capable of being recruited to the Golgi by excess ZDHHC3. This would suggest that the effect of overexpressed ZDHHC3 on NLRP3 is largely independent of changes in PI4P levels on the Golgi and instead driven by helix115-125 and S-acylation at Cys-130. The latter point is supported by the observation that NLRP3HF and NLRP3Cys130 are insensitive to ZDHHC3 overexpression.

      At the levels of HA-ZDHHC3 used in our experiments with NLRP3 (200ng pEF-Bos-HAZDHHC3 / c.a. 180,000 cells) we don’t see any adverse effect on Golgi morphology (Author response image 1), although it has been noted previously by others that higher levels of ZDHHC3 can have an impact on TGN46 (Ernst et al., 2018). ZDHHC3 overexpression surprisingly has no adverse effects on Golgi function and in fact enhances secretion from the Golgi (Ernst et al., 2018).  

      Author response image 1.

      Overexpression of HA-ZDHHC3 does not impact Golgi morphology. A) Representative confocal micrographs of HeLaM cells transfected with 200 ng HA-ZDHHC3 fixed and stained with antibodies to STX5 or TGN46. Scale bars = 10 µm. 

      Reviewer #2 (Public Review):

      Summary:

      This paper examines the recruitment of the inflammasome seeding pattern recognition receptor NLRP3 to the Golgi. Previously, electrostatic interactions between the polybasic region of NLRP3 and negatively charged lipids were implicated in membrane association. The current study reports that reversible S-acylation of the conserved Cys-130 residue, in conjunction with upstream hydrophobic residues plus the polybasic region, act together to promote Golgi localization of NLRP3, although additional parts of the protein are needed for full Golgi localization. Treatment with the bacterial ionophore nigericin inhibits membrane traffic and prevents Golgi-associated thioesterases from removing the acyl chain, causing NLRP3 to become immobilized at the Golgi. This mechanism is put forth as an explanation for how NLRP3 is activated in response to nigericin.

      Strengths:

      The experiments are generally well presented. It seems likely that Cys-130 does indeed play a previously unappreciated role in the membrane association of NLRP3.

      Weaknesses:

      The interpretations about the effects of nigericin are less convincing. Specific comments follow.

      (1) The experiments of Figure 4 bring into question whether Cys-130 is S-acylated. For Cys130, S-acylation was seen only upon expression of a severely truncated piece of the protein in conjunction with overexpression of ZDHHC3. How do the authors reconcile this result with the rest of the story?

      Providing direct evidence of S-acylation at Cys-130 in the full-length protein proved difficult. We attempted to detect S-acylation of this residue by mass spectrometry. However, the presence of the PB region and multiple lysines / arginines directly after Cys-130 made this approach technically challenging and we were unable to convincingly detect S-acylation at Cys-130 by M/S. However, Cys-130 is clearly important for membrane recruitment as its mutation abolishes the localisation of NLRP3 to the Golgi. It is feasible that it is the hydrophobic nature of the cysteine residue itself which supports localisation to the Golgi, rather than S-acylation of Cys-130. A similar role for cysteine residues present in SNAP-25 has been reported (Greaves et al., 2009). However, the rest of our data are consistent with Cys-130 in NLRP3 being S-acylated. We also refer to another recently published study which provides additional biochemical evidence that mutation of Cys-130 impacts the overall levels of NLRP3 S-acylation (Yu et al., 2024). 

      (2) Nigericin seems to cause fragmentation and vesiculation of the Golgi. That effect complicates the interpretations. For example, the FRAP experiment of Figure 5 is problematic because the authors neglected to show that the FRAP recovery kinetics of nonacylated resident Golgi proteins are unaffected by nigericin. Similarly, the colocalization analysis in Figure 6 is less than persuasive when considering that nigericin significantly alters Golgi structure and could indirectly affect colocalization. 

      We agree that it is likely that the behaviour of other Golgi resident proteins are altered by nigericin. This is in line with a recent proteomics study showing that nigericin alters the amount of Golgi resident proteins associated with the Golgi (Hollingsworth et al., 2024) and other work demonstrating that changes in organelle pH can influence the membrane on / off rates of Rab GTPases (Maxson et al., 2023). However, Golgi levels of other peripheral membrane proteins

      that associate with the Golgi through S-acylation, such as N-Ras, appear unaltered (Author response image 2.), indicating a degree of selectivity in the proteins affected. Our main point here is that NLRP3 is amongst those proteins whose behaviour on the Golgi is sensitive to nigericin and that this change in behaviour may be important to the NLRP3 activation process, although this requires further investigation and will form the basis of future studies. 

      The reduction in co-localisation between NLRP3 and APT2, due to alterations in Golgi organisation and trafficking, was the point we were trying to make with this figure, and we apologise if this was not clear. We think that the changes in Golgi structure and function caused by nigericin potentially affect the ability of APT2 to encounter NLRP3 and de-acylate it. We have added a new paragraph to the results section to hopefully explain this more clearly. We recognise that our results supporting this hypothesis are at present limited and we have toned down the language used in the results section to reflect the nature of these findings..  

      Author response image 2.

      S-acylated peripheral membrane proteins show differential sensitivity to nigericin. A) Representative confocal micrographs of HeLaM cells coexpressing GFP-NRas and an untagged NLRP3 construct. Cells were left untreated or treated with 10 µM nigericin for 1 hour prior to fixation. Scale bars = 10 µm. B) Quantification of GFP-NRas or NLRP3 signal in the perinuclear region of cells treated with or without nigericin

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      (1) Does overnight 2-BP treatment potentially have indirect effects that could prevent NLRP3 recruitment? It would be useful here to show some sort of control confirming that the cells are not broadly perturbed.

      Please see our response to point (2) raised by reviewer #1 which is along similar lines. 

      (2) In Figure 5, "Veh" presumably is short for "Vehicle". This term should be defined in the legend.

      We have now corrected this.

      References

      Ernst, A.M., S.A. Syed, O. Zaki, F. Bottanelli, H. Zheng, M. Hacke, Z. Xi, F. Rivera-Molina, M. Graham, A.A. Rebane, P. Bjorkholm, D. Baddeley, D. Toomre, F. Pincet, and J.E. Rothman. 2018. SPalmitoylation Sorts Membrane Cargo for Anterograde Transport in the Golgi. Dev Cell. 47:479-493 e477.

      Greaves, J., G.R. Prescott, Y. Fukata, M. Fukata, C. Salaun, and L.H. Chamberlain. 2009. The hydrophobic cysteine-rich domain of SNAP25 couples with downstream residues to mediate membrane interactions and recognition by DHHC palmitoyl transferases. Mol Biol Cell. 20:1845-1854.

      Hollingsworth, L.R., P. Veeraraghavan, J.A. Paulo, J.W. Harper, and I. Rauch. 2024. Spatiotemporal proteomic profiling of cellular responses to NLRP3 agonists. bioRxiv.

      Kutchukian, C., O. Vivas, M. Casas, J.G. Jones, S.A. Tiscione, S. Simo, D.S. Ory, R.E. Dixon, and E.J. Dickson. 2021. NPC1 regulates the distribution of phosphatidylinositol 4-kinases at Golgi and lysosomal membranes. EMBO J. 40:e105990.

      Mateo-Tórtola, M., I.V. Hochheiser, J. Grga, J.S. Mueller, M. Geyer, A.N.R. Weber, and A. TapiaAbellán. 2023. Non-decameric NLRP3 forms an MTOC-independent inflammasome. bioRxiv:2023.2007.2007.548075.

      Maxson, M.E., K.K. Huynh, and S. Grinstein. 2023. Endocytosis is regulated through the pHdependent phosphorylation of Rab GTPases by Parkinson’s kinase LRRK2. bioRxiv:2023.2002.2015.528749.

      Yu, T., D. Hou, J. Zhao, X. Lu, W.K. Greentree, Q. Zhao, M. Yang, D.G. Conde, M.E. Linder, and H. Lin. 2024. NLRP3 Cys126 palmitoylation by ZDHHC7 promotes inflammasome activation. Cell Rep. 43:114070.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We are grateful for these balanced, nuanced evaluations of our work concerning the observed epistatic trends and our interpretations of their mechanistic origins. Overall, we think the reviewers have done an excellent job at recognizing the novel aspects of our findings while also discussing the caveats associated with our interpretations of the biophysical effects of these mutations. We believe it is important to consider both of these aspects of our work in order to appreciate these advances and what sorts of pertinent questions remain.

      Notably, both reviewers are concerned that our lack of experimental approaches to compare the conformational properties of GnRHR variants weakens our claims. We would first humbly suggest that this constitutes a more general caveat that applies to nearly all investigations of the cellular misfolding of α-helical membrane proteins. Whether or not any current in vitro folding measurements report on conformational transitions that are relevant to cellular protein misfolding reactions remains an active area of debate (discussed further below). Nevertheless, while we concede that our structural and/ or computational evaluations of various mutagenic effects remain speculative, prevailing knowledge on the mechanisms of membrane protein folding suggest our mutations of interest (V276T and W107A) are highly unlikely to promote misfolding in precisely the same way. Thus, regardless of whether or not we were able experimentally compare the relevant folding energetics of GnRHR variants, we are confident that the distinct epistatic interactions formed by these mutations reflect variations in the misfolding mechanism and that they are distinct from the interactions that are observed in the context of stable proteins. In the following, we provide detailed considerations concerning these caveats in relation to the reviewers’ specific comments.

      Reviewer #1 (Public Review):

      The paper carries out an impressive and exhaustive non-sense mutagenesis using deep mutational scanning (DMS) of the gonadotropin-releasing hormone receptor for the WT protein and two single point mutations that I) influence TM insertion (V267T) and ii) influence protein stability (W107A), and then measures the effect of these mutants on correct plasma membrane expression (PME).

      Overall, most mutations decreased mGnRHR PME levels in all three backgrounds, indicating poor mutational tolerance under these conditions. The W107A variant wasn't really recoverable with low levels of plasma membrane localisation. For the V267T variant, most additional mutations were more deleterious than WT based on correct trafficking, indicating a synergistic effect. As one might expect, there was a higher degree of positive correlation between V267T/W107A mutants and other mutants located in TM regions, confirming that improper trafficking was a likely consequence of membrane protein co-translational folding. Nevertheless, context is important, as positive synergistic mutants in the V27T could be negative in the W107A background and vice versa. Taken together, this important study highlights the complexity of membrane protein folding in dissecting the mechanism-dependent impact of disease-causing mutations related to improper trafficking.

      Strengths

      This is a novel and exhaustive approach to dissecting how receptor mutations under different mutational backgrounds related to co-translational folding, could influence membrane protein trafficking.

      Weaknesses

      The premise for the study requires an in-depth understanding of how the single-point mutations analysed affect membrane protein folding, but the single-point mutants used seem to lack proper validation.

      Given our limited understanding of the structural properties of misfolded membrane proteins, it is unclear whether the relevant conformational effects of these mutations can be unambiguously validated using current biochemical and/ or biophysical folding assays. X-ray crystallography, cryo-EM, and NMR spectroscopy measurements have demonstrated that many purified GPCRs retain native-like structural ensembles within certain detergent micelles, bicelles, and/ or nanodiscs. However, helical membrane protein folding measurements typically require titration with denaturing detergents to promote the formation of a denatured state ensemble (DSE), which will invariably retain considerable secondary structure. Given that the solvation provided by mixed micelles is clearly distinct from that of native membranes, it remains unclear whether these DSEs represent a reasonable proxy for the misfolded conformations recognized by cellular quality control (QC, see https://doi.org/10.1021/acs.chemrev.8b00532). Thus, the use and interpretation of these systems for such purposes remains contentious in the membrane protein folding community. In addition to this theoretical issue, we are unaware of any instances in which GPCRs have been found to undergo reversible denaturation in vitro- a practical requirement for equilibrium folding measurements (https://doi.org/10.1146/annurev-biophys-051013-022926). We note that, while the resistance of GPCRs to aggregation, proteolysis, and/ or mechanical unfolding have also been probed in micelles, it is again unclear whether the associated thermal, kinetic, and/ or mechanical stability should necessarily correspond to their resistance to cotranslational and/ or posttranslational misfolding. Thus, even if we had attempted to validate the computational folding predictions employed herein, we suspect that any resulting correlations with cellular expression may have justifiably been viewed by many as circumstantial. Simply put, we know very little about the non-native conformations are generally involved in the cellular misfolding of α-helical membrane proteins, much less how to measure their relative abundance. From a philosophical standpoint, we prefer to let cells tell us what sorts of broken protein variants are degraded by their QC systems, then do our best to surmise what this tells us about the relevant properties of cellular DSEs.

      Despite this fundamental caveat, we believe that the chosen mutations and our interpretation of their relevant conformational effects are reasonably well-informed by current modeling tools and by prevailing knowledge on the physicochemical drivers of membrane protein folding and misfolding. Specifically, the mechanistic constraints of translocon-mediated membrane integration provide an understanding of the types of mutations that are likely to disrupt cotranslational folding. Though we are still learning about the protein complexes that mediate membrane translocation (https://doi.org/10.1038/s41586-022-05336-2), it is known that this underlying process is fundamentally driven by the membrane depth-dependent amino acid transfer free energies (https://doi.org/10.1146/annurev.biophys.37.032807.125904). This energetic consideration suggests introducing polar side chains near the center of a nascent TMDs should almost invariably reduce the efficiency of topogenesis. To confirm this in the context of TMD6 specifically, we utilized a well-established biochemical reporter system to confirm that V276T attenuates its translocon-mediated membrane integration (Fig. S1)- at least in the context of a chimeric protein. We also constructed a glycosylation-based topology reporter for full-length GnRHR, but ultimately found its’ in vitro expression to be insufficient to detect changes in the nascent topological ensemble.

      In contrast to V276T, the W107A mutation is predicted to preserve the native topological energetics of GnRHR due to its position within a soluble loop region. W107A is also unlike V276T in that it clearly disrupts tertiary interactions that stabilize the native structure. This mutation should preclude the formation of a structurally conserved hydrogen bonding network that has been observed in the context of at least 25 native GPCR structures (https://doi.org/10.7554/eLife.5489). However, without a relevant folding assay, the extent to which this network stabilizes the native GnRHR fold in cellular membranes remains unclear. Overall, we admit that these limitations have prevented us from measuring how much V276T alters the efficiency of GnRHR topogenesis, how much the W107A destabilizes the native fold, or vice versa. Nevertheless, given these design principles and the fact that both reduce the plasma membrane expression of GnRHR, as expected, we are highly confident that the structural defects generated by these mutations do, in fact, promote misfolding in their own ways. We also concede that the degree to which these mutagenic perturbations are indeed selective for specific folding processes is somewhat uncertain. However, it seems exceedingly unlikely that these mutations should disrupt topogenesis and/ or the folding of the native topomer to the exact same extent. From our perspective, this is the most important consideration with respect to the validity of the conclusions we have made in this manuscript.

      Furthermore, plasma membrane expression has been used as a proxy for incorrect membrane protein folding, but this not necessarily be the case, as even correctly folded membrane proteins may not be trafficked correctly, at least, under heterologous expression conditions. In addition, mutations can affect trafficking and potential post-translational modifications, like glycosylation.

      While the reviewer is correct that the sorting of folded proteins within the secretory pathway is generally inefficient, it is also true that the maturation of nascent proteins within the ER generally bottlenecks the plasma membrane expression of most α-helical membrane proteins. Our group and several others have demonstrated that the efficiency of ER export generally appears to scale with the propensity of membrane proteins to achieve their correct topology and/ or to achieve their native fold (see https://doi.org/10.1021/jacs.5b03743 and https://doi.org/10.1021/jacs.8b08243). Notably, these investigations all involved proteins that contain native glycosylation and various other post-translational modification sites. While we cannot rule out that certain specific combinations of mutations may alter expression through their perturbation of post-translational GnRHR modifications, we feel confident that the general trends we have observed across hundreds of variants predominantly reflect changes in folding and cellular QC. This interpretation is supported by the relationship between observed trends in variant expression and Rosetta-based stability calculations, which we identified using unbiased unsupervised machine learning approaches (compare Figs. 6B & 6D).

      Reviewer #2 (Public Review):

      Summary:

      In this paper, Chamness and colleagues make a pioneering effort to map epistatic interactions among mutations in a membrane protein. They introduce thousands of mutations to the mouse GnRH Receptor (GnRHR), either under wild-type background or two mutant backgrounds, representing mutations that destabilize GnRHR by distinct mechanisms. The first mutant background is W107A, destabilizing the tertiary fold, and the second, V276T, perturbing the efficiency of cotranslational insertion of TM6 to the membrane, which is essential for proper folding. They then measure the surface expression of these three mutant libraries, using it as a proxy for protein stability, since misfolded proteins do not typically make it to the plasma membrane. The resulting dataset is then used to shed light on how diverse mutations interact epistatically with the two genetic background mutations. Their main conclusion is that epistatic interactions vary depending on the degree of destabilization and the mechanism through which they perturb the protein. The mutation V276T forms primarily negative (aggravating) epistatic interactions with many mutations, as is common to destabilizing mutations in soluble proteins. Surprisingly, W107A forms many positive (alleviating) epistatic interactions with other mutations. They further show that the locations of secondary mutations correlate with the types of epistatic interactions they form with the above two mutants.

      Strengths:

      Such a high throughput study for epistasis in membrane proteins is pioneering, and the results are indeed illuminating. Examples of interesting findings are that: (1) No single mutation can dramatically rescue the destabilization introduced by W107A. (2) Epistasis with a secondary mutation is strongly influenced by the degree of destabilization introduced by the primary mutation. (3) Misfolding caused by mis-insertion tends to be aggravated by further mutations. The discussion of how protein folding energetics affects epistasis (Fig. 7) makes a lot of sense and lays out an interesting biophysical framework for the findings.

      Weaknesses:

      The major weakness comes from the potential limitations in the measurements of surface expression of severely misfolded mutants. This point is discussed quite fairly in the paper, in statements like "the W107A variant already exhibits marginal surface immunostaining" and many others. It seems that only about 5% of the W107A makes it to the plasma membrane compared to wild-type (Figures 2 and 3). This might be a low starting point from which to accurately measure the effects of secondary mutations.

      The reviewer raises an excellent point that we considered at length during the analysis of these data and the preparation of the manuscript. Though we remain confident in the integrity of these measurements and the corresponding analyses, we now realize this aspect of the data required further discussion and documentation which we have provided in the revised version of the manuscript as is described in the following.

      Still, the authors claim that measurements of W107A double mutants "still contain cellular subpopulations with surface immunostaining intensities that are well above or below that of the W107A single mutant, which suggests that this fluorescence signal is sensitive enough to detect subtle differences in the PME of these variants". I was not entirely convinced that this was true.

      We made this statement based on the simple observation that the surface immunostaining intensities across the population of recombinant cells expressing the library of W107A double mutants was consistently broader than that of recombinant cells expressing W107A GnRHR alone (see Author response image 1 for reference). Given that the recombinant cellular library represents a mix of cells expressing ~1600 individual variants that are each present at low abundance, the pronounced tails within this distribution presumably represent the composite staining of many small cellular subpopulations that express collections of variants that deviate from the expression of W107A to an extent that is significant enough to be visible on a log intensity plot.

      Author response image 1.

      Firstly, I think it would be important to test how much noise these measurements have and how much surface immunostaining the W107A mutant displays above the background of cells that do not express the protein at all.

      For reference, the average surface immunostaining intensity of HEK293T cells transiently expressing W107A GnRHR was 2.2-fold higher than that of the IRES-eGFP negative, untransfected cells within the same sample- the WT immunostaining intensity was 9.5-fold over background by comparison. Similarly, recombinant HEK293T cells expressing the W107A double mutant library had an average surface immunostaining intensity that was 2.6-fold over background across the two DMS trials. Thus, while the surface immunostaining of this variant is certainly diminished, we were still able to reliably detect W107A at the plasma membrane even under distinct expression regimes. We have included these and other signal-to-noise metrics for each experiment in the Results section of the revised manuscript.

      Beyond considerations related to intensity, we also previously noticed the relative intensity values for W107A double mutants exhibited considerable precision across our two biological replicates. If signal were too poor to detect changes in variant expression, we would have expected a plot of the intensity values across these two replicates to form a scatter. Instead, we found DMS intensity values for individual variants to be highly correlated from one replicate to the next (Pearson’s R2 = 0.95, see Author response image 2 for reference). This observation empirically demonstrates that this assay consistently differentiated between variants that exhibit slightly enhanced immunostaining from those that have even lower immunostaining than W107A GnRHR. We have included these discussion points in the Results section as well as scatter plots for replicate variant intensities within all three genetic backgrounds in Figure S3 of the revised manuscript.

      Author response image 2.

      But more importantly, it is not clear if under this regimen surface expression still reports on stability/protein fitness. It is unknown if the W107A retains any function or folding at all. For example, it is possible that the low amount of surface protein represents misfolded receptors that escaped the ER quality control.

      While we believe that such questions are outside the scope of this work, we certainly agree that it is entirely possible that some of these variants bypass QC without achieving their native fold. This topic is quite interesting to us but is quite challenging to assess in the context of GPCRs, which have complex fitness landscapes that involve their propensity to distinguish between different ligands, engage specific components associated with divergent downstream signaling pathways, and navigate between endocytic recycling/ degradation pathways following activation. In light of the inherent complexity of GPCR function, we humbly suggest our choice of a relatively simple property of an otherwise complex protein may be viewed as a virtue rather than a shortcoming. Protein fitness is typically cast as the product of abundance and activity. Rather than measuring an oversimplified, composite fitness metric, we focused on one variable (plasma membrane expression) and its dominant effector (folding). We believe restraining the scope in this manner was key for the elucidation of clear mechanistic insights.

      The differential clustering of epistatic mutations (Fig. 6) provides some interesting insights as to the rules that dictate epistasis, but these too are dominated by the magnitude of destabilization caused by one of the mutations. In this case, the secondary mutations that had the most interesting epistasis were exceedingly destabilizing. With this in mind, it is hard to interpret the results that emerge regarding the epistatic interactions of W107A. Furthermore, the most significant positive epistasis is observed when W107A is combined with additional mutations that almost completely abolish surface expression. It is likely that either mutation destabilizes the protein beyond repair. Therefore, what we can learn from the fact that such mutations have positive epistasis is not clear to me. Based on this, I am not sure that another mutation that disrupts the tertiary folding more mildly would not yield different results. With that said, I believe that the results regarding the epistasis of V276T with other mutations are strong and very interesting on their own.

      We agree with the reviewer. In light of our results we believe it is virtually certain that the secondary mutations characterized herein would be likely to form distinct epistatic interactions with mutations that are only mildly destabilizing. Indeed, this insight reflects one of the key takeaway messages from this work- stability-mediated epistasis is difficult to generalize because it should depend on the extent to which each mutation changes the stability (ΔΔG) as well as initial stability of the WT/ reference sequence (ΔG, see Figure 7). Frankly, we are not so sure we would have pieced this together as clearly had we not had the fortune (or misfortune?) of including such a destructive mutation like W107A as a point of reference.

      Additionally, the study draws general conclusions from the characterization of only two mutations, W107A and V276T. At this point, it is hard to know if other mutations that perturb insertion or tertiary folding would behave similarly. This should be emphasized in the text.

      We agree. Our findings suggest different mutations may not behave similarly, which we believe is a key finding of this work. We have emphasized this point in the Discussion section of the revised manuscript as follows:

      “These findings suggest the folding-mediated epistasis is likely to vary among different classes of destabilizing mutations in a manner that should also depend on folding efficiency and/ or the mechanism(s) of misfolding in the cell.”

      Some statistical aspects of the study could be improved:

      (1) It would be nice to see the level of reproducibility of the biological replicates in a plot, such as scatter or similar, with correlation values that give a sense of the noise level of the measurements. This should be done before filtering out the inconsistent data.

      We thank the reviewer for this suggestion and will include scatters for each genetic background like the one shown above in Figure S3 of the revised version of the manuscript.

      (2) The statements "Variants bearing mutations within the C- terminal region (ICL3-TMD6-ECL3-TMD7) fare consistently worse in the V276T background relative to WT (Fig. 4 B & E)." and "In contrast, mutations that are 210 better tolerated in the context of W107A mGnRHR are located 211 throughout the structure but are particularly abundant among residues 212 in the middle of the primary structure that form TMD4, ICL2, and ECL2 213 (Fig. 4 C & F)." are both hard to judge. Inspecting Figures 4B and C does not immediately show these trends, and importantly, a solid statistical test is missing here. In Figures 4E and F the locations of the different loops and TMs are not indicated on the structure, making these statements hard to judge.

      We apologize for this oversight and thank the reviewer for pointing this out. We utilized paired Wilcoxon-Signed Rank Tests to evaluate the statistical significance of these observations and modified the description of these findings in the revised version of the results section as follows:

      “Variants bearing mutations within the C-terminal regions including ICL3, TMD6, and TMD7 fare consistently worse in the V276T background relative to WT (paired Wilcoxon-Signed Rank Test p-values of 0.0001, 0.02, and 0.005, respectively) (Fig. 4 B & E). Given that V276T perturbs the cotranslational membrane integration of TMD6 (Fig. S1, Table S1), this directional bias potentially suggests that the apparent interactions between these mutations manifest during the late stages of cotranslational folding. In contrast, mutations that are better tolerated in the context of W107A mGnRHR are located throughout the structure but are particularly abundant among residues in the middle of the primary structure that form ICL2, TMD4, and ECL2 (paired Wilcoxon-Signed Rank Test p-values of 0.0005, 0.0001, and 0.004, respectively) (Fig. 4 C & F).”

      (3) The following statement lacks a statistical test: "Notably, these 98 variants are enriched with TMD variants (65% TMD) relative to the overall set of 251 variants (45% TMD)." Is this enrichment significant? Further in the same paragraph, the claim that "In contrast to the sparse epistasis that is generally observed between mutations within soluble proteins, these findings suggest a relatively large proportion of random mutations form epistatic interactions in the context of unstable mGnRHR variants". Needs to be backed by relevant data and statistics, or at least a reference.

      We thank the reviewer for this reasonable suggestion. In the revised manuscript, we included the results of a paired Wilcoxon-Signed Rank Test that confirms the statistical significance of this observation and modified the Results section to reflect this as follows:

      “Notably, these 98 variants are enriched with TMD variants (65% TMD) relative to the overall set of 251 variants (45% TMD, Fisher’s Exact Test p = 0.0019). These findings suggest random mutations form epistatic interactions in the context of unstable mGnRHR variants in a manner that depends on the specific folding defect (V276T vs. W107A) and topological context.”

      Reviewer #1 (Recommendations for the Authors):

      As far as this reviewer is aware, the effect of the V267T variant on MP insertion has not been measured directly; its position corresponds to T277 in TMD6 of human GnRHR that has been measured for TM insertion, but given the clear lack of conservation (threonine vs valine) the mutation in TM6 could potentially have a different impact on the mouse homologue. Please clarify what the predicted delta TM for insertion is between human and mouse GnRHR is? Moreover, I would argue that single TM insertion by tethering to Lep is insufficient to understand MP insertion/folding, as neighbouring TM helices could help to drive TM6 insertion. Has ER microsome experiments for mouse GnRHR also been carried out in the context of neighbouring helices?

      We included measurements (and predictions) of the impact of the V276T substitution on the translocon-mediated membrane integration of the mouse TMD6 in the context of a chimeric Lep protein (see Fig. S1 & Table S1). Our results reveal that this substitution decreases the efficiency of TMD6 membrane integration by ~10%. Though imperfect, this prevailing biochemical assay remains popular for a variety of theoretical and technical reasons. Importantly, extensive experimental testing of this system has shown that these measurements report apparent equilibrium constants that are well-described by two-state equilibrium partitioning models (see DOIs 10.1038/nature03216 and 10.1038/nature06387). This observation provides a reasonable rationale to interpret these measurements using energetic models as we have in this work (see Table S1). From a technical perspective, the Lep system is also advantageous due to the fact that this protein is generally well expressed in the context of in vitro translation systems containing native membranes, which generally ensures a consistent signal to noise and dynamic range for membrane integration measurements. Nevertheless, the reviewers are correct that membrane integration efficiencies are likely distinct in the context of the native mGnRHR protein. For these reasons, we attempted to develop a glycosylation-based topology reporter prior to the posting and submission of this manuscript. However, all GnRHR reporters we tested were poorly expressed in vitro and the resulting 35S-labeled proteins only generated faint smears on our phosphorimaging screens that could not be interpreted. For these reasons, we chose to rely the Lep measurements for these investigations.

      The lack of a more relevant topological reporter is one of many challenges we faced in our investigations of this unstable, poorly behaved protein. We share the reviewer’s frustrations concerning the speculative aspects of this work. Nevertheless, there is increasing appreciation for the fact that our perspectives on protein biophysics have been skewed by our continuing choice to focus on the relatively small set of model proteins that are compatible with our favored methodologies (doi: 10.1016/j.tibs.2013.05.001). We humbly suggest this work represents an example of how we can gain a deeper understanding of the limits of biochemical systems when we instead choose to study the unsavory bits of cellular proteomes. But this choice requires a willingness to make some reasonable assumptions and to lean on energetic/ structural modeling from time to time. Despite this limitation, we believe there is still tremendous value in this compromise.

      What is the experimental evidence the W107A variant affects the protein structure? Has its melting temperature with and without inverse agonist binding for WT vs the W107A variant been measured, for example? Even heat-FSEC of detergent-solubilised membranes would be informative to know how unstable the W107A variant is. If is very unstable in detergent, then it could be that recovery mutants are going to be unlikely as you are already starting with a poor construct showing poor folding/localisation.

      We again understand the rationale for this concern, but do not believe that thermal melting measurements are likely to report the same sorts of conformational transitions involved in cellular misfolding. Heating up a protein to the point in which membranes (or micelles) are disrupted and the proteins begin to form insoluble aggregates is a distinct physical process from those that occur during co- and post-translational folding within intact ER membranes at physiological temperatures (discussed further in the Response to the Reviews). Indeed, as the reviewer points out below, there seems to be little evidence that secretion is linked to thermal stability or various other metrics that others have attempted to optimize for the sake of purification and/ or structural characterization. Thus, we believe it would be just as speculative to suggest thermal aggregation represents a relevant metric for the propensity of membrane proteins to fold in the cell. The physical interpretation of membrane protein misfolding reaction remains contentious in our field due to the key fact that the denatured states of helical membrane proteins remain highly structured in a manner that is hard to generalize beyond the fact that the denatured states retain α-helical secondary structure (doi: 10.1146/annurev-biophys-051013-022926). This is in stark contrast to soluble proteins, where random coil reference states have proven to be generally useful for energetic interpretations of protein stability. For reference, our lab is currently working to leverage epistatic measurements like this to map the prevailing physiological denatured states of an integral membrane protein. Our current findings suggest that non-native electrostatic interactions form in the context of misfolded states. We hope that more information on the structural aspects of these states will help us to develop and interpret meaningful folding measurements within the membrane.

      For reference, even in cases when quantitative folding measurements can be achieved, their relevance remains actively debated. As a point of reference, the corresponding author of this work previously worked on the stability and misfolding of another human α-helical membrane protein (PMP22). Like GnRHR, PMP22 is prone to misfolding in the secretory pathway and is associated with dozens of pathogenic mutations that cause protein misfolding. To understand how the thermodynamic stability of this protein is linked to secretion, the corresponding author purified PMP22, reconstituted it into n-Dodecyl-phosphocholine (DPC) micelles, and measured its resistance to denaturation by an anionic denaturing detergent (Lauryl Sarcosine, LS). The results were initially perplexing due to the fact that equilibrium unfolding curves manifested as an exponential decay (rather than a sigmoid) and relaxation kinetics appeared to be dominated by the rate constant for unfolding (doi: 10.1021/bi301635f). Unfortunately, these data could not be fit with existing folding models due to the lack of a folded protein baseline and the absence of a folding arm in the chevron plot. We eventually found that a full sigmoidal unfolding transition and refolding kinetics could be measured upon addition of 15% (v/v) glycerol. Our measurements revealed that the free energy of unfolding in DPC micelles was 0 kcal/ mol (without glycerol). This shocking lack of WT stability made it impossible to directly measure the effects of destabilizing mutations that enhance misfolding- you can’t measure the unfolding of a protein that is already unfolded. We ultimately had to instead infer the energetic effects of such mutations from the thermodynamic coupling between cofactor binding and folding (doi: 10.1021/jacs.5b03743). Finally, after demonstrating the resulting ΔΔGs correlated with both cellular trafficking and disease phenotype, we still faced justified scrutiny about the relevance of these measurements due to the fact that they were carried out in micelles. For these reasons, we do not feel that additional biophysical measurements will add much to this work until more is understood about the nature of misfolding reactions in the membrane and how to effectively recapitulate it in vitro. We also note that PMP22 is secreted with 20% efficiency in mammalian cell lines, which is 20-fold more efficient than human GnRHR under similar conditions (doi: 10.1016/j.celrep.2021.110046). Thus, we suspect equilibrium unfolding measurements are likely out of reach using previously described measurements.

      Our greatest evidence suggesting W107A destabilizes the protein has to do with the fact that it deletes a highly conserved structural contact and that this structural modification kills its secretion. The fact that this mutation clearly reduces the escape of GnRHR from ER quality control is a classic indicator of misfolding that represents the cell’s way of telling us that the mutation compromises the folding of the nascent protein in some way or another. Precisely how this mutation remodels the nascent conformational ensemble of nascent GnRHR and how this relates to the free energy difference between the native and non-native portions of its conformational ensemble under cellular conditions is a much more challenging question that lies beyond the scope of this investigation (and likely beyond the scope of what’s currently possible). Indeed, there is an entire field dedicated to understanding such. Nevertheless, the difference in the epistatic interactions formed by W107A and V276T is at the very least consistent with our speculative interpretation that these two mutations vary in their misfolding mechanism and/ or in the extent to which they destabilize the protein. For these reasons, we feel the main conclusions of this manuscript are well-justified.

      Please clarify if the protein is glycosylated or not and, if it is, how would this requirement affect the conclusions of your analysis?

      As we noted in the Response to the Reviewers, which also constitutes a published portion of the final manuscript, this protein is indeed glycosylated. We were well aware of this aspect of the protein since inception of this project and do not think this changes our interpretation at all. Most membrane proteins are glycosylated, and several groups have demonstrated in various ways that the secretion efficiency of glycoproteins is proportional to certain stability metrics for secreted soluble proteins and membrane proteins alike. Generally, mutations that enhance misfolding do not change the propensity of the nascent chain to undergo N-linked glycosylation, which occurs during translation before protein synthesis and/ or folding is complete. Misfolded proteins typically carry lower weight glycans, which reflects their failure to advance from the ER to the Golgi, where N-linked glycans are modified and O-linked glycans are added. From our perspective, glycosyl modifications just ensure that nascent proteins are engaged by calnexin and other lectin chaperones involved in QC. It does not decouple folding from secretion efficiency. In the case of PMP22 (described above), we found that removal of its glycosylation site allows the nascent protein to bypass the lectin chaperones in a manner that enhances its plasma membrane expression eight-fold (doi: 10.1016/j.jbc.2021.100719). Similar to WT, the expression of several misfolded PMP22 variants also significantly increases upon removal of the glycosylation site. Nevertheless, their expression is still significantly lower than the un-glycosylated WT protein, and the expression patterns of the mutants relative to WT was quite similar across this panel of un-glycosylated proteins. Thus, while glycosylation certainly impacts secretion, it does not change its dependence on folding efficiency within the ER. There are many layers of partially redundant QC within the ER, and it seems that folding imposes a key bottleneck to secretion regardless of which QC proteins are involved. For these reasons, we do not think glycosylation (or other PTMs) should factor into our interpretation of these results.

      One caveat with the study is that there is a poor understanding of the factors that decide if the protein should be trafficked to the PM or not. Even secretory proteins not going through the calnexin/reticulum cycle (as they have no N-linked glycans), might still get stuck in the ER, despite the fact they are functional. Could this be a technical issue of heterologous expression overloading the Sec system?

      While we agree that there is much to be learned about this topic, we disagree with the notion that our understanding of folding and secretion is insufficient to generally interpret the molecular basis of the observed trends. In collaboration with various other groups, the corresponding author of this paper has shown for several other proteins that the stability of the native topology and the native tertiary structure can constrain secretion efficiency (see dois: 10.1021/jacs.8b08243, 10.1021/jacs.5b03743, and 10.1016/j.jbc.2021.100423). Moreover, the Balch and Kelly groups demonstrated many years ago that relatively simple models for the coupling between folding and chaperone binding can recapitulate the observed effects of mutations on the secretion efficiency of various proteins (doi: 10.1016/j.cell.2007.10.025). Given a wide body of prevailing knowledge in this area, we believe it is entirely reasonable to assume that the conformational effects of these mutation have a dominant effect on plasma membrane expression.

      Whether or not some of the proteins retained in the ER are folded and/ or functional is an interesting question, but is outside the scope of this work. Various lines of evidence concerning approaches to rescue misfolded membrane proteins suggest many of these variants are likely to retain residual function once they escape the ER, which may suggest there are pockets of foldable/ folded proteins within the ER. But it seems generally clear that the efficiency of folding in the ER bottlenecks secretion regardless of whether or not the ER contains some fraction of folded/ functional protein. We note that it is certainly possible, if not likely, that secretion efficiency is likely to be higher at lower expression levels (doi: 10.1074/jbc.AC120.014940). However, the mutational scanning platform used in this work was designed such that all variants are expressed from an identical promoter at the same location within the genome. Thus, for the purposes of these investigations, we believe it is entirely fair to draw “apples-to-apples” comparisons of their relative effects on plasma membrane expression.

      Please see Francis Arnold's paper on this point and their mutagenesis library of the channelrhodopsin (https://www.pnas.org/doi/10.1073/pnas.1700269114), which further found that 20% of mutations improved WT trafficking. Some general comparisons to this paper might be informative.

      We agree that it may be interesting to compare the results from this paper to those in our own. Indeed, we find that 20% of the point mutations characterized herein also enhance the expression of WT mGnRHR, as mentioned in the Results section. However, we think it might be a bit premature to suggest this is a more general trend in light of the fact that the channelrhodopsins engineered in those studies were not of eukaryotic origin and have likely resulted from distinct evolutionary constraints. We ultimately decided against adding more on this to our already lengthy discussion in order to maintain focus on the mechanisms of epistasis.

      Chris Tate and others have shown that there is a high frequency of finding stabilising point mutations in GPCRs and this is the premise of the StAR technology used to thermostabilise GPCRs in the presence of different ligands, i.e. agonist vs inverse agonists. As far as I am aware, there is a poor correlation between expression levels and thermostability (measured by ligand binding to detergent-solubilised membranes). As such, it is possible that some of the mutants might be more stable than WT even though they have lower levels of PME.

      We believe the disconnect between thermostability and expression precisely speaks to our main point about the suitability of current membrane protein folding assays for the questions we address herein. The degradative activity of ER quality control has not necessarily selected for proteins that are resistant to thermal degradation and/ or are suitable for macromolecular crystallography. For this reason, it is often not so difficult to engineer proteins with enhanced thermal stability. We do not believe this disconnect signals that quality control is insensitive to protein folding and stability, but rather that it is more likely to recognize conformational defects that are distinct from those involved in thermal degradation and/ or aggregation. Indeed, recent work from the Fluman group, which builds on a wider body of previous observations, has shown that the exposure of polar groups within the membrane is a key factor that recruits degradation machinery (doi: 0.1101/2023.12.12.571171). It is hard to imagine that these sorts of conformational defects are the same as those involved in thermal aggregation.

      Reviewer #2 (Recommendations For The Authors):

      (1) I believe that by focusing more on the epistasis with V276T, and less on W107A, the paper could be strengthened significantly.

      We appreciate this sentiment. But we believe the comparison of these two mutants really drive home the point that destabilizing mutations are not equivalent with respect to the epistatic interactions they form.

      (2) In the abstract - please define the term epistasis in a simple way, to make it accessible to a general audience. For example - negative epistasis means that... this should be explicitly explained.

      We thank the reviewer for this suggestion. To meet eLife formatting, we had to cut down the abstract significantly. We simplified this as best we could in the following statement:

      “Though protein stability is known to shape evolution, it is unclear how cotranslational folding constraints modulate the synergistic, epistatic interactions between mutations.”

      We also define positive and negative epistasis in the results section as follows:

      “Positive Ɛ values denote double mutants that have greater PME than would be expected based on the effects of single mutants. Negative Ɛ values denote double mutants that have lower PME than would be expected based on the effects of single mutants. Pairs of mutations with Ɛ values near zero have additive effects on PME.”

      (3) The title is quite complex and might deter readers from outside the protein evolution field. Consider simplifying it.

      We thank the reviewer for this suggestion. We have simplified the title to the following:

      “Divergent Folding-Mediated Epistasis Among Unstable Membrane Protein Variants”

      (4) The paper could benefit from a simple figure explaining the different stages of membrane protein folding (stages 1+2) to make it more accessible to readers from outside the membrane protein field.

      This is a great suggestion. We incorporated a new schematic in the revised manuscript that outlines the nature of these processes (see Fig. 1A in the revised manuscript).

      (5) For the FACS-Seq experiment - it was not clear to me if and when all cells are pulled together. For example - are the 3 libraries mixed together already at the point of transfection, or are the transfected cells pulled together at any point before sorting? This could have some implications on batch effects and should, therefore, be explicitly mentioned in the main text.

      We thank the reviewer for this suggestion. We modified the description of the DNA library assembly to emphasize that the mutations were generated in the context of three mixed plasmid pools, which were then transfected into the cells and sorted independently:

      “We then generated a mixed array of mutagenic oligonucleotides that collectively encode this series of substitutions (Table S3) and used nicking mutagenesis to introduce these mutations into the V276T, W107A, and WT mGnRHR cDNAs (Medina-Cucurella et al., 2019), which produced three mixed plasmid pools.”

      (6) The following description in the text is quite confusing. It would be better to simplify it considerably or remove it: "scores (Ɛ) were then determined by taking the log of the double mutant fitness value divided by the difference between the single mutant fitness values (see Methods)."

      We thank the reviewer for this valuable feedback and have simplified the text as follows:

      “To compare epistatic trends in these libraries, we calculated epistasis scores (Ɛ) for the interactions that these 251 mutations form with V276T and W107A by comparing their relative effects on PME of the WT, V276T, and W107A variants using a previously described epistasis model (product model, see Methods) (Olson et al. 2014).”

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      This work provides new mechanistic insights into the competitive inhibition in the mammalian P2X7 receptors using structural and functional approaches. The authors solved the structure of panda (pd) P2X7 in the presence of the classical competitive antagonists PPNDS and PPADS. They find that both drugs bind to the orthosteric site employed by the physiological agonist ATP. However, owing to the presence of a single phosphate group, they prevent movements in the flipper domain required for channel opening. The authors performed structure-based mutational analysis together with electrophysiological characterization to understand the subtype-specific binding of these drugs. It is known from previous studies that P2X1 and P2X3 are more sensitive to these drugs as compared to P2X7, hence, the residues adjacent to the ATP binding site in pdP2X7 were mutated to those present in P2X1. They observed that mutations of Q143, I214, and Q248 into lysine (hP2X1) increased the P2X7 sensitivity to PPNDS, whereas in P2X1, mutations of these lysines to alanine reduced sensitivity to PPNDS, suggesting that these key residues contribute to the subunit-specific sensitivity to these drugs. Similar experiments were done in hP2X3 to demonstrate its higher sensitivity to PPNDS. This preprint provides a useful framework for developing subtype-specific drugs for the family of P2X receptor channels, an area that is currently relatively unexplored.

      We appreciate the time and effort Reviewer #1 devoted to this review, and we have addressed the specific comments below.

      (1) Why was the crystallization construct of panda P2X7 used for structural studies instead of rat P2X7 with the cytoplasmic ballast which is a more complete receptor that is closely related to the human receptor? Can the authors provide a justification for this choice?

      We appreciate this comment. We did try to express the rat P2X7 receptor in its full-length form based on a previous report (Cell 2019, PMID: 31587896), but the expression of the receptor was not successful for an unknown reason. Instead, we employed a truncated construct of panda P2X7 based on the findings described another previous report (eLife 2016, PMID: 27935479). This truncated construct also possesses ATP-dependent channel activity (eLife 2016, PMID: 27935479). Thus, we understand that the full-length P2X7 construct would be preferable, particularly for addressing the function of the cytoplasmic domain; however, the main focus of this study was on PPNDS/PPNADS recognition and the associated structural changes in the ATP binding pocket, which we believe are less likely to be severely affected by truncation of the cytoplasmic domain. In support of this expectation, our mutational analyses are consistent with the structures in this study. Therefore, we believe that the use of the truncation construct in this study is justified.

      (2) Was there a good reason why hP2X1 and hP2X3 currents were recorded in perforated patches, whereas pdP2X7 currents were recorded using the whole-cell configuration? It seems that the extent of rundown is less of a problem with perforated patch recordings. Can the authors comment and perhaps provide a justification? It would also be good to present data for repeated applications of ATP alone using protocols similar to those for testing antagonists so the reader can better appreciate the extent of run down with different recording configurations for the different receptors.

      We thank the reviewer for bringing up this point. The whole-cell configuration is the most commonly used method in patch-clamp experiments; therefore, we used this method to record the current of pdP2X7 (Author response image 1). However, the whole-cell configuration is not suitable for all experiments; for example, the currents of P2X1 and P2X3 recorded by this method show a severe "rundown" effect. The "rundown" effect prevents accurate calculation of the inhibition rate of the antagonist, and to obtain more accurate results, we used perforated patches to record the currents of hP2X1 and hP2X3.

      Author response image 1.

      Representative current traces of pdP2X7, hP2X3, and hP2X1 after repeated applications of ATP. The pdP2X7 currents were recorded using the whole-cell configuration, and the hP2X1 and hP2X3 currents were recorded using perforated patches.

      (3) The data in Fig. S1, panel A shows multiple examples where the currents activated by ATP after removal of the antagonist are considerably smaller than the initial ATP application. Is this due to rundown or incomplete antagonist unbinding? It is interesting that this wasn't observed with hP2X1 and hP2X3 even though they have a higher affinity for the antagonist. Showing examples of rundown without antagonist application would help to distinguish these distinct phenomena and it would be good for the authors to comment on this in the text. It is also curious why a previous study on pdP2X7 did not seem to have problems with rundown (see Karasawa and Kawate. eLife, 2016).

      We thank the reviewer for bringing up this point. We believe that this difference may be the result of incomplete antagonist unbinding. A similar phenomenon has been observed in previous studies of pdP2X7 (eLife 2016, PMID: 27935479). In the previous experiment, the currents activated by ATP after removal of the antagonist A740003 did not return to the initial value upon ATP application, whereas activation by ATP after removal of the antagonist GW791343 immediately restored the initial value upon ATP application (Fig. 1C of eLife 2016, PMID: 27935479). This may be because different inhibitors dissociate differently from pdP2X7. In our experiments, we assumed that PPNDS/PPADS was not completely dissociated from P2X7 even after 20 min of elution. The activation of P2X7 by ATP without antagonists showed no rundown effect (Author response image 1); therefore, we calculated the inhibition rate of the antagonist according to the precontrol.

      (4) The written presentation could be improved as there are many instances where the writing lacks clarity and the reader has to guess what the authors wish to communicate.

      To address this comment, we made changes to the text, particularly by following the

      Recommendations for The Authors

      Reviewer #1 (Recommendations For The Authors):

      (1) The way the manuscript is written could be greatly improved. There are many confusing sections where the reader has to guess what the authors wish to convey. For example, on page 9 "In addition, the mutation of Val173 to aspartate, as observed in pdP2X7, significantly decreased the sensitivity to PPNDS (Fig. 6B)." It appears from this sentence that Asp is present in P2X7, which is incorrect, please rephrase. There are many more examples of confusing sentences that need to be carefully edited to improve comprehension.

      To address this comment, we extensively modified the text to avoid this kind of misunderstanding. Please see the manuscript file with the track changes.

      (2) Please use either a 1-letter or 3-letter code for amino acid residues throughout the manuscript to maintain uniformity.

      We made this correction throughout the revised manuscript.

      (3) In Figure 1 on the right side, including the nearby density and side chains for interacting residues of PPNDS and PPADS would give more information and reliability for the density of the drugs.

      We appreciate this comment. The corresponding information is shown in Fig. S7.

      (4) Typo: Figure S1, E, and F panels - please correct the y-axis label to Inhibition.

      We corrected the typo in Fig. S1.

      (5) Please rewrite the legends for Fig. S3 and S5. They are confusing. The figure shows 3D classification using Relion, however, the legend suggests it was done using Cryosparc. Please clarify.

      We apologize for the confusion. Before applying C3 symmetry, all steps including 3D classification were performed in Relion 3.1. With C3 symmetry, we performed further refinement using Cryosparc v4.2.1 by non-uniform refinement. We have corrected the figure legends accordingly.

      (6) For Fig. S3 and S5 increase the resolution and size of representative micrographs, and also please provide scale bars.

      We have corrected Figures S3 and S5 accordingly.

      (7) Please add the 3D classification protocol performed in Relion/Cryosparc in the methods section as well.

      We added the corresponding description to the revised manuscript (Lines 9-14, Page 16).

      (8) In Table S1, under the initial model the authors state 'this study' when they should report the use of 5U1L according to the methods section.

      We corrected Table S1 in accordance with this comment.

      (9) The authors should consider combining the raw data shown in Figure S1 in Figure 6 as it provides stronger support for the conclusions than the bar graphs shown in Figure 6B.

      We appreciate the comment and fully understand the intention of Reviewer #1. Nevertheless, we would like to keep Figure S1, since it was also mentioned earlier together with Figure 1. In addition, if we combine Figure S1 with Figure 6, the result would be too large to present as a single figure.

      (10) In Figure 6A, please provide colored labels for both P2X7 and P2X1 to aid comprehension of the structural models.

      Based on this comment, we corrected the labels in Figure 6.

      (11) In the discussion, the authors write about comparisons with the docking study by Huo et al. JBC, 2018. Can they show the superimposition of their EM model with the previous studies' docking model in a supplementary figure for more clarity?

      We appreciate the constructive comments. However, unfortunately, the docking model in the previous study (JBC 2018, PMID: 29997254) is not available, so it is not possible to show the superimposition.

      Reviewer #2 (Public Review):

      Summary:

      P2X receptors play pivotal roles in physiological processes such as neurotransmission and inflammation, making them promising drug targets. This study, through cryo-EM and functional experiments, reveals the structural basis of the competitive inhibition of the PPNDS and PPADS on mammalian P2X7 receptors. Key findings include the identification of the orthosteric site for these antagonists, the revelation of how PPADS/PPNDS binding impedes channel-activating conformational changes, and the pinpointing of specific residues in P2X1 and P2X3 subtypes that determine their heightened sensitivity to these antagonists. These insights present a comprehensive understanding that could guide the development of improved drugs targeting P2X receptors. This work will be a valuable addition to the field.

      Strengths and weaknesses:

      The combination of structural experiments and mutagenesis analyses offers a deeper understanding of the mechanism. While the inclusion of MD simulation is appreciated, providing more insights from the simulation might further strengthen this already compelling story.”

      We appreciate the time and effort Reviewer #2 devoted to this review, and we have addressed the specific comments below.

      Reviewer #2 (Recommendations For The Authors):

      (1) On page 3, the sentence "ATP analogs are the most competitive inhibitors of P2X receptors but are typically unsuitable due to a lack of high specificity in vivo," might need additional context. Could the authors clarify if they are referring to the unsuitability of ATP analogs for medical applications?

      To address this comment, we have rewritten the sentence as follows (Lines 13-16, Page 3):

      ATP analogs are most common among competitive inhibitors for P2X receptors; however, they are generally unsuitable for in vivo applications due to their relatively low specificity, which may result in off-target toxicity. This issue arises because the human body contains numerous ATP-binding proteins.

      (2) Fig. S1. I am curious why, for P2X7, the ATP-only current after removal of PPNDS/PPADS does not recover and become larger than the current in the presence of PPNDS/PPADS? Such behavior was not as pronounced in P2X1. Does that suggest PPNDS/PPADS might remain bound and can not be removed when the P2X7 channel is closed?

      We thank the reviewer for bringing up this point. We believe that this difference may be the result of incomplete antagonist unbinding. A similar phenomenon has been observed in previous studies of pdP2X7 (eLife 2016, PMID: 27935479). In the previous experiment, the currents activated by ATP after removal of the antagonist A740003 did not return to the initial value upon ATP application, whereas activation by ATP after removal of the antagonist GW791343 immediately restored the initial value upon ATP application (Fig. 1C of eLife 2016, PMID: 27935479). We strongly agree with the reviewer that this may be due to the difficulty of dissociating the antagonist from pdP2X7.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We would like to thank both Editors and reviewers for their valuable time, careful reading, and constructive comments. The comments have been highly valuable and useful for improving the quality of our study, as well as important in guiding the direction of our present and future research. In the revised manuscript, we have incorporated the necessary changes including additional experimental data as suggested; please find our detailed pointby-point response to the reviewer’s comments and the changes we have made in the manuscript as follows.

      Reviewer #1 (Public Review):

      In this work, the authors have explored how treating C. albicans fungal cells with EDTA affects their growth and virulence potential. They then explore the use of EDTA-treated yeast as a whole-cell vaccine in a mouse model of systemic infection. In general, the results of the paper are unsurprising. Treating yeast cells with EDTA affects their growth and the addition of metals rescues the phenotype. Because of the significant growth defects of the cells, they don't infect mice and you see reduced virulence. Injection with these cells effectively immunises the mice, in the same way that heatkilled yeast cells would. The data is fairly sound and mostly well-presented, and the paper is easy to follow. However, I feel the data is an incremental advance at best, and the immune analysis in the paper is very basic and descriptive.

      Strengths:

      Detailed analysis of EDTA-treated yeast cells

      Weaknesses:

      • Basic immune data with little advance in knowledge.

      • No comparison between their whole-cell vaccine and others tried in the field.

      • The data is largely unsurprising and not novel.

      Reply: Thank you so much for appreciating our effort to generate a whole cell anti-fungal vaccine by treating C. albicans cells with EDTA. Also, we appreciate your comment that the manuscript is sound and well-presented. However, we are afraid that the respected reviewer assumed the CAET cells as dead cells while they only divide relatively slower than the untreated cells. In the revised manuscript, we have presented additional evidence to show that CAET are live cells (Supp. Figs 2) and based on the new data, we expect a positive change in the reviewer’s opinion. Since CAET is a live strain, the data presented here is novel.

      Reviewer #2 (Public Review):

      Summary:

      Invasive fungal infections are very difficult to treat with limited drug options. With the increasing concern of drug resistance, developing an antifungal vaccine is a high priority. In this study, the authors studied the metal metabolism in Candida albicans by testing some chelators, including EDTA, to block the metal acquisition and metabolism by the fungus. Interestingly, they found EDTAtreated yeast cells grew poorly in vitro and non-pathogenic in vivo in a murine model. Mice immunized by EDTA-treated Candida (CAET) were protected against challenge with wild-type Candida cells. RNA-Seq analysis to survey the gene expression profile in response to EDTA treatment in vitro revealed upregulation of genes in metal homeostasis and downregulation of ribosome biogenesis. They also revealed an induction of both pro- and anti-inflammatory cytokines involved in Th1, Th2 and Th17 host immune response in response to CAET immunization. Overall, this is an interesting study with translational potential.

      Strengths:

      The main strength of the report is that the authors identified a potential whole-cell live vaccine strain that can provide full protection against candidiasis. Abundant data both on in vitro phenotype, gene expression profile, and host immune response have been presented.

      Weaknesses:

      A weakness is that the immune mechanism of CAET-mediated host protection remains unclear. The immune data is somewhat confusing. The authors only checked cytokines and chemokines in blood. The immune response in infected tissues and antibody response may be investigated.

      Reply: Thank you very much for appreciating our work and finding our strain to be a live whole-cell anti-fungal vaccine strain with translational potential. Since the current study focused on the identification and detailed characterizations of a non-genetically modified live-attenuated strain and determination of its safety and efficacy as a potential vaccine candidate in the preclinical model, we have excluded the possible immune mechanisms involving CAET. In a separate study, we are currently investigating both cellular and molecular mechanisms that provide protective immunity in CAET-vaccinated mice.

      Reviewer #3 (Public Review):

      Summary:

      The authors are trying to find a vaccine solution for invasive candidiasis.

      Strengths:

      The testing of the antifungal activity of EDTA on Candida is not new as many other papers have examined this effect. The novelty here is the use of this EDTA-treated strain as a vaccine to protect against a secondary challenge with wild-type Candida.

      Weaknesses:

      However, data presented in Figure 5 and Figure 6 are not convincing and need further experimental controls and analysis as the authors do not show a time-dependent effect on the CFU of their vaccine formulation. The methodology used is also an issue. As it stands, the impact is minor.

      Reply: Thank you so much for appreciating our efforts to develop a novel vaccine against fungal infections. We are extremely sorry for the lack of clarity in our writing related to Figs. 5 and 6, we have now modified the text and hope that the respected reviewer will find these convincing.

      Recommendations for the authors:

      Although the reviewers recognize the importance of the manuscript, they would like to see: 1) comparisons between their whole-cell vaccine and others tried in the field, 2) an investigation of the immune response in infected tissues and antibody response, and 3) more controls in Figures 5 and 6, and a time-dependent effect on the colony-forming units of their vaccine formulation. Please, address the questions and submit a revised version together with a rebuttal letter addressing point-by-point raised by each reviewer.

      Reply: (1) We are afraid that a comparative study of a live and heat-killed cell vaccines will mislead the information presented here. This is the only non-genetically modified antifungal vaccine candidate therefore a comparison with a dead strain at present is unwarranted. We have now added supporting data to confirm that, the survivability of C. albicans cells was unaffected at 6 hr of EDTA treatment (CAET, Supp. Fig. S2). (2) Since the current study focused on the identification and a detailed characterization of a non-genetically modified live attenuated strain and its safety and efficacy as a potential vaccine candidate in the preclinical model, we have excluded the possible immune mechanisms involving CAET. However, in a separate study, we are currently investigating both cellular and molecular mechanisms that provide protective immunity in CAET-vaccinated mice. (3) The results of Figs 5 and 6 were misinterpreted by the respected reviewer, please see the explanation below.

      Reviewer #1 (Recommendations For The Authors):

      Some specific comments/suggestions for the authors: (1) What was the viability of the yeast after EDTA treatment? Is the delayed growth response because many cells died and it takes a while for remaining viable cells to catch up? This is important to know because it may mean the dose given to mice is substantially different and that should be accounted for. Some PI staining of the cells after treatment would help.

      Reply: The growth curve assays (Fig. 1A and 1E) were initiated with O.D.600nm=0.5 of each cultures (~ 107 cells/mL) and the analyses suggested that the EDTA-treated C. albicans cells grew slower than the untreated cells. Fig. 1B and 1F further demonstrated that EDTA has minimal effect on the survival of the strain up to 8 hrs post-exposure. The proportion of the number of cells increased without and with metal chelators almost remained the same for this duration (0 – 8 hrs). Therefore, for subsequent analyses, 6 hr treatment was selected and such treated cells were considered as CAET, which were actively dividing live cells, albeit slower than untreated cells. As suggested and to strengthen our finding, a time dependent SYTOX Green and Propidium iodide staining of C. albicans cells without and with EDTA treatment was carried out and analysed by flow cytometry and microscopy, respectively. Both analyses revealed that the percentage of dead cells up to 12 hrs of without and with EDTA treatment remained the same. The new data has now been added in the revised version of the manuscript as Supplementary figure 2.

      Author response image 1.

      (2) In line with the above, what was the viability of the CAET cells after 3h in media? In the macrophage in vitro experiments, how do you know the reduced viability of the CAET cells is macrophage-specific? Did you run a control of CAET cells in media on their own to determine how CFU changed in macrophage-free conditions? Is the proliferation rates of untreated and CAET cells different? That would affect CFSE labelling and results. These experiments would work better with a GFP-expressing C. albicans strain, which is widely available. In the images in Figure 4c, it looks like there are more hyphae in CAET than untreated - was hyphal induction checked/measured? That's important to know because more hyphae usually means more clumping and this can affect CFU counts (giving the impression of less CFU when actually there is more). Because of all the issues above, I'm not fully convinced by the uptake/killing data.

      Reply: As explained in response 1, we used actively dividing WT and CAET cells, and equal number of these cells were CFSE labelled. As can be seen in Fig.4A, the rate of phagocytosis was the same in 1 hr of pre-culture, but in the subsequent time points the double-positive cells were reduced in the case of CAET cells and that is due to fungal killing by macrophages. Fungal cells were released from the macrophages by warm water treatment and CFU was determined. Fig. 4B suggested that at 1hr of co-culture, the CFU of both fungal cells (WT and CAET) were the same and the fungal clearance was observed at later time points. Thus, the reduced viability of CAET cells was macrophagespecific. EDTA has minimal effect on hyphal transition without and with the presence of serum and the new data has now been provided in the revised version (Supplementary Fig. 3).

      Author response image 2.

      (3) Pooled data should be shown for all animal experiments.

      Reply: Thank you for the suggestion, wherever it was meaningful pooled data for the animal experiments have now been provided.

      (4) Immune cell counts/analysis in the kidney and bone marrow would be hugely helpful and more relevant to understanding immune responses following immunisation/infection. I think a more interesting analysis for the authors to consider would be to immunise with heat-killed yeast vs EDTAtreated yeast and see if there is a qualitative difference or better protection, i.e. is the EDTA-treated whole-cell vaccine superior to the heat-killed version? That is a better question to address. As it stands, the data in the paper is not surprising.

      Reply: The studies on cellular and molecular mechanisms underlying protective immunity in CAETvaccinated mice are under progress in a separate study. This study mostly focused on the identification and detailed characterization of a non-genetically modified live-attenuated strain and its safety and efficacy as a potential vaccine candidate in a preclinical model. We are afraid that a comparison of a live cell (CAET) with a dead cell (heat-killed) will dilute the content of the manuscript and will not be meaningful. It is well accepted that the heat-killed C. albicans strain only provides partial short-lived protection to re-challenge (Refs-PMIDs: 12146759, and 9916097), thus, it does not warrant any comparison with CAET.

      Reviewer #2 (Recommendations For The Authors):

      Overall, this is a highly interesting study. I have the following specific comments for clarification.

      (1) In the introduction, the authors mentioned other anti-candida vaccines that are mostly effective against Candida infection by inducing neutralizing antibodies. However, in their CAET vaccine candidate, they only checked the cellular immunity in blood and found a balanced immune response (both pro- and anti-inflammatory responses are induced). How about the antibody production in these mice? It is a bit surprising that both untreated Candida infection and CAET Candida infection produced similar immune activation based on Figure 6, yet the CAET immunization provides protection. Some innate cell recruitment is higher in untreated Ca infection than the CAET infected mice (Figure 5F). The overall results on immune response characterization did not seem to explain why the CAET infection led to host protection while untreated Ca infection cannot. Characterizing infected tissue immune cell differentiation and cytokine production may offer some additional insights.

      Reply: We agree with you that in this manuscript we have not provided any mechanistic study on the protective immunity in CAET-vaccinated mice. This will be demonstrated in a subsequent study.

      (2) In Figure 5, some critical data seem to be missing in panels B and C. The CFU and histopathological images for CAET-treated mice challenged by Ca should also be shown there for comparison. Although they did show some data in Figure 5E and Figure S4, it is necessary to have that data in 5B and 5C from the same experiment. Figure S4 is a very busy figure and the images are quite small. It may be necessary to use arrows to point out what information authors want to emphasize.

      Reply: Fig 5 B and 5C showed the data for mice that succumbed to infection. Since the other mice (saline control groups, CAET infected, CAET vaccinated, and re-challenged groups) survived, they were not sacrificed; therefore, the CFU data was not collected. In addition, we wanted to see the longevity of these survived mice and after 1 year of observations, they were handed over to the animal house for clearance as per the institutional guidelines. However, Figure 5E and Figure S4 (now Fig. S6) included all the mice groups as they were sacrificed at various time points irrespective of humane end points. As suggested FigS6 has now been modified and fungal cells were denoted by yellow arrows.

      (3) EDTA-treated yeast cells showed poor growth but also had thicker cell walls with high chitin, glucan, and mannan levels. What leads to its clearance in vivo remains unclear, as usually, cells with thick cell wall structures and low metabolism are more resistant to stress, e.g., dormant cells. Macrophages were shown to contribute to CAET killing in a phagocytosis assay (Figure 4). Checking cytokines produced by macrophages during co-incubation may offer some insights. In all, additional discussion on what caused in vivo clearance would be helpful.

      Reply: Mechanistic study on the protective immune responses of CAET will be demonstrated in a separate study. As suggested, the discussion section now contains additional information emphasising the in vivo clearance of CAET cells in the 3rd paragraph of discussion section.

      (4) Long paragraphs in the discussion section could be divided into a bigger number of shorter paragraphs.

      Reply: Thank you for the suggestion, it has now been modified in the revised version (7 short paragraphs). To make it more comprehensive, some of the content has been removed.

      Reviewer #3 (Recommendations For The Authors):

      (1) It is unclear how many cells were treated with 250 micromolar of EDTA for 6 hours before preparing the inoculum. It seems that only the OD was measured before adding EDTA. This is not a very rigorous and reproducible method.

      Reply: In this manuscript, we have repeatedly used the same protocol to generate CAET cells for various analyses. The O.D.600nm= 0.5 culture is equivalent to 107 C. albicans cells per mL and this information has now been added in the revised manuscript.

      (2) Upon treatment with 250 micromolar of EDTA, cells were harvested and counted to prepare the inoculum (5x10e5) for injecting it in mice. However, it appears that CFU of the inoculum was not done. Based on data shown in Fig. 1B, 250 micromolar of EDTA does inhibit Candida cell replication. Thus, the authors may have counted dead cells and, thus, injected dead cells together with live cells for the CAET inoculum. Thus, mice receiving this inoculum may have been infected (and vaccinated) with a lower number of live Candida cells.

      Reply: Please see a similar response to reviewer #1. EDTA has minimal effect on the survival of C. albicans cells at 6 hr (also see supp. Fig. S2). We have already mentioned the CFU analysis of untreated and CAET cells in the methodology section related to inoculum preparation.

      (3) It is unclear if 6 hours of treatment with 250 micromolar of EDTA is enough to induce a block of Candida cell replication. In Figure 1B, the authors treated for 24h. The authors are encouraged to wash the cells after 6 hours of treatment and see if their cell division will recover upon removal of EDTA.

      Reply: Thank you for the suggestion. At 6 hr treatment, survivability of C. albicans cells was unaffected upon EDTA exposure. PI and SYTOX GREEN staining confirmed it (Supp. Fig. 2). Additionally, as suggested a rescue experiment was carried out by exogenous addition of divalent metals after 6 hr EDTA treatment and growth/CFU analyses were followed thereafter. A modified Fig. 1 A and B with new data has been provided.

      (4) The data shown in Figure 5A is extremely exciting. However, the number of mice in each group (n=6) is too low. Normally, 10 mice per group are used for virulence studies unless the authors provide a power analysis that 6 mice per group will be sufficient. Also, CFU data were only provided for Ca and saline-Ca groups (Fig. 5B) and not for the other groups. CFU data should be provided for all mice.

      Reply: Thank you for the suggestion and a statistical analysis of Fig. 5A was provided in the revised version. The rationale behind not including all mice groups in Fig. 5B is already explained in a response to reviewer #2.

      (5) It is unclear how the authors differentiate between CFU arising from CAET or from WT Candida.

      Reply: Since the Fig 5 E demonstrated that no CAET cells were detected in the kidney beyond 10 days of inoculation, in the re-challenged mice group (1CAET 2 Ca), the fungal cells those detected in the 3rd and 7th days were from the later inoculated cells (brown colour).

      (6) Figure 5E: it is unclear if a 1 saline-2 saline (Figure legend) or if 1 saline-2 Ca (text) group was included. If the latter, where are the CFU? It is impossible that 1 saline-2 Ca mice have no CFU.

      Reply: Thank you so much for pointing this out. The legend has now been modified that include 1saline-2saline and 1CAET-2Ca.

      (7) It seems that CFU is significantly present in the kidney in the 1 CAET - 2 Ca group at day 7 but not at day 3. How is this possible? This is an extremely invasive model of infection, and the authors are challenging intravenously 500,000 live Candida cells. If by the 3rd day, the authors detect no CFU, then how is it possible that CFUs are arising on day 7?

      Reply: We do detect fungal cells on 3rd day in 1CAET 2 WT mice group (~2000 cells), albeit much lower than in 7 days (~11200 cells). A Log10 scale graph has now been provided for better representation.

      (8) Most importantly, if the authors are not detecting CFU at day 3, then earlier time points (e.g. day 2, day 1, or even 12 hours post-challenge) must be analyzed. The authors should show that CFU from the organs is decreasing in a time-dependent manner. Also, all CFU should be shown as Log10.

      Reply: please see the previous response.

      (9) Fig. 6: because it is unclear if the mice were challenged with the same inoculum of live Candida cells (untreated and treated with EDTA), the different cytokine profiles between the two groups could be simply due to the different inoculum sizes and not to the effect of EDTA on Ca.

      Reply: please see the previous response as given also for Reviewer 1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This work presents valuable information about the specificity and promiscuity of toxic effector and immunity protein pairs. The evidence supporting the claims of the authors is currently incomplete, as there is concern about the methodology used to analyze protein interactions, which did not take potential differences in expression levels, protein folding, and/or transient interaction into account. Other methods to measure the strength of interactions and structural predictions would improve the study. The work will be of interest to microbiologists and biochemists working with toxin-antitoxin and effector-immunity proteins.

      We thank the reviewers for considering this manuscript. We agree that this manuscript provides a valuable and cross-discipline introduction to new EI pair protein families where we focus on the EI pair’s flexibility and impacts on community structure. As such, we believe we have provided a solid foundation for future studies to examine non-cognate interactions and their possible effects on microbial communities. This, by definition, leaves some areas “incomplete” and, therefore, open for further investigations. While the methods we show do consider potential differences in binding assays, we have more explicitly addressed how “expression, protein folding, and/or transient binding” may play into this expanded EI pair model. We have also tempered the discussion of the proposed model, while also clearly highlighting other published evidence of non-cognate binding interactions between effector and immunity proteins. We have responded to the reviewers’ public comments (italicized below). 

      In this revised manuscript, we have updated the main text, particularly the Discussion section, to include more careful language, explain past research better, and add new references to works showing non-cognate immunity proteins protecting against effectors in other systems. We have also updated the supplemental files with more analyses; the relevant procedures are in the Materials and Methods.

      Public Reviews:

      Note: Reviewer 1, who appeared to focus on a subset of the manuscript rather than the whole, based their comments on several inaccuracies, which we discuss below. We found the tone in this reviewer's comments to be, at times, inappropriate, e.g., using "harsh" and "simply too drastic" to imply that common structure-function analyses were outside of the field-standard methods. We also note that the reviewer took a somewhat atypical step in reviewing this manuscript by running and analyzing the potential protein-complex data in AlphaFold2 but did not discuss areas of low confidence within that model that may contradict their conclusions. We are concerned their approach muddled valid scientific criticisms with problematic conclusions.

      Reviewer #1 (Public Review):

      In this manuscript, Knecht, Sirias et al describe toxin-immunity pair from Proteus mirabilis. Their observations suggest that the immunity protein could protect against non-cognate effectors from the same family. They analyze these proteins by dissecting them into domains and constructing chimeras which leads them to the conclusion that the immunity can be promiscuous and that the binding of immunity is insufficient for protective activity.

      Strengths:<br />  The manuscript is well written and the data are very well presented and could be potentially interesting. The phylogenetic analysis is well done, and provides some general insights.

      Weaknesses:<br /> (1) Conclusions are mostly supported by harsh deletions and double hybrid assays. The later assays might show binding, but this method is not resolutive enough to report the binding strength. Proteins could still bind, but the binding might be weaker, transient, and out-competed by the target binding.

      The phrasing of structure-function analyses as “harsh” is a bit unusual, as other research groups regularly use deletions and hybrid studies. Given the known caveats to deletion and domain substitutions, we included point-mutation analyses for both the effector and immunity proteins, as found on lines 105 - 113 and 255 - 261 in the current manuscript. These caveats are also why we coupled the in vitro binding analyses with in vivo protection experiments in two distinct experimental systems (E. coli and P. mirabilis). Based on this manuscript’s introductory analysis (where we define and characterize the genes, proteins, interactions, phylogenetics, and incidences in human microbiomes), the next apparent questions are beyond the scope of this study. Future approaches would include analyzing purified proteins from the effector (E) and immunity (I) protein families using biochemical assays, such as X-ray crystallography, circular dichroism spectroscopy, among others. 

      Interestingly, most papers in the EI field do not measure EI protein affinity (Jana et al., 2019, Yadav et al., 2021). Notable exceptions are earlier colicin research (Wallis et al., 1995) and a new T6SS EI paper (Bosch et al., 2023) published as we first submitted this manuscript.

      (2) While the authors have modeled the structure of toxin and immunity, the toxin-immunity complex model is missing. Such a model allows alternative, more realistic interpretation of the presented data. Firstly, the immunity protein is predicted to bind contributing to the surface all over the sequence, except the last two alpha helices (very high confidence model, iPTM>0.8). The N terminus described by the authors contributes one of the toxin-binding surfaces, but this is not the sole binding site. Most importantly, other parts of the immunity protein are predicted to interact closer to the active site (D-E-K residues). Thus, based on the AlphaFold model, the predicted mechanism of immunization remains physically blocking the active site. However, removing the N terminal part, which contributes large interaction surface will directly impact the binding strength. Hence, the toxin-immunity co-folding model suggests that proper binding of immunity, contributed by different parts of the protein, is required to stabilize the toxin-immunity complex and to achieve complete neutralization. Alternative mechanisms of neutralization might not be necessary in this case and are difficult to imagine for a DNase.

      In response to the reviewer’s comment, we again reviewed the RdnE-RdnI AlphaFold2 complex predictions with the most updated version of ColabFold (1.5.2-patch with PDB100 and MMseq2) and have included them at the end of these responses [1].

      However, the literature reports that computational predictions of E-I complexes often do not match experimental structural results (Hespanhol et al., 2022, Bosch et al., 2023). As such, we chose not to include the predicted cognate and non-cognate RdnE-I complexes from ColabFold (which uses AlphaFold2) and have not included this data in the revised manuscript. (It is notable that reviewer 1 found the proposed expanded model and research so interesting as to directly input and examine the AI-predicted RdnE-RdnI protein interactions in AlphaFold2.)

      Discussion of the prevailing toxin-immunity complex model is in the introduction (lines 45-48) and Figure 5E. Further, there are various known mechanisms for neutralizing nucleases and other T6SS effectors, which we briefly state in the discussion (lines 359 - 361). More in-depth, these molecular mechanisms include active-site blocking (Benz et al., 2012), allosteric-site binding (Kleanthous et al., 1999 and Lu et al., 2014), enzymatic neutralization of the target (Ting et al., 2021), and structural disruption of both the active and binding sites (Bosch et al., 2023). Given this diversity of mechanisms, we did not presume to speculate on the as-of-yet unknown mechanism of RdnI protection. We have expanded discussion of these items in the revised manuscript.

      (3) Dissection of a toxin into two domains is also not justified from a structural point of view, it is probably based on initial sequence analyses. The N terminus (actually previously reported as Pone domain in ref 21) is actually not a separate domain, but an integral part of the protein that is encased from both sides by the C terminal part. These parts might indeed evolve faster since they are located further from the active site and the central core of the protein. I am happy to see that the chimeric toxins are active, but regarding the conservation and neutralization, I am not surprised, that the central core of the protein fold is highly conserved. However, "deletion 2" is quite irrelevant - it deletes the central core of the protein, which is simply too drastic to draw any conclusions from such a construct - it will not fold into anything similar to an original protein, if it will fold properly at all.

      The reviewer’s comment highlights why we turned to the chimera proteins to dissect the regions of RdnE (formerly IdrD-CT), as the deletions could result in misfolded proteins. (We initially examined RdnE in the years before the launch of AlphaFold2.) However, the reviewer is incorrect regarding the N-terminus of RdnE. The PoNe domain, while also a subfamily of the PD-(D/E)XK superfamily, forms a distinct clade of effectors from the PD-(D/E)XK domain in RdnE (formally IdrD-CT) as seen in Hespanhol et al., 2022; this is true for other DNase effectors as well. Many studies analyzing effectors within the PD-(D/E)XK superfamily only focus on the PD-(D/E)XK domain, removing just this domain from the context of the whole protein (Hespanhol et al., 2022; Jana et al., 2019). Of note, in RdnE, this region alone (containing the DNA-binding domain) is insufficient for DNase activity (unlike in PoNe). We have clarified this distinction in the results section of the current manuscript, visible in figure 2 .

      (4) Regarding the "promiscuity" there is always a limit to how similar proteins are, hence when cross-neutralization is claimed authors should always provide sequence similarities. This similarity could also be further compared in terms of the predicted interaction surface between toxin and immunity.

      Reviewer 1 points out a fundamental property of protein-protein interactions that has been isolated away from the impacts of such interactions on bacterial community structure. We have provided the whole protein alignments in figure 3 supplemental figure 3, the summary images in Figure 3D, and the protein phylogenetic trees in Figure 3C. We encourage others to consider the protein alignments as percent amino acid sequence similarity is not necessarily a good gauge for protein function and interactions. These data are publicly available on the OSF website associated with this manuscript https://osf.io/scb7z/, and we hope the community explores the data there.

      In consideration of the enthusiasm to deeply dive into the primary research data, we have included the pairwise sequence identities across the entire proteins here: Proteus RdnI vs. Rothia RdnI: 23.6%; Proteus RdnI vs. Prevotella RdnI: 16.3%, Proteus RdnI vs. Pseudomonas RdnI: 14.6%; Rothia RdnI vs. Prevotella RdnI: 22.4%, Rothia RdnI vs. Pseudomonas RdnI: 17.6%; Prevotella RdnI vs. Pseudomonas RdnI: 19.5%. (As stated in response to reviewer 1 comment 2, we did not find it appropriate to make inferences based on AlphaFold2-predicted protein complexes.)

      Overall, it looks more like a regular toxin-immunity couple, where some cross-reactions with homologues are possible, depending on how far the sequences have deviated. Nevertheless, taking all of the above into account, these results do not challenge toxin-immunity specificity dogma.

      In this manuscript, we did not intend to dismiss the E-I specificity model but rather point out its limitations and propose an important expansion of that model that accounts for cross-protection and survival against attacks from other genera. We agree that it is commonly considered that deviations in amino acid sequence over time could result in cross-binding and protection (see lines 364-368). However, the impacts of such cross-binding on community structure, bacterial survival, and strain evolution were rarely addressed in prior literature, with exceptions such as in Zhang et al., 2013 and Bosch et al., 2023 among others. One key insight we propose and show in this manuscript is that cross-binding can be a fitness benefit in mixed communities; therefore, it could be selected for evolutionarily (lines 378-380), even potentially in host microbiomes.

      Reviewer #2 (Public Review):

      Summary:

      The manuscript by Knecht et al entitled "Non-cognate immunity proteins provide broader defenses against interbacterial effectors in microbial communities" aims at characterizing a new type VI secretion system (T6SS) effector immunity pair using genetic and biochemical studies primarily focused on Proteus mirabilis and metagenomic analysis of human-derived data focused on Rothia and Prevotella sequences. The authors provide evidence that RdnE and RdnI of Proteus constitute an E-I pair and that the effector likely degrades nucleic acids. Further, they provide evidence that expression of non-cognate immunity derived from diverse species can provide protection against RdnE intoxication. Overall, this general line of investigation is underdeveloped in the T6SS field and conceptually appropriate for a broad audience journal. The paper is well-written and, aside from a few cases, well-cited. As detailed below however, there are several aspects of this paper where the evidence provided is somewhat insufficient to support the claims. Further, there are now at least two examples in the literature of non-cognate immunity providing protection against intoxication, one of which is not cited here (Bosch et al PMID 37345922 - the other being Ting et al 2018). In general therefore I think that the motivating concept here in this paper of overturning the predominant model of interbacterial effector-immunity cognate interactions is oversold and should be dialed back.

      We agree that analyses focusing on flexible non-cognate interactions and protection are underdeveloped within the T6SS field and are not fully explored within a community structure. These ideas are rapidly growing in the field, as evidenced by the references provided by the reviewer. As stated earlier, we did not intend to overturn the prevailing model but rather have proposed an expanded model that accounts for protection against attacks from foreign genera.

      Strengths:

      One of the major strengths of this paper is the combination of diverse techniques including competition assays, biochemistry, and metagenomics surveys. The metagenomic analysis in particular has great potential for understanding T6SS biology in natural communities. Finally, it is clear that much new biology remains to be discovered in the realm of T6SS effectors and immunity.

      Weaknesses:

      The authors have not formally shown that RdnE is delivered by the T6SS. Is it the case that there are not available genetics tools for gene deletion for the BB2000 strain? If there are genetic tools available, standard assays to demonstrate T6SS-dependency would be to interrogate function via inactivation of the T6SS (e.g. by deleting tssC).

      Our research group showed that the T6SS secretes RdnE (previously IdrD) in Wenren et al., 2013 (cited in lines 71-73). We later confirmed T6SS-dependent secretion by LC-MS/MS (Saak et al., 2017).  

      For swarm cross-phyla competition assays (Figure 4), at what level compared to cognate immunity are the non-cognate immunity proteins being expressed? This is unclear from the methods and Figure 4 legend and should be elaborated upon. Presumably these non-cognate immunity proteins are being overexpressed. Expression level and effector-to-immunity protein stoichiometry likely matters for interpretation of function, both in vitro as well as in relevant settings in nature. It is important to assess if native expression levels of non-cognate cross-phyla immunity (e.g. Rothia and Prevotella) protect similarly as the endogenously produced cognate immunity. This experiment could be performed in several ways, for example by deleting the RdnE-I pair and complementing back the Rothia or Prevotella RdnI at the same chromosomal locus, then performing the swarm assay. Alternatively, if there are inducible expression systems available for Proteus, examination of protection under varying levels of immunity induction could be an alternate way to address this question. Western blot analysis comparing cognate to non-cognate immunity protein levels expressed in Proteus could also be important. If the authors were interested in deriving physical binding constants between E and various cognate and non-cognate I (e.g. through isothermal titration calorimetry) that would be a strong set of data to support the claims made. The co-IP data presented in supplemental Figure 6 are nice but are from E. coli cells overexpressing each protein and do not fully address the question of in vivo (in Proteus) native expression.

      P. mirabilis strain ATCC29906 does not encode the rdnE and rdnI genes on the chromosome (NCBI BioSample: SAMN00001486) (line 151). Production of the RdnI proteins, including the cognate Proteus RdnI, comes from equivalent transgenic expression vectors. Specifically, the rdnI genes were expressed under the flaA promoter in P. mirabilis strain ATCC29906 (Table 1) for the swarm competition assays found in Figure 2C and Figure 4. This promoter results in constitutive expression in swarming cells (Belas et al., 1991; Jansen et al., 2003). In the revised manuscript, figure 4 Supplement Figure 2 shows the relative RdnI protein levels in these strains; we also clarified the expression constructs in the text (see reviewer 3, comment 1).

      Lines 321-324, the authors infer differences between E and I in terms of read recruitment (greater abundance of I) to indicate the presence of orphan immunity genes in metagenomic samples (Figure 5A-D). It seems equally or perhaps more likely that there is substantial sequence divergence in E compared to the reference sequence. In fact, metagenomes analyzed were required only to have "half of the bases on reference E-I sequence receiving coverage". Variation in coverage again could reflect divergent sequence dipping below 90% identity cutoff. I recommend performing metagenomic assemblies on these samples to assess and curate the E-I sequences present in each sample and then recalculating coverage based on the exact inferred sequences from each sample.

      This comment raises the challenges with metagenomic analyses. It was difficult to balance specificity to a particular species’ DNA sequence with the prevalence of any homologous sequence in the sample. Given the distinction in binding interactions among the examined four species, we opted to prioritize specificity, accepting that we were losing access to some rdnE and rdnI sequences in that decision. We chose a 90% identity cutoff, which, through several in silica controls, ensured that each sequence we identified was the rdnE or rdnI gene from that specific species. For the Version of Record, we have included analysis with a 70% cutoff in the supplemental information to try to account for sequence divergence by lowering the identity cutoffs as suggested. The data from the 70% identity cutoff was consistent with the original data from the 90% identity cutoff.

      A description of gene-level read recruitment in the methods section relating to metagenomic analysis is lacking and should be provided.

      Noted. We included the raw code and sequences on the OSF website associated with this manuscript https://osf.io/scb7z/.

      Reviewer #3 (Public Review):

      Summary:<br /> The authors discovered that the RdnE effector possesses DNase activity, and in competition, P. mirabilis having RdnE outcompetes the null strain. Additionally, they presented evidence that the RdnI immunity protein binds to RdnE, suppressing its toxicity. Interestingly, the authors demonstrated that the RdnI homolog from a different phylum (i.e., Actinomycetota) provides cross-species protection against RdnE injected from P. mirabilis, despite the limited identity between the immunity sequences. Finally, using metagenomic data from human-associated microbiomes, the authors provided bioinformatic evidence that the rdnE/rdnI gene pair is widespread and present in individual microbiomes. Overall, the discovery of broad protection by non-cognate immunity is intriguing, although not necessarily surprising in retrospect, considering the prolonged period during which Earth was a microbial battlefield/paradise.

      Strengths:<br /> The authors presented a strong rationale in the manuscript and characterized the molecular mechanism of the RdnE effector both in vitro and in the heterologous expression model. The utilization of the bacterial two-hybrid system, along with the competition assays, to study the protective action of RdnI immunity is informative. Furthermore, the authors conducted bioinformatic analyses throughout the manuscript, examining the primary sequence, predicted structural, and metagenomic levels, which significantly underscore the significance and importance of the EI pair. 

      Weaknesses:<br /> (1) The interaction between RdnI and RdnE appears to be complex and requires further investigation. The manuscript's data does not conclusively explain how RdnI provides a "promiscuous" immunity function, particularly concerning the RdnI mutant/chimera derivatives. The lack of protection observed in these cases might be attributed to other factors, such as a decrease in protein expression levels or misfolding of the proteins. Additionally, the transient nature of the binding interaction could be insufficient to offer effective defenses.

      Yes, we agree with the reviewer and hope that grant reviewers’ share this colleague’s enthusiasm for understanding the detailed molecular mechanisms of RdnE-RdnI binding across genera. In the revised manuscript, we have continued to emphasize such caveats as the next frontier is clearly understanding the molecular mechanisms for RdnI cognate or non-cognate protection. In the revised manuscript, figure 4 Supplement Figure 2 shows the RdnI protein levels; we also clarified the expression constructs in the text (see reviewer 2, comment 2).

      (2) The results from the mixed population competition lack quantitative analysis. The swarm competition assays only yield binary outcomes (Yes or No), limiting the ability to obtain more detailed insights from the data.

      The mixed swam assay is needed when studying T6SS effectors that are primarily secreted during Proteus’ swarming activity (Saak et al. 2017, Zepeda-Rivera et al. 2018). This limitation is one reason we utilize in vitro, in vivo, and bioinformatic analyses. Though the swarm competition assay yields a binary outcome, we are confident that the observed RdnI protection is due to interaction with a trans-cell RdnE via an active T6SS. By contrast, many manuscripts report co-expression of the EI pair (Yadev et al., 2021, Hespanhol et al., 2022) rather than secreted effectors, as we have achieved in this manuscript.

      (3) The discovery of cross-species protection is solely evident in the heterologous expression-competition model. It remains uncertain whether this is an isolated occurrence or a common characteristic of RdnI immunity proteins across various scenarios. Further investigations are necessary to determine the generality of this behavior.

      We agree, which is why we submitted this paper as a launching point for further investigations into the generality of non-cognate interactions and their potential impact on community structure.

      Comments from Reviewing Editor:<br />  - In addition to the references provided by reviewer#2, the first manuscript to show non-cognate binding of immunity proteins was Russell et al 2012 (PMID: 22607806).<br />  - IdrD was shown to form a subfamily of effectors in this manuscript by Hespanhol et al 2022 PMID: 36226828 that analyzed several T6SS effectors belonging to PDDExK, and it should be cited.

      We appreciate that the reviewer and eLife staff pointed out missed citations. We have incorporated these studies and cited them in the revised manuscript.

      [1] The Proteus RdnE in complex with either the Prevotella or Pseudomonas RdnI showed low confidence at the interface (pIDDT ~50-70%); this AI-predicted complex might support the lack of binding seen in the bacterial two-hybrid assay. On the other hand, the Proteus and Rothia RdnI N-terminal regions show higher confidence at the interface with RdnE. Despite this, the C-terminus of the Proteus RdnI shows especially low confidence (pIDDT ~50%) where it might interact near RdnE’s active site (as suggested by reviewer 1). Given this low confidence and the already stated inaccuracies of AI-generated complexes, we would rather wait for crystallization data to inform potential protection mechanisms of RdnI.

      Author response image 1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their constructive comments and suggestions. We have prepared a revised manuscript with updated quantification of theta cycle skipping, new statistical comparisons of the difference between the two behavioral tasks, and general improvements to the text and figures.

      Reviewer #1 (Public Review):

      Summary

      The authors provide very compelling evidence that the lateral septum (LS) engages in theta cycle skipping.

      Strengths

      The data and analysis are highly compelling regarding the existence of cycle skipping.

      Weaknesses

      The manuscript falls short on in describing the behavioral or physiological importance of the witnessed theta cycle skipping, and there is a lack of attention to detail with some of the findings and figures:

      More/any description is needed in the article text to explain the switching task and the behavioral paradigm generally. This should be moved from only being in methods as it is essential for understanding the study.

      Following this suggestion, we have expanded the description of the behavioral tasks in the Results section.

      An explanation is needed as to how a cell can be theta skipping if it is not theta rhythmic.

      A cell that is purely theta skipping (i.e., always fires on alternating theta cycles and never on adjacent theta cycles) will only have enhanced power at half theta frequency and not at theta frequency. Such a cell will therefore not be considered theta rhythmic in our analysis. Note, however, that there is a large overlap between theta rhythmic and theta skipping cell populations in our data (Figure 3 - figure supplement 2), indicating that most cells are not purely theta skipping.

      The most interesting result, in my opinion, is the last paragraph of the entire results section, where there is more switching in the alternation task, but the reader is kind of left hanging as to how this relates to other findings. How does this relate to differences in decoding of relative arms (the correct or incorrect arm) during those theta cycles or to the animal's actual choice? Similarly, how does it relate to the animal's actual choice? Is this phenomenon actually behaviorally or physiologically meaningful at all? Does it contribute at all to any sort of planning or decision-making?

      We agree that the difference between the two behavioral tasks is very interesting. It may provide clues about the mechanisms that control the cycle-by-cycle expression of possible future paths and the potential impact of goal-directed planning and (recent) experience. In the revised manuscript, we have expanded the analysis of the differences in theta-cycle dynamics between the two behavioral tasks. First, we confirm the difference through a new quantification and statistical comparison. Second, we performed additional analyses to explore the idea that the alternation of non-local representations reflects the number of relevant paths available to the animal (Figure 11 – figure supplements 2 and 3), but this did not appear to be the case. However, these results provide a starting point for future studies to clarify the task dependence of the theta- cycle dynamics of spatial representations and to address the important question of behavioral/physiological relevance.

      The authors state that there is more cycle skipping in the alternation task than in the switching task, and that this switching occurs in the lead-up to the choice point. Then they say there is a higher peak at ~125 in the alternation task, which is consistent. However, in the final sentence, the authors note that "This result indicates that the representations of the goal arms alternate more strongly ahead of the choice point when animals performed a task in which either goal arm potentially leads to reward." Doesn't either arm potentially lead to a reward (but different amounts) in the switching task, not the alternation task? Yet switching is stronger in the alternation task, which is not constant and contradicts this last sentence.

      The reviewer is correct that both choices lead to (different amounts of) reward in the switching task. As written, the sentence that the reviewer refers to is indeed not accurate and we have rephrased it to: “This result indicates that the representations of the goal arms alternate more strongly ahead of the choice point when animals performed a task in which either goal arm potentially leads to a desirable high-value reward.”.

      Additionally, regarding the same sentence - "representations of the goal arms alternate more strongly ahead of the choice point when the animals performed a task in which either goal arm potentially leads to reward." - is this actually what is going on? Is there any reason at all to think this has anything to do with reward versus just a navigational choice?

      We appreciate the reviewer’s feedback and acknowledge that our statement needs clarification. At the choice point in the Y-maze there are two physical future paths available to the animal (disregarding the path that the animal took to reach the choice point) – we assume this is what the reviewer refers to as “a navigational choice”. One hypothesis could be that alternation of goal arm representations is present whenever there are multiple future paths available, irrespective of the animal’s (learned) preference to visit one or the other goal arm. However, the reduced alternation of goal arm representations in the switching task that we report, suggests that the animal’s recent history of goal arm visits and reward expectations likely do influence the theta-cycle representations ahead of the choice point. We have expanded our analysis to test if theta cycle dynamics differ for trials before and after a switch in reward contingency in the switching task, but there was no statistical difference in our data. We have rewritten and expanded this part of the results to make our point more clearly.

      Similarly, the authors mention several times that the LS links the HPC to 'reward' regions in the brain, and it has been found that the LS represents rewarded locations comparatively more than the hippocampus. How does this relate to their finding?

      Indeed, Wirtshafter and Wilson (2020) reported that lateral septum cells are more likely to have a place field close to a reward site than elsewhere in their double-sided T-maze. It is possible that this indicates a shift towards reward or value representations in the lateral septum. In our study we did not look at reward-biased cells and whether they are more or less likely to engage in theta cycle skipping. This could be a topic for future analyses. It should be noted that the study by Wirtshafter and Wilson (2020) reports that a reward bias was predominantly present for place fields in the direction of travel away from the reward site. These reward-proximate LS cells may thus contribute to theta-cycle skipping in the inbound direction, but it is not clear if these cells would be active during theta sweeps when approaching the choice point in the outbound direction.

      Reviewer #2 (Public Review)

      Summary

      Recent evidence indicates that cells of the navigation system representing different directions and whole spatial routes fire in a rhythmic alternation during 5-10 Hz (theta) network oscillation (Brandon et al., 2013, Kay et al., 2020). This phenomenon of theta cycle skipping was also reported in broader circuitry connecting the navigation system with the cognitive control regions (Jankowski et al., 2014, Tang et al., 2021). Yet nothing was known about the translation of these temporally separate representations to midbrain regions involved in reward processing as well as the hypothalamic regions, which integrate metabolic, visceral, and sensory signals with the descending signals from the forebrain to ensure adaptive control of innate behaviors (Carus-Cadavieco et al., 2017). The present work aimed to investigate theta cycle skipping and alternating representations of trajectories in the lateral septum, neurons of which receive inputs from a large number of CA1 and nearly all CA3 pyramidal cells (Risold and Swanson, 1995). While spatial firing has been reported in the lateral septum before (Leutgeb and Mizumori, 2002, Wirtshafter and Wilson, 2019), its dynamic aspects have remained elusive. The present study replicates the previous findings of theta-rhythmic neuronal activity in the lateral septum and reports a temporal alternation of spatial representations in this region, thus filling an important knowledge gap and significantly extending the understanding of the processing of spatial information in the brain. The lateral septum thus propagates the representations of alternative spatial behaviors to its efferent regions. The results can instruct further research of neural mechanisms supporting learning during goal-oriented navigation and decision-making in the behaviourally crucial circuits entailing the lateral septum.

      Strengths

      To this end, cutting-edge approaches for high-density monitoring of neuronal activity in freely behaving rodents and neural decoding were applied. Strengths of this work include comparisons of different anatomically and probably functionally distinct compartments of the lateral septum, innervated by different hippocampal domains and projecting to different parts of the hypothalamus; large neuronal datasets including many sessions with simultaneously recorded neurons; consequently, the rhythmic aspects of the spatial code could be directly revealed from the analysis of multiple spike trains, which were also used for decoding of spatial trajectories; and comparisons of the spatial coding between the two differently reinforced tasks.

      Weaknesses

      Possible in principle, with the present data across sessions, longitudinal analysis of the spatial coding during learning the task was not performed. Without using perturbation techniques, the present approach could not identify the aspects of the spatial code actually influencing the generation of behaviors by downstream regions.

      Reviewer #3 (Public Review)

      Summary

      Bzymek and Kloosterman carried out a complex experiment to determine the temporal spike dynamics of cells in the dorsal and intermediate lateral septum during the performance of a Y-maze spatial task. In this descriptive study, the authors aim to determine if inputting spatial and temporal dynamics of hippocampal cells carry over to the lateral septum, thereby presenting the possibility that this information could then be conveyed to other interconnected subcortical circuits. The authors are successful in these aims, demonstrating that the phenomenon of theta cycle skipping is present in cells of the lateral septum. This finding is a significant contribution to the field as it indicates the phenomenon is present in neocortex, hippocampus, and the subcortical hub of the lateral septal circuit. In effect, this discovery closes the circuit loop on theta cycle skipping between the interconnected regions of the entorhinal cortex, hippocampus, and lateral septum. Moreover, the authors make 2 additional findings: 1) There are differences in the degree of theta modulation and theta cycle skipping as a function of depth, between the dorsal and intermediate lateral septum; and 2) The significant proportion of lateral septum cells that exhibit theta cycle skipping, predominantly do so during 'non-local' spatial processing.

      Strengths

      The major strength of the study lies in its design, with 2 behavioral tasks within the Y-maze and a battery of established analyses drawn from prior studies that have established spatial and temporal firing patterns of entorhinal and hippocampal cells during these tasks. Primary among these analyses, is the ability to decode the animal's position relative to locations of increased spatial cognitive demand, such as the choice point before the goal arms. The presence of theta cycle skipping cells in the lateral septum is robust and has significant implications for the ability to dissect the generation and transfer of spatial routes to goals within and between the neocortex and subcortical neural circuits.

      Weaknesses

      There are no major discernable weaknesses in the study, yet the scope and mechanism of the theta cycle phenomenon remain to be placed in the context of other phenomena indicative of spatial processing independent of the animal's current position. An example of this would be the ensemble-level 'scan ahead' activity of hippocampal place cells (Gupta et al., 2012; Johnson & Redish, 2007). Given the extensive analytical demands of the study, it is understandable that the authors chose to limit the analyses to the spatial and burst firing dynamics of the septal cells rather than the phasic firing of septal action potentials relative to local theta oscillations or CA1 theta oscillations. Yet, one would ideally be able to link, rather than parse the phenomena of temporal dynamics. For example, Tingley et al recently showed that there was significant phase coding of action potentials in lateral septum cells relative to spatial location (Tingley & Buzsaki, 2018). This begs the question as to whether the non-uniform distribution of septal cell activity within the Y-maze may have a phasic firing component, as well as a theta cycle skipping component. If so, these phenomena could represent another means of information transfer within the spatial circuit during cognitive demands. Alternatively, these phenomena could be part of the same process, ultimately representing the coherent input of information from one region to another. Future experiments will therefore have to sort out whether theta cycle skipping, is a feature of either rate or phase coding, or perhaps both, depending on circuit and cognitive demands.

      The authors have achieved their aims of describing the temporal dynamics of the lateral septum, at both the dorsal extreme and the intermediate region. All conclusions are warranted.

      Reviewer #1 (Recommendations For The Authors)

      The text states: "We found that 39.7% of cells in the LSD and 32.4% of cells in LSI had significantly higher CSI values than expected by chance on at least one of the trajectories." The text in the supplemental figure indicates a p-value of 0.05 was used to determine significance. However, four trajectory categories are being examined so a Bonferroni correction should be used (significance at p<0.0125).

      Indeed, a p-value correction for multiple tests should be performed when determining theta cycle skipping behavior for each of the four trajectories. We thank the reviewer for pointing out this oversight. We have implemented a Holm-Sidak p-value correction for the number of tested trajectories per cell (excluding trajectories with insufficient spikes). As a consequence, the number of cells with significant cycle-skipping activity decreased, but overall the results have not changed.

      Figure 4 is very confusing as raster plots are displayed for multiple animals but it is unclear which animal the LFP refers to? The bottom of the plot is also referenced twice in the figure caption.

      We apologize for the confusion. We have removed this figure in the revised manuscript, as it was not necessary to make the point about the spatial distribution of theta cycle skipping. Instead, we show examples of spatially-resolved cycle skipping in Figure 4 (formerly Figure 5 - supplementary figures 1 and 2) and we have added a plot with the spatially-resolved cycle skipping index for all analyzed cells in Figure 5A.

      Figure 6 has, I think, an incorrect caption or figure. Only A and B are marked in the figure but A-G are mentioned in the caption but do not appear to correspond to anything in the figure.

      Indeed, the caption was outdated. This has now been corrected.

      Figure 8 is also confusing for several reasons: how is the probability scale on the right related to multiple semi-separate (top and middle) figures? In the top and bottom figures, it is not clear what the right and left sides refer to. It is also unclear why a probability of 0.25 is used for position (seems potentially low). The caption also mentions Figure A but there are no lettered "sub" figures in Figure 8.

      The color bar on the right applies to both the top plot (directional decoding) and the middle plot (positional decoding). However, the maximum probability that is represented by black differs between the top and middle plots. We acknowledge that a shared color bar may lead to confusion and we have given each of the plots a separate color bar.

      As for the maximum probability of 0.25 for position: this was a typo in the legend. The correct maximum value is 0.5. In general, the posterior probability will be distributed over multiple (often neighboring) spatial bins, and the distribution of maximum probabilities will depend on the number of spatial bins, the level of spatial smoothing in the decoding algorithm, and the amount of decodable information in the data. It would be more appropriate to consider the integrated probability over a small section of the maze, rather than the peak probability that is assigned to a single 5 cm bin. Also, note that a posterior probability of 0.5 is many times higher than the probability associated with a uniform distribution, which is in our case.

      The left and right sides of the plots represent two different journeys that the animal ran. On the left an outbound journey is shown, and on the right an inbound journey. We have improved the figure and the description in the legend to make this clearer.

      The reviewer is correct that there are no panels in Figure 8 and we have corrected the legend.

      Some minor concerns

      The introduction states that "a few studies have reported place cell-like activity in the lateral septum (Tingley and Buzsaki, 2018; Wirtshafter and Wilson, 2020, 2019)." However, notably and controversially, the Tingley study is one of the few studies to find NO place cell activity in the lateral septum. This is sort of mentioned later but the citation in this location should be removed.

      The reviewer is correct, Tingley and Buzsaki reported a spatial phase code but no spatial rate code. We have removed the citation.

      Stronger position/direction coding in the dLS consistent with prior studies and they should be cited in text (not a novel finding).

      Thank you for pointing out this omission. Indeed, a stronger spatial coding in the dorsal lateral septum has been reported before, for example by Van der Veldt et al. (2021). We now cite this paper when discussing these findings.

      Why is the alternation task administered for 30m but the switching task for 45m?

      The reason is that rats received a larger reward in the switching task (in the high-reward goal arm) and took longer to complete trials on average. To obtain a more-or-less similar number of trials per session in both tasks, we extended the duration of switching task sessions to 45 minutes. We have added this explanation to the text.

      Regarding the percentage of spatially modulated cells in the discussion, it is also worth pointing out that bits/sec information is consistent with previous studies.

      Thank you for the suggestion. We now point out that the spatial information in our data is consistent with previous studies.

      Reviewer #2 (Recommendations For The Authors)

      While the results of the study are robust and timely, further details of behavioural training, additional quantitative comparisons, and improvements in the data presentation would make the study more comprehensible and complete.

      Major comments

      (1) I could not fully comprehend the behavioural protocols. They require a clearer explanation of both the specific rationale of the two tasks as well as a more detailed presentation of the protocols. Specifically:

      (1.1) In the alternation task, were the arms baited in a random succession? How many trials were applied per session? Fig 1D: how could animals reach high choice accuracy if the baiting was random?

      We used a continuous version of the alternation task, in which the animals were rewarded for left→home→right and right→home→left visit sequences. In addition, animals were always rewarded on inbound journeys. There was no random baiting of goal arms. Perhaps the confusion stems from our use of the word “trial” to refer to a completed lap (i.e., a pair of outbound/inbound journeys). On average, animals performed 54 of such trials per 30-minute session in the alternation task. We have expanded the description of the behavioral tasks in the Results and further clarified these points in the Methods section.

      (1.2) Were they rewarded for correct inbound trials? If there was no reward, why were they considered correct?

      Yes, rats received a reward at the home platform for correct inbound trials. We have now explicitly stated this in the text.

      (1.3) In the switch alternation protocol, for how many trials was one arm kept more rewarding than the other, and how many trials followed after the rewarding value switch?

      A switch was triggered when rats (of their own volition) visited the high-reward goal arm eight times in a row. Following a switch, the animals could complete as many trials as necessary until they visited the new high- reward goal arm in eight consecutive trials, which triggered another switch. As can be seen in Figure 1D, at the population level, animals needed ~13 trials to fully commit to the high-reward goal arm following a switch. We have further clarified the switching task protocol in the Results and Methods sections.

      (1.4) What does the phrase "the opposite arm (as 8 consecutive visits)" exactly mean? Sounds like 8 consecutive visits signalled that the arm was rewarded (as if were not predefined in the protocol).

      The task is self-paced and the animals initially visit both goal arms, before developing a bias for the high- reward goal arm. A switch of reward size was triggered as soon as the animal visited the high-reward goal arm for eight consecutive trials. We have rewritten the description of the switching task protocol, including this sentence, which hopefully clarifies the procedure.

      (1.5) P. 15, 1st paragraph, Theta cycle skipping and alternation of spatial representations is more prominent in the alternation task. Why in the switching task, did rats visit the left and right arms approximately equally often if one was more rewarding than the other? How many switches were applied per recording session, and how many trials were there in total?

      Both the left and right goal arms were sampled more or less equally by the animals because both goal arms at various times were associated with a large reward following switches in reward values during sessions. The number of switches per session varied from 1 to 3. Sampling of both goal arms was also evident at the beginning of each session and following each reward value switch, before animals switched their behavior to the (new) highly rewarded goal arm. In Table 1, we have now listed the number of trials and the number of reward-value switches for all sessions.

      (1.6) Is the goal arm in figures the rewarded/highly rewarded arm only or are non-baited arms also considered here?

      Both left and right arms are considered goal arms and were included in the analyses, irrespective of the reward that was received (or not received).

      (2) The spatial navigation-centred behavioural study design and the interpretation of results highlight the importance of the dorsal hippocampal input to the LS. Yet, the recorded LSI cells are innervated by intermediate and ventral aspects of the hippocampus, and LS receives inputs from the amygdala and the prefrontal cortex, which together may together bring about - crucial for the adaptive behaviours regulated by the LS - reward, and reward-prediction-related aspects in the firing of LS cells during spatial navigation. Does success or failure to acquire reward in a trial modify spatial coding and cycle skipping of LSD vs. LSI cells in ensuing inbound and outbound trials?

      This is an excellent question and given the length of the current manuscript, we think that exploration of this question is best left for a future extension of our study.

      A related question: in Figure 10, it is interesting that cycle skipping is prominent in the goal arm for outbound switching trials and inbound trials of both tasks. Could it be analytically explained by task contingencies and behaviour (e.g. correct/incorrect trial, learning dynamics, running speed, or acceleration)?

      Our observation of cycle skipping at the single-cell level in the goal arms is somewhat surprising and, we agree with the reviewer, potentially interesting. However, it was not accompanied by alternation of representations at the population level. Given the current focus and length of the manuscript, we think further investigation of cycle skipping in the goal arm is better left for future analyses.

      (3) Regarding possible cellular and circuit mechanisms of cycle skipping and their relation to the alternating representations in the LS. Recent history of spiking influences the discharge probability; e.g. complex spike bursts in the hippocampus are associated with a post-burst delay of spiking. In LS, cycle skipping was characteristic for LS cells with high firing rates and was not uniformly present in all trajectories and arms. The authors propose that cycle skipping can be more pronounced in epochs of reduced firing, yet the opposite seems also possible - this phenomenon can be due to an intermittently increased drive onto some LS cells. Was there a systematic relationship between cycle skipping in a given cell and the concurrent firing rate or a recent discharge with short interspike intervals?

      In our discussion, we tried to explain the presence of theta cycle skipping in the goal arms at the single-cell level without corresponding alternation dynamics at the population level. We mentioned the possibility of a decrease in excitatory drive. As the reviewer suggests, an increase in excitatory drive combined with post- burst suppression or delay of spiking is an alternative explanation. We analyzed the spatial tuning of cells with theta cycle skipping and found that, on average, these cells have a higher firing rate in the goal arm than the stem of the maze in both outbound and inbound run directions (Figure 5 – figure supplement 1). In contrast, cells that do not display theta cycle skipping do not show increased firing in the goal arm. These results are more consistent with the reviewer’s suggested mechanism and we have updated the discussion accordingly.

      (4) Were the differences between the theta modulation (cycle skipping) of local vs. non-local representations (P.14, line 10-12, "In contrast...", Figure 9A) and between alternation vs. switching tasks (Figure 10 C,D) significantly different?

      We have added quantification and statistical comparisons for the auto- and cross-correlations of the local/non-local representations. The results indeed show significantly stronger theta cycle skipping of the non-local representations as compared to the local representations (Figure 10 - figure supplement 1A), a stronger alternation of non-local representations in the outbound direction (Figure 10 - figure supplement 1B), and significant differences between the two tasks (Figure 11E,F).

      (5) Regarding the possibility of prospective coding in LS, is the accurate coding of run direction not consistent with prospective coding? Can the direction be decoded from the neural activity in the start arm? Are the cycling representations of the upcoming arms near the choice point equally likely or preferential for the then- selected arm?

      The coding of run direction (outbound or inbound) is distinct from the prospective/retrospective coding of the goal arm. As implemented, the directional decoding model does not differentiate between the two goal arms and accurate decoding of direction with this model can not inform us whether or not there is prospective (or retrospective) coding. To address the reviewer’s comments, we performed two additional analyses. First, we analyzed the directional (outbound/inbound) decoding performance as a function of location in the maze (Figure 6 - figure supplement 3E). The results show that directional decoding performance is high in both stem and goal arms. Second, we analyzed how well we can predict the trajectory type (i.e., to/from the left or right goal arm) as a function of location in the maze, and separately for outbound and inbound trajectories (Figure 6 - figure supplement 3C,D). The results show that on outbound journeys, decoding the future goal arm is close to chance when the animals are running along the stem. The decoding performance goes up around the choice point and reaches the highest level when animals are in the goal arm.

      (6) Figure 10 seems to show the same or similar data as Figures 5 (A,B) and 9 (C,D).

      Figure 10 (figure 11 in revised manuscript) re-analyzes the same data as presented in Figures 5 and 9, but separates the experimental sessions according to the behavioral task. We now explicitly state this.

      Minor comments

      (1) If cycle skipping in the periodicity of non-local representations was more prominent in alternation than in the switching task, one might expect them to be also prominent in early trials of the switching task, when the preference of a more rewarding arm is not yet established. Was this the case?

      The reviewer makes an interesting suggestion. Indeed, if theta cycle skipping and the alternation of non-local representations reflect that there are multiple paths that the animal is considering, one may predict that the theta skipping dynamics are similar between the two tasks in early trials (as the reviewer suggests). Similarly, one may predict that in the switching task, the alternation of non-local representations is weaker immediately before a reward contingency switch (when the animal has developed a bias towards the goal arm with a large reward) as compared to after the switch.

      We have now quantified the theta cycle dynamics of spatial representations in the early trials in each session of both tasks (Figure 11 - figure supplement 2) and in the trials before and after each switch in the switching task (Figure 11 - figure supplement 3).

      The results of the early trial analysis indicate stronger alternation of non-local representations in the alternation task than in the switching task (consistent with the whole session analysis), which is contrary to the prediction.

      The pre-/post-switch analysis did not reveal a significant difference between the trials before and after a reward contingency switch. If anything, there was a trend towards stronger theta cycle skipping/alternation in the trials before a switch, which would be opposite to the prediction.

      These results do not appear to support the idea that the alternation of non-local representations reflects the number of relevant paths available to the animal. We have updated the text to incorporate these new data and discuss the implications.

      (2) Summary: sounds like the encoding of spatial information and its readout in the efferent regions are equally well established.

      Thank you for pointing this out.

      (3) Summary: "motivation and reward processing centers such as the ventral tegmental area." How about also mentioning here the hypothalamus, which is a more prominent output of the lateral septum than the VTA?

      We have now also mentioned the hypothalamus.

      (4) "lateral septum may contribute to the hippocampal theta" - readers not familiar with details of the medial vs. lateral septum research may misinterpret the modest role of LS in theta compared to MS.

      We have added “in addition to the strong theta drive originating from the medial septum” to make clear that the lateral septum has a modest role in hippocampal theta generation.

      (5) "(Tingley and Buzsáki, 2018) found a lack of spatial rate coding in the lateral septum and instead reported a place coding by specific phases of the hippocampal theta rhythm (Rizzi-Wise and Wang, 2021) " needs rephrasing.

      Thank you, we have rephrased the sentence.

      (6) Figure 4 is a bit hard to generalize. The authors may additionally consider a sorted raster presentation of the dataset in this main figure.

      We have removed this figure in the revised manuscript, as it was not necessary to make the point about the location of theta cycle skipping. Instead, we show examples of spatially-resolved cycle skipping in Figure 4 (formerly Figure 5 - supplementary figures 1 and 2), and, following the reviewer’s suggestion, we have added a plot with the spatially-resolved cycle skipping index for all analyzed cells (Figure 5A).

      (7) It would help if legends of Figure 5 (and related supplementary figures) state in which of the two tasks the data was acquired, as it is done for Figure 10.

      Thank you for the suggestion. The legends of Figure 4A,B (formerly Figure 5 – supplemental figures 1 and 2) and Figure 5 now include in which behavioral task the data was acquired.

      (8) Page 10, "Spatial coding...", 1st Citing the initial report by Leugeb and Mizumori would be appropriate here too.

      The reviewer is correct. We have added the citation.

      (9) The legend in Figure 6 (panels A-G) does not match the figure (only panels A,B). What is shown in Fig. 6B, the legend does not seem to fully match.

      Indeed, the legend was outdated. This has now been corrected.

      (10) 7 suppl., if extended to enable comparisons, could be a main figure. Presently, Figure 7C does not account for the confounding effect of population size and is therefore difficult to interpret without complex comparisons with the Supplementary Figure which is revealing per se.

      We thank the reviewer for their suggestion. We have changed Figure 7 such that it only shows the analysis of decoding performed with all LSD and LSI cells. Figure 7 – supplemental figure 1 has been transformed into main Figure 8, with the addition of a panel to show a statistical comparison between decoding performance in LSD and LSI with a fixed number of cells.

      (11) 14, line 10 there is no Figure 8A

      This has been corrected.

      (12) 15 paragraph 1, is the discussed here model the one from Kay et al?

      From Kay et al. (2020) and also Wang et al. (2020). We have added the citations.

      (13) Figure 5 - Figure Supplement 1 presents a nice analysis that, in my view, can merit a main figure. I could not find the description of the colour code in CSI panels, does grey/red refer to non/significant points?

      Indeed, grey/red refers to non-significant points and significant points respectively. We have clarified the color code in the figure legend. Following the reviewer’s suggestion, we have made Figure 5 Supplement 1 and 2 a main figure (Figure 4).

      (14) Figure 5 -Figure Supplement 2. Half of the cells (255 and 549) seems not to be representative of the typically high SCI in the goal arm in left and right inbound trials combined (Figure 5 A). Were the changes in CSI in the right and left inbound trials similar enough to be combined in Fig 5A? Otherwise, considering left and right inbound runs separately and trying to explain where the differences come from would seem to make sense.

      Figure 5 – figure supplement 2 is now part of the new main Figure 4. Originally, the examples were from a single session and the same cells as shown in the old Figure 4. However, since the old Figure 4 has been removed, we have selected examples from different sessions and both left/right trajectories that are more representative of the overall distribution. We have further added a plot with the spatially-resolved cycle skipping for all analyzed cells in Figure 5A.

      (15) In the second paragraph of the Discussion, dorso-ventral topography of hippocampal projections to the LS (Risold and Swanson, Science, 90s) could be more explicitly stated here.

      Thank you for the suggestion. We have now explicitly mentioned the dorsal-ventral topography of hippocampal-lateral septum projections and cite Risold & Swanson (1997).

      (16) Discussion point: why do the differences in spatial information of cells in the ventral/intermediate vs. dorsal hippocampus not translate into similarly prominent differences in LSI vs. LSD?

      In our data, we do observe clear differences in spatial coding between LSD and LSI. Specifically, cell activity in the LSD is more directional, has higher goal arm selectivity, and higher spatial information (we have now added statistical comparisons to Figure 6 – figure supplement 1). As a result, spatial decoding performance is much better for LSD cell populations than LSI cell populations (see updated Figure 8, with statistical comparison of decoding performance). Spatial coding in the LS is not as strong as in the hippocampus, likely because of the convergence of hippocampal inputs, which may give the impression of a less prominent difference between the two subregions.

      (17) Discussion, last paragraph: citation of the few original anatomical and neurophysiological studies would be fitting here, in addition to the recent review article.

      Thank you for the suggestion. We have added selected citations of the original literature.

      (18) Methods, what was the reference electrode?

      We used an external reference electrode that was soldered to a skull screw, which was positioned above the cerebellum. We have added this to the Methods section.

      (19) Methods, Theta cycle skipping: bandwidth = gaussian kerner parameter?

      The bandwidth is indeed a parameter of the Gaussian smoothing kernel and is equal to the standard deviation.

      Reviewer #3 (Recommendations For The Authors)

      Below I offer a short list of minor comments and suggestions that may benefit the manuscript.

      (A) I was not able to access the Open Science Framework Repository. Can this be rectified?

      Thank you for checking the OSF repository. The data and analysis code are now publicly available.

      (B) In the discussion the authors should attempt to flesh out whether they can place theta cycle skipping into context with left/right sweeps or scan ahead phenomena, as shown in the Redish lab.

      Thank you for the excellent suggestion. We have now added a discussion of the possible link between theta cycle skipping and the previously reported scan-ahead theta sweeps.

      (C) What is the mechanism of cycle skipping? This could be relevant to intrinsic vs network oscillator models. Reference should also be made to the Deshmukh model of interference between theta and delta (Deshmukh, Yoganarasimha, Voicu, & Knierim, 2010).

      We had discussed a potential mechanism in the discussion (2nd to last paragraph in the revised manuscript), which now includes a citation of a recent computational study (Chu et al., 2023). We have now also added a reference to the interference model in Deshmukh et al, 2010.

      (D) Little background was given for the motivation and expectation for potential differences between the comparison of the dorsal and intermediate lateral septum. I don't believe that this is the same as the dorsal/ventral axis of the hippocampus, but if there's a physiological justification, the authors need to make it.

      We have added a paragraph to the introduction to explain the anatomical and physiological differences across the lateral septum subregions that provide our rationale for comparing dorsal and intermediate lateral septum (we excluded the ventral lateral septum because the number of cells recorded in this region was too low).

      (E) It would help to label "outbound" and "inbound" on several of the figures. All axes need to be labeled, with appropriate units indicated.

      We have carefully checked the figures and added inbound/outbound labels and axes labels where appropriate.

      (F) In Figure 6, the legend doesn't match the figure.

      Indeed, the legend was outdated. This has now been corrected.

      (G) The firing rate was non-uniform across the Y-maze. Does this mean that the cells tended to fire more in specific positions of the maze? If so, how would this affect the result? Would increased theta cycle skipping at the choice point translate to a lower firing rate at the choice point? Perhaps less overdispersion of the firing rate (Fenton et al., 2010)?

      Individual cells indeed show a non-uniform firing rate across the maze. To address the reviewer’s comment and test if theta cycle skipping cells were active preferentially near the choice point or other locations, we computed the mean-corrected spatial tuning curves for cell-trajectory pairs with and without significant theta cycle skipping. This additional analysis indicates that, on average, the population of theta cycle skipping cells showed a higher firing rate in the goal arms than in the stem of the maze as compared to non-skipping cells for outbound and inbound directions (shown in Figure 5 - figure supplement 1).

      (H) As mentioned above, it could be helpful to look at phase preference. Was there an increased phase preference at the choice point? Would half-cycle firing correlate with an increased or decreased phase preference? Based on prior work, one would expect increased phase preference, at least in CA1, at the choice point (Schomburg et al., 2014). In contrast, other work might predict phasic preference according to spatial location (Tingley & Buzsaki, 2018). Including phase analyses is a suggestion, of course. The manuscript is already sufficiently novel and informative. Yet, the authors should state why phase was not analyzed and that these questions remain for follow-up analyses. If the authors did analyze this and found negative results, it should be included in this manuscript.

      We thank the reviewer for their suggestion. We have not yet analyzed the theta phase preference of lateral septum cells or other relations to the theta phase. We agree that this would be a valuable extension of our work, but prefer to leave it for future analyses.

      (I) One of the most important aspects of the manuscript, is that there is now evidence of theta cycle skipping in the circuit loop between the EC, CA1, and LS. This now creates a foundation for circuit-based studies that could dissect the origin of route planning. Perhaps the authors should state this? In the same line of thinking, how would one determine whether theta cycle skipping is necessary for route planning as opposed to a byproduct of route planning? While this question is extremely complex, other studies have shown that spatial navigation and memory are still possible during the optogenetic manipulation of septal oscillations (Mouchati, Kloc, Holmes, White, & Barry, 2020; Quirk et al., 2021). However, pharmacological perturbation or lesioning of septal activity can have a more profound effect on spatial navigation (Bolding, Ferbinteanu, Fox, & Muller, 2019; Winson, 1978). As a descriptive study, I think it would be helpful to remind the readers of these basic concepts.

      We thank the reviewer for their comment and for pointing out possible future directions for linking theta cycle skipping to route planning. Experimental manipulations to directly test this link would be very challenging, but worthwhile to pursue. We now mention how circuit-based studies may help to test if theta cycle skipping in the broader subcortical-cortical network is necessary for route planning. Given that the discussion is already quite long, we decided to omit a more detailed discussion of the possible role of the medial septum (which is the focus of the papers cited by the reviewer).

      Very minor points

      (A) In the introduction, "one study" begins the sentence but there is a second reference.

      Thank you, we have rephrased the sentence.

      (B) Also in the introduction, it could be helpful to have an operational definition of theta cycle skipping (i.e., 'enhanced rhythmicity at half theta frequency').

      We followed the reviewer’s suggestion.

      (C) The others should be more explicit in the introduction about their main question. Theta cycle skipping exists in CA1, and then import some of the explanations mentioned in the discussion to the introduction (i.e., attractors states of multiple routes). The main question is then whether this phenomenon, and others from CA1, translate to the output in LS.

      We have edited the introduction to more clearly state the main question of our study, following the suggestion from the reviewer.

      (D) There are a few instances of extra closing parentheses.

      We checked the text but did not find instances of erroneous extra closing parentheses. There are instances of nested parentheses, which may have given the impression that closing parentheses were duplicated.

      (E) The first paragraph of the Discussion lacks sufficient references.

      We have now added references to the first paragraph of the discussion.

      (F) At the end of the 2nd paragraph in the Discussion, the comparison is missing. More than what? It's not until the next reference that one can assume that the authors are referring to a dorsal/ventral axis. However, the physiological motivation for this comparison is lacking. Why would one expect a dorsal/intermediate continuum for theta modulation as there is along the dorsal/ventral axis of the hippocampus?

      Thank you for spotting this omission. We have rewritten the paragraph to more clearly make the parallel between dorsal-ventral gradients in the lateral septum and hippocampus and how this relates to the topographical connections between the two structures.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer 1 (Public Review):

      Summary

      The manuscript uses state-of-the-art analysis technology to document the spatio-temporal dynamics of brain activity during the processing of threats. The authors offer convincing evidence that complex spatio-temporal aspects of brain dynamics are essential to describe brain operations during threat processing.

      Strengths

      Rigorous complex analyses well suited to the data.

      Weaknesses

      Lack of a simple take-home message about discovery of a new brain operation.

      We have addressed the concern under response to item 1 in Recommendations for the authors of Reviewer 2 below.

      Reviewer 1 (Recommendations for the authors):

      The paper presents sophisticated analyses of how the spatiotemporal activity of the brain processes threats. While the study is elegant and relevant to the threat processing literature, it could be improved by better clarification of novelty, scope, assumptions and implications. Suggestions are reported below.

      (1) Introduction: It is difficult to understand what is unsatisfactory in the present literature and why we need this study. For example, lines 57-64 report what works well in the work of Anderson and Fincham but do not really describe what this approach lacks, either in failing to explain real data in conceptual terms.

      We have edited the corresponding lines to better describe what such approaches generally lack:

      Introduction; Lines 63-66: However, the mapping between brain signals and putative mental states (e.g., “encoding”) remained speculative. More generally, state-based modeling of fMRI data would benefit from evaluation in contexts where the experimental paradigm affords a clearer mapping between discovered states and experimental manipulation.

      (2) Also, based on the introduction it is unclear if the focus is on understanding the processing of threat or in the methodological development of experimental design and analysis paradigms for more ecologically valid situations.

      In our present work, we tried to focus on understanding dynamics of threat processing while also contributing to methodological development of analysis of dynamic/ecologically inspired experiments. To that end, we have added a new paragraph at the end of Introduction to clarify the principal focus of our work:

      Introduction; Lines 111-118: Is the present contribution focused on threat processing or methodological developments for the analysis of more continuous/ecologically valid paradigms? Our answer is “both”. One goal was to contribute to the development of a framework that considers brain processing to be inherently dynamic and multivariate. In particular, our goal was to provide the formal basis for conceptualizing threat processing as a dynamic process (see (Fanselow and Lester, 1987)) subject to endogenous and exogenous contributions. At the same time, our study revealed how regions studied individually in the past (e.g., anterior insula, cingulate cortex) contribute to brain states with multi-region dynamics.

      (3) The repeated statement, based on the Fiete paper, that most analyses or models of brain activity do not include an exogenous drive seems an overstatement. There is plenty of literature that not only includes exogenous drives but also studies and documents them in detail. There are many examples, but a prominent one is the study of auditory processing. Essentially all human brain areas related to hearing (not only the activity of individual areas but also their communication) are entrained by the exogenous drive of speech (e.g. J. Gross et al, PLoS Biology 11 e1001752, 2013).

      We have altered the original phrasing, which now reads as:

      Introduction; Lines 93-95: Importantly, we estimated both endogenous and exogenous components of the dynamics, whereas some past work has not modeled both contributions (see discussion in (Khona and Fiete, 2022)).

      Discussion; Lines 454-455: Work on dynamics of neural circuits in systems neuroscience at times assumes that the target circuit is driven only by endogenous processes (Khona and Fiete, 2022).

      (4) Attractor dynamics is used as a prominent descriptor of fMRI activity, yet the discussion of how this may emerge from the interaction between areas is limited. Is it related to the way attractors emerge from physical systems or neural networks (e.g. Hopfield?).

      This is an important question that we believe will benefit from computational and mathematical modeling, but we consider it beyond the scope of the present paper.

      (5) Fig 4 shows activity of 4 regions, not 2 s stated in lines 201-202. Correct?

      Fig. 4 shows activity of two regions and also the average activity of regions belonging to two resting-state networks engaged during threat processing (discussed shortly after lines 201-202). To clarify the above concern, we have changed the following line:

      Results; Lines 228-230: In Fig. 4, we probed the average signals from two resting-state networks engaged during threat-related processing, the salience network which is particularly engaged during higher threat, and the default network which is engaged during conditions of relative safety.

      (6) It would be useful to state more clearly how Fig 7B, C differs from Fig 2A, B (my understanding it is that in the former they are isolating the stimulus-driven processes)

      We have clarified this by adding the following line in the Results:

      Results; Lines 290-292: Note that in Fig. 7B/C we evaluated exogenous contributions only for stimuli associated with each state/state transition reported in Fig. 2A/B (see also Methods).

      Reviewer 2 (Public Review):

      Summary

      This paper by Misra and Pessoa uses switching linear dynamical systems (SLDS) to investigate the neural network dynamics underlying threat processing at varying levels of proximity. Using an existing dataset from a threat-of-shock paradigm in which threat proximity is manipulated in a continuous fashion, the authors first show that they can identify states that each has their own linear dynamical system and are consistently associated with distinct phases of the threat-of-shock task (e.g., “peri-shock”, “not near”, etc). They then show how activity maps associated with these states are in agreement with existing literature on neural mechanisms of threat processing, and how activity in underlying brain regions alters around state transitions. The central novelty of the paper lies in its analyses of how intrinsic and extrinsic factors contribute to within-state trajectories and betweenstate transitions. A final set of analyses shows how the findings generalize to another (related) threat paradigm.

      Strengths

      The analyses for this study are conducted at a very high level of mathematical and theoretical sophistication. The paper is very well written and effectively communicates complex concepts from dynamical systems. I am enthusiastic about this paper, but I think the authors have not yet exploited the full potential of their analyses in making this work meaningful toward increasing our neuroscientific understanding of threat processing, as explained below.

      Weaknesses

      (1) I appreciate the sophistication of the analyses applied and/or developed by the authors. These methods have many potential use cases for investigating the network dynamics underlying various cognitive and affective processes. However, I am somewhat disappointed by the level of inferences made by the authors based on these analyses at the level of systems neuroscience. As an illustration consider the following citations from the abstract: “The results revealed that threat processing benefits from being viewed in terms of dynamic multivariate patterns whose trajectories are a combination of intrinsic and extrinsic factors that jointly determine how the brain temporally evolves during dynamic threat” and “We propose that viewing threat processing through the lens of dynamical systems offers important avenues to uncover properties of the dynamics of threat that are not unveiled with standard experimental designs and analyses”. I can agree to the claim that we may be able to better describe the intrinsic and extrinsic dynamics of threat processing using this method, but what is now the contribution that this makes toward understanding these processes?

      We have addressed the concern under response to item 1 in Recommendations for the authors below.

      (2) How sure can we be that it is possible to separate extrinsically and intrinsically driven dynamics?

      We have addressed the concern under response to item 2 in Recommendations for the authors below.

      Reviewer 2 (Recommendations for the authors):

      (1) To address the first point under weaknesses above: I would challenge the authors to make their results more biologically/neuroscientifically meaningful, in particular in the sections (in results and/or discussion) on how intrinsic and extrinsic factors contribute to within-state trajectories and between-state transitions, and make those explicit in both the abstract and the discussion (what exactly are the properties of the dynamics of threat that are uncovered?). The authors may also argue that the current approach lies the groundwork for such efforts, but does not currently provide such insights. If they would take this position, that should be made explicit throughout (which would make it more of a methodological paper).

      The SLDS approach provides, we believe, a powerful framework to describe system-level dynamics (of threat processing in the the present case). A complementary type of information can be obtained by studying the contribution of individual components (brain regions) within the larger system (brain), an approach that helps connect our approach to studies that typically focus on the contributions of individual regions, and contributes to providing more neurobiological interpretability to the results. Accordingly, we developed a new measure of region importance that captured the extent to which individual brain regions contributed to driving system dynamics during a given state.

      Abstract; Lines 22-25: Furthermore, we developed a measure of region importance that quantifies the contributions of an individual brain region to system dynamics, which complements the system-level characterization that is obtained with the state-space SLDS formalism.

      Introduction; Lines 95-99: A considerable challenge in state-based modeling, including SLDS, is linking estimated states and dynamics to interpretable processes. Here, we developed a measure of region importance that provides a biologically meaningful way to bridge this gap, as it quantifies how individual brain regions contribute to steering state trajectories.

      Results; Lines 302-321: Region importance and steering of dynamics: Based on time series data and input information, the SLDS approach identifies a set of states and their dynamics. While these states are determined in the latent space, they can be readily mapped back to the brain, allowing for the characterization of spatiotemporal properties across the entire brain. Since not all regions contribute equally to state properties, we propose that a region’s impact on state dynamics serves as a measure of its importance.

      We illustrate the concept for STATE 5 (“near miss”) in Fig. 8 (see Fig. S17 for all states). Fig. 8A shows importance in the top row and activity below as a function of time from state entry.The dynamics of importance and activity can be further visualized (Fig. 8B), where some regions of particularly high importance are illustrated together with the ventromedial PFC, a region that is typically not engaged during high-threat conditions. Notably, the importance of the dorsal anterior insula increased quickly in the first time points, and later decreased. In contrast, the importance of the periaqueductal gray was relatively high from the beginning of the state and decreased moderately later.

      Fig. 8C depicts the correlation between these measures as a function of time. For all but STATE 1, the correlation increased over time. Interestingly, for STATES 4-5, the correlation was low at the first and second time points of the state (and for STATE 2 at the first time point), and for STATE 3 the measures were actually anticorrelated; both cases indicate a dissociation between activity and importance. In summary, our results illustrate that univariate region activity can differ from multivariate importance, providing a fruitful path to understand how individual brain regions contribute to collective dynamic properties.

      Discussion; Lines 466-487: In the Introduction, we motivated our study in terms of determining multivariate and distributed patterns of activity with shared dynamics. At one end of the spectrum, it is possible to conceptualize the whole brain as dynamically evolving during a state; at the other end, we could focus on just a few “key” regions, or possibly a single one (at which point the description would be univariate). Here, we addressed this gap by studying the importance of regions to state dynamics: To what extent does a region steer the trajectory of the system? From a mathematical standpoint, our proposed measure is not merely a function of activity of a region but also of the coefficients of the dynamics matrix capturing its effect on across-region dynamics (Eichler, 2005; Smith et al., 2010).

      How distributed should the dynamics of threat be considered? One answer to this question is to consider the distribution of importance values for all states. For STATE 1 (“post shock”), a few regions displayed the highest importance values for a few time points. However, for the other states the distribution of importance values tended to be more uniform at each time point. Thus, based on our proposed importance measure, we conclude that threat-related processing is profitably viewed as substantially distributed. Furthermore, we found that while activity and importance were relatively correlated, they could also diverge substantially. Together, we believe that the proposed importance measure provides a valuable tool for understanding the rich dynamics of threat processing. For example, we discovered that the dorsal anterior insula is important not only during high-anxiety states (such as STATE 5; “near miss”) but also, surprisingly, for a state that followed the aversive shock event (STATE 1; “post shock”). Additionally, we noted that posterior cingulate cortex, widely known to play a central role in the default mode network, to have the highest importance among all other regions in driving dynamics of low-anxiety states (such as STATE 3 and STATE 4; “not near”).

      Methods; Lines 840-866: Region importance We performed a “lesion study”, where we quantified how brain regions contribute to state dynamics by eliminating (zeroing) model parameters corresponding to a given region, and observing the resulting changes in system dynamics. According to our approach, the most important regions are those that cause the greatest change in system dynamics when eliminated.

      The SLDS model represents dynamics in a low dimensional latent space and model parameters are not readily available at the level of individual regions. Thus, the first step was to project the dynamics equation onto the brain data prior to computing importance values. Thus, the linear dynamics equation in the latent space (Eq. 2) was mapped to the original data space of N = 85 ROIs using the emissions model (Eq. 1):

      where C<sup>†</sup> represents the Moore-Penrose pseudoinverse of C, and and denote the corresponding dynamics matrix, input matrix, and bias terms in the original data space.

      Based on the above, we defined the importance of the i<sup>th</sup> ROI at time t based on quantifying the impact of “lesioning” the i<sup>th</sup> ROI, i.e., by setting the i<sup>th</sup> column of , the i<sup>th</sup> row of ,   and the i<sup>th</sup> element of to 0, denoted , , and respectively. Formally, the importance of the i<sup>th</sup> ROI was defined as:

      where ‘∗’ indicates element-wise multiplication of a scalar with a vector, is the activity of i<sup>th</sup> ROI at time corresponds to the i<sup>th</sup> column of is the inner product between i<sup>th</sup> row of and input corresponds to the i<sup>th</sup> element of and represents an indicator vector corresponding to the i<sup>th</sup> ROI. Note that the term is a function of both the i<sup>th</sup> ROI’s activity as well as the coefficients of the dynamics matrix capturing the effect of region i on the one-step dynamics of the entire system (Eichler, 2005; Smith et al., 2010); the remaining terms capture the effect of the external inputs and the bias term on the one-step dynamics of the i<sup>th</sup> ROI.

      After computing for a given run, the resultant importance time series was normalized to zero mean and unit variance.

      (2) To address the second point under the weaknesses above: Given that the distinction between intrinsic and extrinsic dynamics appears central to the novelty of the paper, I would suggest the authors explicitly address this issue in the introduction and/or discussion sections.

      The distinction between intrinsic and extrinsic dynamics is a modeling assumption of SLDS. We used such an assumption because in experimental designs with experimenter manipulated inputs one can profitably investigate both types of contribution to dynamics. While we should not reify the model’s assumption, we can gain confidence in our separation of extrinsically and intrinsically driven dynamics through controlled experiments where we can manipulate external inputs, or by demonstrating time-scale separation of intrinsic and extrinsic dynamics and that they operate at different frequencies. This is an important question that requires additional computational/mathematical modeling, but we consider it beyond the scope of the current paper. We have added the following lines in the discussion section:

      Discussion; Lines 521-528: A further issue that we wish to discuss is related to the distinction between intrinsic and extrinsic dynamics, which is explicitly modeled in our SLDS approach (see Methods, equation 2). We believe this is a powerful approach because in experimental designs with experimenter manipulated inputs, one can profitably investigate both types of contribution to dynamics. However, complete separation between intrinsic and extrinsic dynamics is challenging to ascertain. More generally, one can gain confidence in their separation through controlled experiments where external inputs are manipulated, or by demonstrating timescale separation of intrinsic and extrinsic dynamics.

      (3) In the abstract, the statement “.. studies in systems neuroscience that frequently assume that systems are decoupled from external inputs” sounds paradoxical after first introducing how threat processing is almost exclusively studied using blocked and event-related task designs (which obviously rely on external inputs only). Please clarify this.

      In this work, we wished to state that the SLDS framework characterizes both endogenous and exogenous contributions to dynamics, whereas some past work has not modeled both contributions. To clarify, we have changed the corresponding line:

      Abstract; Lines 19-20: Importantly, we characterized both endogenous and exogenous contributions to dynamics.

      (4) In the abstract, the first mention of circles comes out of the blue; the paradigm needs to be introduced first to make this understandable.

      We have rephrased the corresponding text:

      Abstract; Lines 14-17: First, we demonstrated that the SLDS model learned the regularities of the experimental paradigm, such that states and state transitions estimated from fMRI time series data from 85 regions of interest reflected threat proximity and threat approach vs. retreat.

      (5 In Figure 3, the legend shows z-scores representing BOLD changes associated with states. However, the z-scores are extremely low (ranging between -.4 and .4). Can this be correct, given that maps are thresholded at p < ._001 (i.e., _z > 3_._09)? A similar small range of z-scores is shown in the legend of Fig 5. Please check the z-score ranges.

      The p-value threshold used in Fig. 3 is based on the voxelwise t-test conducted between the participantbased bootstrapped maps and null maps (see Methods : State spatial maps : “To identify statistically significant voxels, we performed a paired t-test between the participant-based boostrapped maps and the null maps.”). Thus, the p-value threshold in the figure does not correspond to the z-scores of the groupaveraged state-activation maps. Similarly in Fig. 5, we only visualized the state-wise attractors on a brain surface map without any thresholding. The purpose of using a z-score color bar was to provide a scale comparable to that of BOLD activity.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Thank you very much for the careful and positive reviews of our manuscript. We have addressed each comment in the attached revised manuscript. We describe the modifications below. To avoid confusion, we've changed supplementary figure and table captions to start with "Supplement Figure" and "Supplementary Table," instead of "Figure" and "Table."

      We have modified/added:

      ● Supplementary Table S1: AUC scores for the top 10 frequent epitope types (pathogens) in the testing set of epitope split.

      ● Supplementary Table S5: AUCs of TCR-epitope binding affinity prediction models with BLOSUM62 to embed epitope sequences.

      ● Supplementary Table S6: AUCs of TCR-epitope binding affinity prediction models trained on catELMo TCR embeddings and random-initialized epitope embeddings.

      ● Supplementary Table S7: AUCs of TCR-epitope binding affinity prediction models trained on catELMo and BLOSUM62 embeddings.

      ● Supplementary Figure 4: TCR clustering performance for the top 34 abundant epitopes representing 70.55% of TCRs in our collected databases.

      ● Section Discussion.

      ● Section 4.1 Data: TCR-epitope pairs for binding affinity prediction.

      ● Section 4.4.2 Epitope-specific TCR clustering.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this manuscript, the authors described a computational method catELMo for embedding TCR CDR3 sequences into numeric vectors using a deep-learning-based approach, ELMo. The authors applied catELMo to two applications: supervised TCR-epitope binding affinity prediction and unsupervised epitope-specific TCR clustering. In both applications, the authors showed that catELMo generated significantly better binding prediction and clustering performance than other established TCR embedding methods. However, there are a few major concerns that need to be addressed.

      (1) There are other TCR CDR3 embedding methods in addition to TCRBert. The authors may consider incorporating a few more methods in the evaluation, such as TESSA (PMCID: PMC7799492), DeepTCR (PMCID: PMC7952906) and the embedding method in ATM-TCR (reference 10 in the manuscript). TESSA is also the embedding method in pMTnet, which is another TCR-epitope binding prediction method and is the reference 12 mentioned in this manuscript.

      TESSA is designed for characterizing TCR repertoires, so we initially excluded it from the comparison. Our focus was on models developed specifically for amino acid embedding rather than TCR repertoire characterization. However, to address the reviewer's inquiry, we conducted further evaluations. Since both TESSA and DeepTCR used autoencoder-based models to embed TCR sequences, we selected one used in TESSA for evaluation in our downstream prediction task, conducting ten trials in total. It achieved an average AUC of 75.69 in TCR split and 73.3 in epitope split. Notably, catELMo significantly outperformed such performance with an AUC of 96.04 in TCR split and 94.10 in epitope split.

      Regarding the embedding method in ATM-TCR, it simply uses BLOSUM as an embedding matrix which we have already compared in Section 2.1. Furthermore, we have provided the comparison results between our prediction model trained on catELMo embeddings with the state-of-the-art prediction models such as netTCR and ATM-TCR in Table 6 of the Discussion section.

      (2) The TCR training data for catELMo is obtained from ImmunoSEQ platform, including SARS-CoV2, EBV, CMV, and other disease samples. Meanwhile, antigens related to these diseases and their associated TCRs are extensively annotated in databases VDJdb, IEDB and McPAS-TCR. The authors then utilized the curated TCR-epitope pairs from these databases to conduct the evaluations for eptitope binding prediction and TCR clustering. Therefore, the training data for TCR embedding may already be implicitly tuned for better representations of the TCRs used in the evaluations. This seems to be true based on Table 4, as BERT-Base-TCR outperformed TCRBert. Could catELMo be trained on PIRD as TCRBert to demonstrate catELMo's embedding for TCRs targeting unseen diseases/epitopes?

      We would like to note that catELMo was trained exclusively on TCR sequences in an unsupervised manner, which means it has never been exposed to antigen information. We also ensured that the TCRs used in catELMo's training did not overlap with our downstream prediction data. Please refer to the section 4.1 Data where we explicitly stated, “We note that it includes no identical TCR sequences with the TCRs used for training the embedding models.”. Moreover, the performance gap (~1%) between BERT-Base-TCR and TCRBert, as observed in Table 4, is relatively small, especially when compared to the performance difference (>16%) between catELMo and TCRBert.

      To further address this concern, we conducted experiments using the same number of TCRs, 4,173,895 in total, sourced exclusively from healthy ImmunoSeq repertoires. This alternative catELMo model demonstrated a similar prediction performance (based on 10 trials) to the one reported in our paper, with an average AUC of 96.35% in TCR split and an average AUC of 94.03% in epitope split.

      We opted not to train catELMo on the PIRD dataset for several reasons. First, approximately 7.8% of the sequences in PIRD also appear in our downstream prediction data, which could be a potential source of bias. Furthermore, PIRD encompasses sequences related to diseases such as Tuberculosis, HIV, CMV, among others, which the reviewer is concerned about.

      (3) In the application of TCR-epitope binding prediction, the authors mentioned that the model for embedding epitope sequences was catElMo, but how about for other methods, such as TCRBert? Do the other methods also use catELMo-embedded epitope sequences as part of the binding prediction model, or use their own model to embed the epitope sequences? Since the manuscript focuses on TCR embedding, it would be nice for other methods to be evaluated on the same epitope embedding (maybe adjusted to the same embedded vector length).

      Furthermore, the authors found that catELMo requires less training data to achieve better performance. So one would think the other methods could not learn a reasonable epitope embedding with limited epitope data, and catELMo's better performance in binding prediction is mainly due to better epitope representation.

      Review 1 and 3 have raised similar concerns regarding the epitope embedding approach employed in our binding affinity prediction models. We address both comments together on page 6 where we discuss the epitope embedding strategies in detail.

      (4) In the epitope binding prediction evaluation, the authors generated the test data using TCR-epitope pairs from VDJdb, IEDB, McPAS, which may be dominated by epitopes from CMV. Could the authors show accuracy categorized by epitope types, i.e. the accuracy for TCR-CMV pair and accuracy for TCR-SARs-CoV2 separately?

      The categorized AUC scores have been added in Supplementary Table 7. We observed significant performance boosts from catELMo compared with other embedding models.

      (5) In the unsupervised TCR clustering evaluation, since GIANA and TCRdist direct outputs the clustering result, so they should not be affected by hierarchical clusters. Why did the curves of GIANA and TCRdist change in Figure 4 when relaxing the hierarchical clustering threshold?

      For fair comparisons, we performed GIANA and TCRdist with hierarchical clustering instead of the nearest neighbor search. We have clarified it in the revised manuscript as follows.

      “Both methods are developed on the BLOSUM62 matrix and apply nearest neighbor search to cluster TCR sequences. GIANA used the CDR3 of TCRβ chain and V gene, while TCRdist predominantly experimented with CDR1, CDR2, and CDR3 from both TCRα and TCRβ chains. For fair comparisons, we perform GIANA and TCRdist only on CDR3 β chains and with hierarchical clustering instead of the nearest neighbor search.”

      (6 & 7) In the unsupervised TCR clustering evaluation, the authors examined the TCR related to the top eight epitopes. However, there are much more epitopes curated in VDJdb, IEDB and McPAS-TCR. In real application, the potential epitopes is also more complex than just eight epitopes. Could the authors evaluate the clustering result using all the TCR data from the databases? In addition to NMI, it is important to know how specific each TCR cluster is. Could the authors add the fraction of pure clusters in the results? Pure cluster means all the TCRs in the cluster are binding to the same epitope, and is a metric used in the method GIANA.

      We would like to note that there is a significant disparity in TCR binding frequencies across different epitopes in current databases. For instance, the most abundant epitope (KLGGALQAK) has approximately 13k TCRs binding to it, while 836 out of 982 epitopes are associated with fewer than 100 TCRs in our dataset. Furthermore, there are 9347 TCRs having the ability to bind multiple epitopes. In order to robustly evaluate the clustering performance, we originally selected the top eight frequent epitopes from McPAS and removed TCRs binding multiple epitopes to create a more balanced dataset.

      We acknowledge that the real-world scenario is more complex than just eight epitopes. Therefore, we conducted clustering experiments using the top most abundant epitopes whose combined cognate TCRs make up at least 70% of TCRs across three databases (34 epitopes). This is illustrated in Supplementary Figure 5. Furthermore, we extended our analysis by clustering all TCRs after filtering out those that bind to multiple epitopes, resulting in 782 unique epitopes. We found that catELMo achieved the 3rd and 2nd best performance in NMI and Purity, respectively (see Table below). These are aligned with our previous observations of the eight epitopes.

      Author response table 1.

      Reviewer #2 (Public Review):

      In the manuscript, the authors highlighted the importance of T-cell receptor (TCR) analysis and the lack of amino acid embedding methods specific to this domain. The authors proposed a novel bi-directional context-aware amino acid embedding method, catELMo, adapted from ELMo (Embeddings from Language Models), specifically designed for TCR analysis. The model is trained on TCR sequences from seven projects in the ImmunoSEQ database, instead of the generic protein sequences. They assessed the effectiveness of the proposed method in both TCR-epitope binding affinity prediction, a supervised task, and the unsupervised TCR clustering task. The results demonstrate significant performance improvements compared to existing embedding models. The authors also aimed to provide and discuss their observations on embedding model design for TCR analysis: 1) Models specifically trained on TCR sequences have better performance than models trained on general protein sequences for the TCR-related tasks; and 2) The proposed ELMo-based method outperforms TCR embedding models with BERT-based architecture. The authors also provided a comprehensive introduction and investigation of existing amino acid embedding methods. Overall, the paper is well-written and well-organized.

      The work has originality and has potential prospects for immune response analysis and immunotherapy exploration. TCR-epitope pair binding plays a significant role in T cell regulation. Accurate prediction and analysis of TCR sequences are crucial for comprehending the biological foundations of binding mechanisms and advancing immunotherapy approaches. The proposed embedding method presents an efficient context-aware mathematical representation for TCR sequences, enabling the capture and analysis of their structural and functional characteristics. This method serves as a valuable tool for various downstream analyses and is essential for a wide range of applications. Thank you.

      Reviewer #3 (Public Review):

      Here, the authors trained catElMo, a new context-aware embedding model for TCRβ CDR3 amino acid sequences for TCR-epitope specificity and clustering tasks. This method benchmarked existing work in protein and TCR language models and investigated the role that model architecture plays in the prediction performance. The major strength of this paper is comprehensively evaluating common model architectures used, which is useful for practitioners in the field. However, some key details were missing to assess whether the benchmarking study is a fair comparison between different architectures. Major comments are as follows:

      • It is not clear why epitope sequences were also embedded using catELMo for the binding prediction task. Because catELMO is trained on TCRβ CDR3 sequences, it's not clear what benefit would come from this embedding. Were the other embedding models under comparison also applied to both the TCR and epitope sequences? It may be a fairer comparison if a single method is used to encode epitope sequence for all models under comparison, so that the performance reflects the quality of the TCR embedding only.

      In our study, we indeed used the same embedding model for both TCRs and epitopes in each prediction model, ensuring a consistent approach throughout.

      Recognizing the importance of evaluating the impact of epitope embeddings, we conducted experiments in which we used BLOSUM62 matrix to embed epitope sequences for all models. The results (Supplementary Table 5) are well aligned with the performance reported in our paper. This suggests that epitope embedding may not play as critical a role as TCR embedding in the prediction tasks. To further validate this point, we conducted two additional experiments.

      Firstly, we used catELMo to embed TCRs while employing randomly initialized embedding matrices with trainable parameters for epitope sequences. It yielded similar prediction performance as when catELMo was used for both TCR and epitope embedding (Supplementary Table 6). Secondly, we utilized BLOSUM62 to embed TCRs but employed catELMo for epitope sequence embedding, resulting in performance comparable to using BLOSUM62 for both TCRs and epitopes (Supplementary Table 4). These experiment results confirmed the limited impact of epitope embedding on downstream performance.

      We conjecture that these results may be attributed to the significant disparity in data scale between TCRs (~290k) and epitopes (less than 1k). Moreover, TCRs tend to exhibit high similarity, whereas epitopes display greater distinctiveness from one another. These features of TCRs require robust embeddings to facilitate effective separation and improve downstream performance, while epitope embedding primarily serves as a categorical encoding.

      We have included a detailed discussion of these findings in the revised manuscript to provide a comprehensive understanding of the role of epitope embeddings in TCR binding prediction.

      • The tSNE visualization in Figure 3 is helpful. It makes sense that the last hidden layer features separate well by binding labels for the better performing models. However, it would be useful to know if positive and negative TCRs for each epitope group also separate well in the original TCR embedding space. In other words, how much separation between these groups is due to the neural network vs just the embedding?

      It is important to note that we used the same downstream prediction model, a simple three-linear-layer network, for all the discussed embedding methods. We believe that the separation observed in the t-SNE visualization effectively reflects the ability of our embedding model. Also, we would like to mention that it can be hard to see a clear distinction between positive and negative TCRs in the original embedding space because embedding models were not trained on positive/negative labels. Please refer to the t-SNE of the original TCR embeddings below.

      Author response image 1.

      • To generate negative samples, the author randomly paired TCRs from healthy subjects to different epitopes. This could produce issues with false negatives if the epitopes used are common. Is there an estimate for how frequently there might be false negatives for those commonly occurring epitopes that most populations might also have been exposed to? Could there be a potential batch effect for the negative sampled TCR that confounds with the performance evaluation?

      Thank you for bringing this valid and interesting point up. Generating negative samples is non-trivial since only a limited number of non-binding TCR-pairs are publicly available and experimentally validating non-binding pairs is costly [1]. Standard practices for generating negative pairs are (1) paring epitopes with healthy TCRs [2, 3], and (2) randomly shuffling existing TCR-epitope pairs [4,5]. We used both approaches (the former included in the main results, and the latter in the discussion). In both scenarios, catELMo embeddings consistently demonstrated superior performance.

      We acknowledge the possibility of false negatives due to the finite-sized TCR database from which we randomly selected TCRs, however, we believe that the likelihood of such occurrences is low. Given the vast diversity of human TCR clonotypes, which can exceed 10^15[6], the chance of randomly selecting a TCR that specifically recognizes a target epitope is relatively small.

      In order to investigate the batch effect, we generated new negative pairs using different seeds and observed consistent prediction performance across these variations. However, we agree that there could still be a potential batch effect for the negative samples due to potential data bias.

      We have discussed the limitation of generative negative samples in the revised manuscript.

      • Most of the models being compared were trained on general proteins rather than TCR sequences. This makes their comparison to catELMO questionable since it's not clear if the improvement is due to the training data or architecture. The authors partially addressed this with BERT-based models in section 2.4. This concern would be more fully addressed if the authors also trained the Doc2vec model (Yang et al, Figure 2) on TCR sequences as baseline models instead of using the original models trained on general protein sequences. This would make clear the strength of context-aware embeddings if the performance is worse than catElmo and BERT.

      We agree it is important to distinguish between the effects of training data and architecture on model performance.

      In Section 2.4, as the reviewer mentioned, we compared catELMo with BERT-based models trained on the same TCR repertoire data, demonstrating that architecture plays a significant role in improving performance. Furthermore, in Section 2.5, we compared catELMo-shallow with SeqVec, which share the same architecture but were trained on different data, highlighting the importance of data on the model performance.

      To further address the reviewer's concern, we trained a Doc2Vec model on the TCR sequences that have been used for catELMo training. We observed significantly lower prediction performance compared to catELMo, with an average AUC of 50.24% in TCR split and an average AUC of 51.02% in epitope split, making the strength of context-aware embeddings clear.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) It is known that TRB CDR3, the CDR1, CDR2 on TRBV gene and the TCR alpha chain also contribute to epitope recognition, but were not modeled in catELMo. It would be nice for the authors to add this as a current limitation for catELMo in the Discussion section.

      We have discussed the limitation in the revised manuscript.

      “Our study focuses on modeling the TCRβ chain CDR3 region, which is known as the primary determinant of epitope binding. Other regions, such as CDR1 and CDR2 on the TRB V gene, along with the TCRα chain, may also contribute to specificity in antigen recognition. However, a limited number of available samples for those additional features can be a challenge for training embedding models. Future work may explore strategies to incorporate these regions while mitigating the challenges of working with limited samples.”

      (2) I tried to follow the instructions to train a binding affinity prediction model for TCR-epitope pairs, however, the cachetools=5.3.0 seems could not be found when running "pip install -r requirements.txt" in the conda environment bap. Is this cachetools version supported after Python 3.7 so the Python 3.6.13 suggested on the GitHub repo might not work?

      This has been fixed. We have updated the README.md on our github page.

      Reviewer #2 (Recommendations For The Authors):

      The article is well-constructed and well-written, and the analysis is comprehensive.

      The comments for minor issues that I have are as follows:

      (1) In the Methods section, it will be clearer if the authors interpret more on how the standard deviation is calculated in all tables. How to define the '10 trials'? Are they based on different random training and test set splits?

      ‘10 trials' refers to the process of splitting the dataset into training, validation, and testing sets using different seeds for each trial. Different trials have different training, validation, and testing sets. For each trial, we trained a prediction model on its training set and measured performance on its testing set. The standard deviation was calculated from the 10 measurements, estimating model performance variation across different random splits of the data.

      (2) The format of AUCs and the improvement of AUCs need to be consistent, i.e., with the percent sign.

      We have updated the format of AUCs.

      Reviewer #3 (Recommendations For The Authors):

      In addition to the recommendations in the public review, we had the following more minor questions and recommendations:

      • Could you provide some more background on the data, such as overlaps between the databases, and how the training and validation split was performed between the three databases? Also summary statistics on the length of TCR and epitope sequence data would be helpful.

      We have provided more details about data in our revision.

      • Could you comment on the runtime to train and embed using the catELMo and BERT models?

      Our training data is TCR sequences with relatively short lengths (averaging less than 20 amino acid residues). Such characteristic significantly reduces the computational resources required compared to training large-scale language models on extensive text corpora. Leveraging standard machines equipped with two GeForce RTX 2080 GPUs, we were able to complete the training tasks within a matter of days. After training, embedding one sequence can be accomplished in a matter of seconds.

      • Typos and wording:

      • Table 1 first row of "source": "immunoSEQ" instead of "immuneSEQ"

      This has been corrected.

      • L23 of abstract "negates the need of complex deep neural network architecture" is a little confusing because ELMo itself is a deep neural network architecture. Perhaps be more specific and add that the need is for downstream tasks.

      We have made it more specific in our abstract.

      “...negates the need for complex deep neural network architecture in downstream tasks.”

      References

      (1) Montemurro, Alessandro, et al. "NetTCR-2.0 enables accurate prediction of TCR-peptide binding by using paired TCRα and β sequence data." Communications biology 4.1 (2021): 1060.

      (2) Jurtz, Vanessa Isabell, et al. "NetTCR: sequence-based prediction of TCR binding to peptide-MHC complexes using convolutional neural networks." BioRxiv (2018): 433706.

      (3) Gielis, Sofie, et al. "Detection of enriched T cell epitope specificity in full T cell receptor sequence repertoires." Frontiers in immunology 10 (2019): 2820.

      (4) Cai, Michael, et al. "ATM-TCR: TCR-epitope binding affinity prediction using a multi-head self-attention model." Frontiers in Immunology 13 (2022): 893247.

      (5) Weber, Anna, et al. "TITAN: T-cell receptor specificity prediction with bimodal attention networks." Bioinformatics 37 (2021): i237-i244.

      (6) Lythe, Grant, et al. "How many TCR clonotypes does a body maintain?." Journal of theoretical biology 389 (2016): 214-224.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors seek to establish what aspects of nervous system structure and function may explain behavioral differences across individual fruit flies. The behavior in question is a preference for one odor or another in a choice assay. The variables related to neural function are odor responses in olfactory receptor neurons or in the second-order projection neurons, measured via calcium imaging. A different variable related to neural structure is the density of a presynaptic protein BRP. The authors measure these variables in the same fly along with the behavioral bias in the odor assays. Then they look for correlations across flies between the structure-function data and the behavior.

      Strengths:

      Where behavioral biases originate is a question of fundamental interest in the field. In an earlier paper (Honegger 2019) this group showed that flies do vary with regard to odor preference, and that there exists neural variation in olfactory circuits, but did not connect the two in the same animal. Here they do, which is a categorical advance, and opens the door to establishing a correlation. The authors inspect many such possible correlations. The underlying experiments reflect a great deal of work, and appear to be done carefully. The reporting is clear and transparent: All the data underlying the conclusions are shown, and associated code is available online.

      We are glad to hear the reviewer is supportive of the general question and approach.

      Weaknesses:

      The results are overstated. The correlations reported here are uniformly small, and don't inspire confidence that there is any causal connection. The main problems are

      Our revision overhauls the interpretation of the results to prioritize the results we have high confidence in (specifically, PC 2 of our Ca++ data as a predictor of OCT-MCH preference) versus results that are suggestive but not definitive (such as PC 1 of Ca++ data as a predictor of Air-OCT preference).

      It’s true that the correlations are small, with R2 values typically in the 0.1-0.2 range. That said, we would call it a victory if we could explain 10 to 20% of the variance of a behavior measure, captured in a 3 minute experiment, with a circuit correlate. This is particularly true because, as the reviewer notes, the behavioral measurement is noisy.

      (1) The target effect to be explained is itself very weak. Odor preference of a given fly varies considerably across time. The systematic bias distinguishing one fly from another is small compared to the variability. Because the neural measurements are by necessity separated in time from the behavior, this noise places serious limits on any correlation between the two.

      This is broadly correct, though to quibble, it’s our measurement of odor preference which varies considerably over time. We are reasonably confident that more variance in our measurements can be attributed to sampling error than changes to true preference over time. As evidence, the correlation in sequential measures of individual odor preference, with delays of 3 hours or 24 hours, are not obviously different. We are separately working on methodological improvements to get more precise estimates of persistent individual odor preference, using averages of multiple, spaced measurements. This is promising, but beyond the scope of this study.

      (2) The correlations reported here are uniformly weak and not robust. In several of the key figures, the elimination of one or two outlier flies completely abolishes the relationship. The confidence bounds on the claimed correlations are very broad. These uncertainties propagate to undermine the eventual claims for a correspondence between neural and behavioral measures.

      We are broadly receptive to this criticism. The lack of robustness of some results comes from the fundamental challenge of this work: measuring behavior is noisy at the individual level. Measuring Ca++ is also somewhat noisy. Correlating the two will be underpowered unless the sample size is huge (which is impractical, as each data point requires a dissection and live imaging session) or the effect size is large (which is generally not the case in biology). In the current version we tried in some sense to avoid discussing these challenges head-on, instead trying to focus on what we thought were the conclusions justified by our experiments with sample sizes ranging from 20 to 60. Our revision is more candid about these challenges.

      That said, we believe the result we view as the most exciting — that PC2 of Ca++ responses predicts OCT-MCH preference — is robust. 1) It is based on a training set with 47 individuals and a test set composed of 22 individuals. The p-value is sufficiently low in each of these sets (0.0063 and 0.0069, respectively) to pass an overly stringent Bonferroni correction for the 5 tests (each PC) in this analysis. 2) The BRP immunohistochemistry provides independent evidence that is consistent with this result — PC2 that predicts behavior (p = 0.03 from only one test) and has loadings that contrast DC2 and DM2. Taken together, these results are well above the field-standard bar of statistical robustness.

      In our revision, we are explicit that this is the (one) result we have high confidence in. We believe this result convincingly links Ca++ and behavior, and warrants spotlighting. We have less confidence in other results, and say so, and we hope this addresses concerns about overstating our results.

      (3) Some aspects of the statistical treatment are unusual. Typically a model is proposed for the relationship between neuronal signals and behavior, and the model predictions are correlated with the actual behavioral data. The normal practice is to train the model on part of the data and test it on another part. But here the training set at times includes the testing set, which tends to give high correlations from overfitting. Other times the testing set gives much higher correlations than the training set, and then the results from the testing set are reported. Where the authors explored many possible relationships, it is unclear whether the significance tests account for the many tested hypotheses. The main text quotes the key results without confidence limits.

      Our primary analyses are exactly what the reviewer describes, scatter plots and correlations of actual behavioral measures against predicted measures. We produced test data in separate experiments, conducted weeks to months after models were fit on training data. This is more rigorous than splitting into training and test sets data collected in a single session, as batch/environmental effects reduce the independence of data collected within a single session.

      We only collected a test set when our training set produced a promising correlation between predicted and actual behavioral measures. We never used data from test sets to train models. In our main figures, we showed scatter plots that combined test and training data, as the training and test partitions had similar correlations.

      We are unsure what the reviewer means by instances where we explored many possible relationships. The greatest number of comparisons that could lead to the rejection of a null hypothesis was 5 (corresponding to the top 5 PCs of Ca++ response variation or Brp signal). We were explicit that the p-values reported were nominal. As mentioned above, applying a Bonferroni correction for n=5 comparisons to either the training or test correlations from the Ca++ to OCT-MCH preference model remains significant at alpha=0.05.

      Our revision includes confidence intervals around ⍴signal for the PN PC2 OCT-MCH model, and for the ORN Brp-Short PC2 OCT-MCH model (lines 170-172, 238)

      Reviewer #2 (Public Review):

      Summary:

      The authors aimed to identify the neural sources of behavioral variation in a decision between odor and air, or between two odors.

      Strengths:

      -The question is of fundamental importance.

      -The behavioral studies are automated, and high-throughput.

      -The data analyses are sophisticated and appropriate.

      -The paper is clear and well-written aside from some strong wording.

      -The figures beautifully illustrate their results.

      -The modeling efforts mechanistically ground observed data correlations.

      We are glad to read that the reviewer sees these strengths in the study. We hope the current revision addresses the strong wording.

      Weaknesses:

      -The correlations between behavioral variations and neural activity/synapse morphology are (i) relatively weak, (ii) framed using the inappropriate words "predict", "link", and "explain", and (iii) sometimes non-intuitive (e.g., PC 1 of neural activity).

      Taking each of these points in turn:

      i) It would indeed be nicer if our empirical correlations are higher. One quibble: we primarily report relatively weak correlations between measurements of behavior and Ca++/Brp. This could be the case even when the correlation between true behavior and Ca++/Brp is higher. Our analysis of the potential correlation between latent behavioral and Ca++ signals was an attempt to tease these relationships apart. The analysis suggests that there could, in fact, be a high underlying correlation between behavior and these circuit features (though the error bars on these inferences are wide).

      ii) We worked to ensure such words are used appropriately. “Predict” can often be appropriate in this context, as a model predicts true data values. Explain can also be appropriate, as X “explaining” a portion of the variance of Y is synonymous with X and Y being correlated. We cannot think of formal uses of “link,” and have revised the manuscript to resolve any inappropriate word choice.

      iii) If the underlying biology is rooted in non-intuitive relationships, there’s unfortunately not much we can do about it. We chose to use PCs of our Ca++/Brp data as predictors to deal with the challenge of having many potential predictors (odor-glomerular responses) and relatively few output variables (behavioral bias). Thus, using PCs is a conservative approach to deal with multiple comparisons. Because PCs are just linear transformations of the original data, interpreting them is relatively easy, and in interpreting PC1 and PC2, we were able to identify simple interpretations (total activity and the difference between DC2 and DM2 activation, respectively). All in all, we remain satisfied with this approach as a means to both 1) limit multiple comparisons and 2) interpret simple meanings from predictive PCs.

      No attempts were made to perturb the relevant circuits to establish a causal relationship between behavioral variations and functional/morphological variations.

      We did conduct such experiments, but we did not report them because they had negative results that we could not definitively interpret. We used constitutive and inducible effectors to alter the physiology of ORNs projecting to DC2 and DM2. We also used UAS-LRP4 and UAS-LRP4-RNAi to attempt to increase and decrease the extent of Brp puncta in ORNs projecting to DC2 and DM2. None of these manipulations had a significant effect on mean odor preference in the OCT-MCH choice, which was the behavioral focus of these experiments. We were unable to determine if the effectors had the intended effects in the targeted Gal4 lines, particularly in the LRP experiments, so we could not rule out that our negative finding reflected a technical failure.

      Author response image 1.

      We believe that even if these negative results are not technical failures, they are not necessarily inconsistent with the analyses correlating features of DC2 and DM2 to behavior. Specifically, we suspect that there are correlated fluctuations in glomerular Ca++ responses and Brp across individuals, due to fluctuations in the developmental spatial patterning of the antennal lobe. Thus, the DC2-DM2 predictor may represent a slice/subset of predictors distributed across the antennal lobe. This would also explain how we “got lucky” to find two glomeruli as predictors of behavior, when we were only able to image a small portion of the glomeruli.

      Reviewer #3 (Public Review):

      Churgin et. al. seeks to understand the neural substrates of individual odor preference in the Drosophila antennal lobe, using paired behavioral testing and calcium imaging from ORNs and PNs in the same flies, and testing whether ORN and PN odor responses can predict behavioral preference. The manuscript's main claims are that ORN activity in response to a panel of odors is predictive of the individual's preference for 3-octanol (3-OCT) relative to clean air, and that activity in the projection neurons is predictive of both 3-OCT vs. air preference and 3-OCT vs. 4-methylcyclohexanol (MCH). They find that the difference in density of fluorescently-tagged brp (a presynaptic marker) in two glomeruli (DC2 and DM2) trends towards predicting behavioral preference between 3-oct vs. MCH. Implementing a model of the antennal lobe based on the available connectome data, they find that glomerulus-level variation in response reminiscent of the variation that they observe can be generated by resampling variables associated with the glomeruli, such as ORN identity and glomerular synapse density.

      Strengths:

      The authors investigate a highly significant and impactful problem of interest to all experimental biologists, nearly all of whom must often conduct their measurements in many different individuals and so have a vested interest in understanding this problem. The manuscript represents a lot of work, with challenging paired behavioral and neural measurements.

      Weaknesses:

      The overall impression is that the authors are attempting to explain complex, highly variable behavioral output with a comparatively limited set of neural measurements.

      We would say that we are attempting to explain a simple, highly variable behavioral measure with a comparatively limited set of neural measurements, i.e. we make no claims to explain the complex behavioral components of odor choice, like locomotion, reversals at the odor boundary, etc.

      Given the degree of behavioral variability they observe within an individual (Figure 1- supp 1) which implies temporal/state/measurement variation in behavior, it's unclear that their degree of sampling can resolve true individual variability (what they call "idiosyncrasy") in neural responses, given the additional temporal/state/measurement variation in neural responses.

      We are confident that different Ca++ recordings are statistically different. This is borne out in the analysis of repeated Ca++ recordings in this study, which finds that the significant PCs of Ca++ variation contain 77% of the variation in that data. That this variation is persistent over time and across hemispheres was assessed in Honegger & Smith, et al., 2019. We are thus confident that there is true individuality in neural responses (Note, we prefer not to call it “individual variability” as this could refer to variability within individuals, not variability across individuals.) It is a separate question of whether individual differences in neural responses bear some relation to individual differences in behavioral biases. That was the focus of this study, and our finding of a robust correlation between PC 2 of Ca++ responses and OCT-MCH preference indicates a relation. Because behavior and Ca++ were collected with an hours-to-day long gap, this implies that there are latent versions of both behavioral bias and Ca++ response that are stable on timescales at least that long.

      The statistical analyses in the manuscript are underdeveloped, and it's unclear the degree to which the correlations reported have explanatory (causative) power in accounting for organismal behavior.

      With respect, we do not think our statistical analyses are underdeveloped, though we acknowledge that the detailed reviewer suggestions included the helpful suggestion to include uncertainty in the estimation of confidence intervals around the point estimate of the strength of correlation between latent behavioral and Ca++ response states – we have added these for the PN PC2 linear model (lines 170-172).

      It is indeed a separate question whether the correlations we observed represent causal links from Ca++ to behavior (though our yoked experiment suggests there is not a behavior-to-Ca++ causal relationship — at least one where odor experience through behavior is an upstream cause). We attempted to be precise in indicating that our observations are correlations. That is why we used that word in the title, as an example. In the revision, we worked to ensure this is appropriately reflected in all word choice across the paper.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the Authors):

      Detailed comments: Many of the problems can be identified starting from Figure 4, which summarizes the main claims. I will focus on that figure and its tributaries.

      Acknowledging that the strength of several of our inferences are weak compared to what we consider the main result (the relationship between PC2 of Ca++ and OCT-MCH preference),we have removed Figure 4. This makes the focus of the paper much clearer and appropriately puts focus on the results that have strong statistical support.

      (1) The process of "inferring" correlation among the unobserved latent states for neural sensitivity and behavioral bias is unconventional and risky. The larger the assumed noise linking the latent to the observed variables (i.e. the smaller r_b and r_c) the bigger the inferred correlation rho from a given observed correlation R^2_cb. In this situation, the value of the inferred rho becomes highly dependent on what model one assumes that links latent to observed states. But the specific model drawn in Fig 4 suppl 1 is just one of many possible guesses. For example, models with nonlinear interactions could produce different inference.

      We agree with the reviewer’s notes of caution. To be clear, we do not intend for this analysis to be the main takeaway of the paper and have revised it to make this clear. The signal we are most confident in is the simple correlation between measured Ca++ PC2 and measured behavior. We have added more careful language saying that the attempt to infer the correlation between latent signals is one attempt at describing the data generation process (lines 166-172), and one possible estimate of an “underlying” correlation.

      (2) If one still wanted to go through with this inference process and set confidence bounds on rho, one needs to include all the uncertainties. Here the authors only include uncertainty in the value of R^2_c,b and they peg that at +/-20% (Line 1367). In addition there is plenty of uncertainty associated also with R^2_c,c and R^2_b,b. This will propagate into a wider confidence interval on rho.

      We have replaced the arbitrary +/- 20% window with bootstrapping the pairs of (predicted preference by PN PC2, measured preference) points and getting a bootstrap distribution of R2c,b, which is, not surprisingly, considerably wider. Still, we think there is some value in this analysis as the 90% CI of 𝜌signal under this model is 0.24-0.95. That is, including uncertainty about the R2b,b and R2c,c in the model still implies a significant relationship between latent calcium and behavior signals.

      (2.1) The uncertainty in R^2_cb is much greater than +/-20%. Take for example the highest correlation quoted in Fig 4: R^2=0.23 in the top row of panel A. This relationship refers to Fig 1L. Based on bootstrapping from this data set, I find a 90% confidence interval of CI=[0.002, 0.527]. That's an uncertainty of -100/+140%, not +/-20%. Moreover, this correlation is due entirely to the lone outlier on the bottom left. Removing that single fly abolishes any correlation in the data (R^2=0.04, p>0.3). With that the correlation of rho=0.64, the second-largest effect in Fig 4, disappears.

      We acknowledge that removal of the outlier in Fig 1L abolishes the correlation between predicted and measured OCT-AIR preference. We have thus moved that subfigure to the supplement (now Figure 1 – figure supplement 10B), note that we do not have robust statistical support of ORN PC1 predicting OCT-AIR preference in the results (lines 177-178), and place our emphasis on PN PC2’s capacity to predict OCT-MCH preference throughout the text.

      (2.2) Similarly with the bottom line of Fig 4A, which relies on Fig 1M. With the data as plotted, the confidence interval on R^2 is CI=[0.007, 0.201], again an uncertainty of -100/+140%. There are two clear outlier points, and if one removes those, the correlation disappears entirely (R^2=0.06, p=0.09).

      We acknowledge that removal of the two outliers in Fig 1M between predicted and measured OCT-AIR preference abolishes the correlation. We have also moved that subfigure to the supplement (now Figure 1 – figure supplement 10F) and do not claim to have robust statistical support of PN PC1 predicting OCT-AIR preference.

      (2.3) Similarly, the correlation R^2_bb of behavior with itself is weak and comes with great uncertainty (Fig 1 Suppl 1, panels B-E). For example, panel D figures prominently in computing the large inferred correlation of 0.75 between PN responses and OCT-MCH choice (Line 171ff). That correlation is weak and has a very wide confidence interval CI=[0.018, 0.329]. This uncertainty about R^2_bb should be taken into account when computing the likelihood of rho.

      We now include bootstrapping of the 3 hour OCT-MCH persistence data in our inference of 𝜌signal.

      (2.4) The correlation R^2_cc for the empirical repeatability of Ca signals seems to be obtained by a different method. Fig 4 suppl 1 focuses on the repeatability of calcium recording at two different time points. But Line 625ff suggests the correlation R^2_cc=0.77 all derives from one time point. It is unclear how these are related.

      Because our calcium model predictors utilize principal components of the glomerulus-odor responses (the mean Δf/f in the odor presentation window), we compute R2c,c through adding variance explained along the PCs, up to the point in which the component-wise variance explained does not exceed that of shuffled data (lines 609-620 in Materials and Methods). In this revision we now bootstrap the calcium data on the level of individual flies to get a bootstrap distribution of R2c,c, and propagate the uncertainty forward in the inference of 𝜌signal.

      (2.5) To summarize, two of the key relationships in Fig 1 are due entirely to one or two outlier points. These should not even be used for further analysis, yet they underlie two of the claims in Fig 4. The other correlations are weak, and come with great uncertainty, as confirmed by resampling. Those uncertainties should be propagated through the inference procedure described in Fig 4. It seems possible that the result will be entirely uninformative, leaving rho with a confidence interval that spans the entire available range [0,1]. Until that analysis is done, the claims of neuron-to-behavior correlation in this manuscript are not convincing.

      It is important to note that we never thought our analysis of the relationship between latent behavior and calcium signals should be interpreted as the main finding. Instead, the observed correlation between measured behavior and calcium is the take-away result. Importantly, it is also conservative compared to the inferred latent relationship, which in our minds was always a “bonus” analysis. Our revisions are now focused on highlighting the correlations between measured signals that have strong statistical support.

      As a response to these specific concerns, we have propagated uncertainty in all R2’s (calcium-calcium, behavior-behavior, calcium-behavior) in our new inference for 𝜌signal, yielding a new median estimate for PN PC 2 underlying OCT-MCH preference of 0.68, with a 90% CI of 0.24-0.95. (Lines 171-172 in results, Inference of correlation between latent calcium and behavior states section in Materials and Methods).

      (3) Other statistical methods:

      (3.1) The caption of Fig 4 refers to "model applied to train+test data". Does that mean the training data were included in the correlation measurement? Depending on the number of degrees of freedom in the model, this could have led to overfitting.

      We have removed Figure 4 and emphasize the key results in Figure 1 and 2 that we see statistically robust signal of PN PC 2 explaining OCT-MCH preference variation in both a training set and a testing set of flies (Fig 2 – figure supplement 1C-D).

      (3.2) Line 180 describes a model that performed twice as well on test data (31% EV) as it did on training data (15%). What would explain such an outcome? And how does that affect one's confidence in the 31% number?

      The test set recordings were conducted several weeks after the training set recordings, which were used to establish PN PC 2 as a correlate of OCT-MCH preference. The fact that the test data had a higher R2 likely reflects sampling error (these two correlation coefficients are not significantly different). Ultimately this gives us more confidence in our model, as the predictive capacity is maintained in a totally separate set of flies.

      (3.340 Multiple models get compared in performance before settling on one. For example, sometimes the first PC is used, sometimes the second. Different weighting schemes appear in Fig 2. Do the quoted p-values for the correlation plots reflect a correction for multiple hypothesis testing?

      For all calcium-behavior models, we restricted our analysis to 5 PCs, as the proportion of calcium variance explained by each of these PCs was higher than that explained by the respective PC of shuffled data — i.e., there were at most five significant PCs in that data. We thus performed at most 5 hypothesis tests for a given model. PN PC 2 explained 15% of OCT-MCH preference variation, with a p-value of 0.0063 – this p-value is robust to a conservative Bonferroni correction to the 5 hypotheses considered at alpha=0.05.

      The weight schemes in Figure 2 and Figure 1 – figure supplement 10 reflect our interpretations of the salient features of the PCs and are follow-up analysis of the single principal component hypothesis tests. Thus they do not constitute additional tests that should be corrected. We now state in the methods explicitly that all reported p-values are nominal (line 563).

      (3.4) Line 165 ff: Quoting rho without giving the confidence interval is misleading. For example, the rho for the presynaptic density model is quoted as 0.51, which would be a sizeable correlation. But in fact, the posterior on rho is almost flat, see caption of Fig 4 suppl 1, which lists the CI as [0.11, 0.85]. That means the experiments place virtually no constraint on rho. If the authors had taken no data at all, the posterior on rho would be uniform, and give a median of 0.5.

      We now provide a confidence interval around 𝜌signal for the PN PC 2 model (lines 170-172). But per above, and consistent with the new focus of this revision, we view the 𝜌signal inference as secondary to the simple, significant correlation between PN PC 2 and OCT-MCH preference.

      (4) As it stands now, this paper illustrates how difficult it is to come to a strong conclusion in this domain. This may be worth some discussion. This group is probably in a better position than any to identify what are the limiting factors for this kind of research.

      We thank the reviewer for this suggestion and have added discussion of the difficulties in detecting signals for this kind of problem. That said, we are confident in stating that there is a meaningful correlation between PC 2 of PN Ca++ responses and OCT-MCH behavior given our model’s performance in predicting preference in a test set of flies, and in the consistent signal in ORN Bruchpilot.

      Reviewer #3 (Recommendations for the Authors):

      Two major concerns, one experimental/technical and one conceptual:

      (1) I appreciate the difficulty of the experimental design and problem. However, the correlations reported throughout are based on neural measurements in only 5 glomeruli (~10% of the olfactory system) at early stages of olfactory processing.

      We acknowledge that only imaging 5 glomeruli is regrettable. We worked hard to develop image analysis pipelines that could reliably segment as many glomeruli as possible from almost all individual flies. In the end, we concluded that it was better to focus our analysis on a (small) core set of glomeruli for which we had high confidence in the segmentation. Increasing the number of analyzed glomeruli is high on the list of improvements for subsequent studies. Happily, we are confident that we are capturing a significant, biologically meaningful correlation between PC 2 of PN calcium (dominated by the responses in DC2 and DM2) and OCT-MCH preference.

      3-OCT and MCH activate many glomeruli in addition to the five studied, especially at the concentrations used. There is also limited odor-specificity in their response matrix: notably responses are more correlated in all glomeruli within an individual, compared to responses across individuals (they note this in lines 194-198, though I don't quite understand the specific point they make here). This is a sign of high experimental variability (typically the dynamic range of odor response within an individual is similar to the range across individuals) and makes it even more difficult to resolve underlying individual variation.

      We respectfully disagree with the reviewer’s interpretation here. There is substantial odor-specificity in our response matrix. This is evident in both the ORN and PN response matrices (and especially the PN matrix) as variation in the brightness across rows. Columns, which correspond to individuals, are more similar than rows, which correspond to odor-glomerulus pairs. The dynamic range within an individual (within a column, across rows) is indeed greater than the variation among individuals (within a row, across columns).

      As an (important) aside, the odor stimuli are very unusual in this study. Odors are delivered at extremely high concentrations (variably 10-25% sv, line 464, not exactly sure what "variably' means- is the stimulus intensity not constant?) as compared to even the highest concentrations used in >95% of other studies (usually <~0.1% sv delivered).

      We used these concentrations for a variety of reasons. First, following the protocol of Honegger and Smith (2020), we found that dilutions in this range produce a linear input-output relationship, i.e. doubling or halving one odorant yields proportionate changes in odor-choice behavior metrics. Second, such fold dilutions are standard for tunnel assays of the kind we used. Claridge-Chang et al. (2009) used 14% and 11% for MCH and OCT respectively, for instance. Finally, the specific dilution factor (i.e., within the range of 10-25%) was adjusted on a week-by-week basis to ensure that in an OCT-MCH choice, the mean preference was approximately 50%. This yields the greatest signal of individual odor preference. We have added this last point to the methods section where the range of dilutions is described (lines 442-445).

      A parsimonious interpretation of their results is that the strongest correlation they see (ORN PC1 predicts OCT v. air preference) arises because intensity/strength of ORN responses across all odors (e.g. overall excitability of ORNs) partially predicts behavioral avoidance of 3-OCT. However, the degree to which variation in odor-specific glomerular activation patterns can explain behavioral preference (3-OCT v. MCH) seems much less clear, and correspondingly the correlations are weaker and p-values larger for the 3-OCT v. MCH result.

      With respect, we disagree with this analysis. The correlation between ORN PC 1 and OCT v. air preference (R2 \= 0.23) is quite similar to that of PN PC 2 and OCT vs MCH preference (R2 \= 0.20). However, the former is dependent on a single outlying point, whereas the latter is not. The latter relationship is also backed up by the BRP imaging and modeling. Therefore in the revision we have de-emphasized the OCT v. air preference model and emphasized the OCT v. MCH preference models.

      (2) There is a broader conceptual concern about the degree of logical consistency in the authors' interpretation of how neural variability maps to behavioral variability. For instance, the two odors they focus on, 3-OCT and MCH, barely activate ORNs in 4 of the 5 glomeruli they study. Most of the correlation of ORN PC1 vs. behavioral choice for 3-OCT vs. air, then, must be driven by overall glomerular activation by other odors (but remains predictive since responses across odors appear correlated within an individual). This gives pause to the interpretation that 3-OCT-evoked ORN activity in these five glomeruli is the neural substrate for variability in the behavioral response to 3-OCT.

      Our interpretation of the ORN PC1 linear model is not that 3-OCT-evoked ORN activity is the neural substrate for variability – instead, it is the general responsiveness of an individual’s AL across multiple odors (this is our interpretation of the the uniformly positive loadings in ORN PC1). It is true that OCT and MCH do not activate ORNs as strongly as other odorants – our analysis rests on the loadings of the PCs that capture all odor/glomerulus combinations available in our data. All that said, since a single outlier in Figure 1L dominates the relationship, therefore we have de-emphasized these particular results in our revision.

      This leads to the most significant concern, which is that the paper does not provide strong evidence that odor-specific patterns of glomerular activation in ORNs and PNs underlie individual behavioral preference between different odors (that each drive significant levels of activity, e.g. 3-OCT v. MCH), or that the ORN-PN synapse is a major driver of individual behavioral variability. Lines 26-31 of the abstract are not well supported, and the language should be softened.

      We have modified the abstract to emphasize our confidence in PN calcium correlating with odor-vs-odor preference (removing the ORN & odor-vs-air language).

      Their conclusions come primarily from having correlated many parameters reduced from the ORN and PN response matrices against the behavioral data. Several claims are made that a given PC is predictive of an odor preference while others are not, however it does not appear that the statistical tests to support this are shown in the figures or text.

      For each linear model of calcium dynamics predicting preference, we restricted our analysis to the first 5 principal components. Thus, we do not feel that we correlated many parameters against the behavioral data. As mentioned below, the correlations identified by this approach comfortably survive a conservative Bonferroni correction. In this revision, a linear model with a single predictor – the projection onto PC 2 of PN calcium – is the result we emphasize in the text, and we report R2 between measured and predicted preference for both a training set of flies and for a test set of flies (Figure 1M and Figure 2 – figure supplement 1).

      That is, it appears that the correlation of models based on each component is calculated, then the component with the highest correlation is selected, and a correlation and p-value computed based on that component alone, without a statistical comparison between the predictive values of each component, or to account for effectively performing multiple comparisons. (Figure 1, k l m n o p, Figure 3, d f, and associated analyses).

      To reiterate, this was our process: 1) Collect a training data set of paired Ca++ recordings and behavioral preference scores. 2) Compute the first five PCs of the Ca++ data, and measure the correlation of each to behavior. 3) Identify the PC with the best correlation. 4) Collect a test data set with new experimental recordings. 5) Apply the model identified in step 3. For some downstream analyses, we combined test and training data, but only after confirming the separate significance of the training and test correlations.

      The p-values associated with the PN PC 2 model predicting OCT-MCH preference are sufficiently low in each of the training and testing sets (0.0063 and 0.0069, respectively) to pass a conservative Bonferroni multiple hypothesis correction (one hypothesis for each of the 5 PCs) at an alpha of 0.05.

      Additionally, the statistical model presented in Figure 4 needs significantly more explanation or should be removed- it's unclear how they "infer" the correlation, and the conclusions appears inconsistent with Figure 3 - Figure Supplement 2.

      We have removed Figure 4 and have improved upon our approach of inferring the strength of the correlation between latent calcium and behavior in the Methods, incorporating bootstrapping of all sources of data used for the inference (lines 622-628). At the same time, we now emphasize that this analysis is a bonus of sorts, and that the simple correlation between Ca++ and behavior is the main result.

      Suggestions:

      (1) If the authors want to make the claim that individual variation in ORN or PN odor representations (e.g. glomerular activation patterns) underlie differences in odor preference (MCH v. OCT), they should generalize the weak correlation between ORN/PN activity and behavior to additional glomeruli and pair of odors, where both odors drive significant activity. Otherwise, the claims in the abstract should be tempered.

      We have modified the abstract to focus on the effect we have the highest confidence in: contrasting PN calcium activation of DM2 and DC2 predicting OCT-MCH preference.

      (2) One of the most valuable contributions a study like this could provide is to carefully quantify the amount of measurement variation (across trials, across hemispheres) in neural responses relative to the amount of individual variation (across individuals). Beyond the degree of variation in the amplitude of odor responses, the rank ordering of odor response strength between repeated measurements (to try to establish conditions that account for adaptation, etc.), between hemispheres, and between individuals is important. Establishing this information is foundational to this entire field of study. The authors take a good first step towards this in Figure 1J and Figure 1, supplement 5C, but the plots do not directly show variance, and the comparison is flawed because more comparisons go into the individual-individual crunch (as evidenced by the consistently smaller range of quartiles). The proper way to do this is by resampling.

      We do not know what the reviewer means by “individual-individual crunch,” unfortunately. Thus, it is difficult to determine why they think the analysis is flawed. We are also uncertain about the role of resampling in this analysis. The medians, interquartile ranges and whiskers in the panels referenced by the reviewer are not confidence intervals as might be determined by bootstrap resampling. Rather, these are direct statistics on the coding distances as measured – the raw values associated with these plots are visualized in Figure 1H.

      In our revision we updated the heatmaps in Figure 1 – figure supplement 3 to include recordings across the lobes and trials of each individual fly, and we have added a new supplementary figure, Figure 1 – figure supplement 4, to show the correspondence between recordings across lobes or trials, with associated rank-order correlation coefficients. Since the focus of this study was whether measured individual differences predict individual behavioral preference, a full characterization of the statistics of variation in calcium responses was not the focus, though it was the focus of a previous study (Honegger & Smith et al., 2019).

      To help the reader understand the data, we would encourage displaying data prior to dimensionality reduction - why not show direct plots of the mean and variance of the neural responses in each glomerulus across repeats, hemispheres, individuals?

      We added a new supplementary figure, Figure 1 – figure supplement 4, to show the correspondence between recordings across lobes or trials.

      A careful analysis of this point would allow the authors to support their currently unfounded assertion that odor responses become more "idiosyncratic" farther from the periphery (line 135-36); presumably they mean beyond just noise introduced by synaptic transmission, e.g. "idiosyncrasy" is reproducible within an individual. This is a strong statement that is not well-supported at present - it requires showing the degree of similarity in the representation between hemispheres is more similar within a fly than between flies in PNs compared to ORNs (see Hige... Turner, 2015).

      Here are the lines in question: “PN responses were more variable within flies, as measured across the left and right hemisphere ALs, compared to ORN responses (Figure 1 – figure supplement 5C), consistent with the hypothesis that odor representations become more idiosyncratic farther from the sensory periphery.”

      That responses are more idiosyncratic farther from the periphery is therefore not an “unfounded assertion.” It is clearly laid out as a hypothesis for which we can assess consistency in the data. We stand by our original interpretation: that several observations are consistent with this finding, including greater distance in coding space in PNs compared to ORNs, particularly across lobes and across flies. In addition, higher accuracy in decoding individual identity from PN responses compared to ORN responses (now appearing as Figure 1 – figure supplement 6A) is also consistent with this hypothesis.

      Still, to make confusion at this sentence less likely, we have reworded it as “suggesting that odor representations become more divergent farther from the sensory periphery.” (lines 139-140)

      (3) Figure 3 is difficult to interpret. Again, the variability of the measurement itself within and across individuals is not established up front. Expression of exogenous tagged brp in ORNs is also not guaranteed to reflect endogenous brp levels, so there is an additional assumption at that level.

      Figure 3 – figure supplement 1 Panels A-C display the variability of measurements (Brp volume, total fluorescence and fluorescence density) both within (left/right lobes) and across individuals (the different data points). We agree that exogenous tagged Brp levels will not be identical to endogenous levels. The relationship appears significant despite this caveat.

      Again there are statistical concerns with the correlations. For instance, the claim that "Higher Brp in DM2 predicted stronger MCH preference... " on line 389 is not statistically supported with p<0.05 in the ms (see Figure 3 G as the closest test, but even that is a test of the difference of DM2 and DC2, not DM2 alone).

      We have changed the language to focus on the pattern of the loadings in PC 2 of Brp-Short density and replaced “predict.” (lines 366-369).

      Can the authors also discuss what additional information is gained from the expansion microscopy in the figure supplement, and how it compares to brp density in DC2 using conventional methods?

      The expansion microscopy analysis was an attempt to determine what specific aspect of Brp expression was predictive of behavior, on the level of individual Brp puncta, as a finer look compared to the glomerulus-wide fluorescence signal in the conventional microscopy approach. Since this method did not yield a large sample size, at best we can say it provided evidence consistent with the observation from confocal imaging that Brp fluorescent density was the best measure in terms of predicting behavior.

      I would prefer to see the calcium and behavioral datasets strengthened to better establish the relationship between ORN/PN responses and behavior, and to set aside the anatomical dataset for a future work that investigates mechanisms.

      We are satisfied that our revisions put appropriate emphasis on a robust result relating calcium and behavior measurements: the relationship between OCT-MCH preference and idiosyncratic PN calcium responses. Finding that idiosyncratic Brp density has similar PC 2 loadings that also significantly predict behavior is an important finding that increases confidence in the calcium-behavior finding. We agree with the reviewer that these anatomical findings are secondary to the calcium-behavior analyses, but think they warrant a place in the main findings of the study. As the reviewer suggests, we are conducting follow-on studies that focus on the relationship between neuroanatomical measures and odor preference.

      (4) The mean imputation of missing data may have an effect on the conclusions that it is possible to draw from this dataset. In particular, as shown in Figure 1, supplemental figure 3, there is a relatively large amount of missing data, which is unevenly distributed across glomeruli and between the cell types recorded from. Strikingly, DC2 is missing in a large fraction of ORN recordings, while it is present in nearly all the PN recordings. Because DC2 is one of the glomeruli implicated in predicting MCH-OCT preference, this lack of data may be particularly likely to effect the evaluation of whether this preference can be predicted from the ORN data. Overall, mean imputation of glomerulus activity prior to PCA will artificially reduce the amount of variance contributed by the glomerulus. It would be useful to see an evaluation of which results of this paper are robust to different treatments of this missing data.

      We confirmed that the linear model of predicted OCT-MCH using PN PC2 calcium was minimally altered when we performed imputation via alternating least squares using the pca function with option ‘als’ to infill missing values on the calcium matrix 1000 times and taking the mean infilled matrix (see MATLAB documentation and Figure 1 – figure supplement 5 of Werkhoven et al., 2021). Fitted slope value for model using mean-infilled data presented in article: -0.0806 (SE = 0.028, model R2 \= 0.15), fitted slope value using ALS-imputed model: -0.0806 (SE 0.026, model R2 \= 0.17).

      Additional comments:

      (1) On line 255 there is an unnecessary condition: "non-negative positive".

      Thank you – non-negative has been removed.

      (2) In Figure 4 and the associated analysis, selection of +/- 20% interval around the observed $R^2$ appears arbitrary. This could be based on the actual confidence interval, or established by bootstrapping.

      We have replaced the +/- 20% rule by bootstrapping the calculation of behavior-behavior R2, calcium-calcium R2, and calcium-behavior R2 and propagating the uncertainties forward (Inference of correlation between latent calcium and behavior states section in Materials and Methods).

      (3) On line 409 the claim is made "These sources of variation specifically implicate the ORN-PN synapse..." While the model recapitulates the glomerulus specific variation of activity under PN synapse density variation, it also occurs under ORN identity variation, which calls into question whether the synapse distribution itself is specifically implicated, or if any variation that is expected to be glomerulus specific would be equally implicated.

      We agree with this observation. We found that varying either the ORNs or the PNs that project to each glomeruli can produce patterns of PN response variation similar to what is measured experimentally. This is consistent with the idea that the ORN-PN synapse is a key site of behaviorally-relevant variation.

      (4) Line 214 "... we conclude that the relative responses of DM2 vs DC2 in PNs largely explains an individual's preference." is too strong of a claim, based on the fact that using the PC2 explains much more of the variance, while using the stated hypothesis noticeable decreases the predictive power ($R^2$ = 0.2 vs $R^2$ = 0.12 )

      We have changed the wording here to “we conclude that the relative responses of DM2 vs DC2 in PNs compactly predict an individual’s preference.” (lines 192-193)

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This study investigated the mechanism by which PGE2 inhibits the release of insulin from pancreatic beta cells in response to glucose. The researchers used a combination of cell line experiments and studies in mice with genetic ablation of the Kv2.2 channel. Their findings suggest a novel pathway where PGE2 acts through EP2/EP4 receptors to activate PKA, which directly phosphorylates a specific site (S448) on the Kv2.2 channel, inhibiting its activity and reducing GSIS.

      Strengths:

      - The study elegantly demonstrates a potential pathway connecting PGE2, EP2/EP4 receptors, PKA, and Kv2.2 channel activity, using embryonic cell line.

      - Additional experiments in INS1 and primary mouse beta cells with altered Kv2.2 function partially support the inhibitory role of PGE2 on GSIS through Kv2.2 inhibition.

      Weaknesses:

      - A critical limitation is the use of HEK293T cells, which are not pancreatic beta cells. Functional aspects can differ significantly between these cell types.

      - The study needs to address the apparent contradiction of PKA activating insulin secretion in beta cells, while also inhibiting GSIS through the proposed mechanism.

      - A more thorough explanation is needed for the discrepancies observed between the effects of PGE2 versus Kv2.2 knockdown/mutation on the electrical activity of beta cells and GSIS.

      Thank you for your positive evaluation and constructive feedback on our study. We appreciate the concern regarding the use of HEK293T cells, which are not pancreatic beta cells and may exhibit functional differences. In response, we have repeated our key experiments using INS1 cells and primary mouse beta cells, which are more representative of the native beta cell environment. These additional experiments confirm our hypothesis and further support the role of Kv2.2 in PGE2-induced inhibition of GSIS. In beta cells, glucose-induced PKA activation is highly localized. As a result, while some PKA pathways promote insulin secretion, others may inhibit it. To directly demonstrate that PGE2-induced PKA phosphorylation of Kv2.2 is involved in the inhibitory effect on GSIS, we overexpressed the S448A mutant Kv2.2 channel in INS-1(832/13) cells. Our results show that Kv2.2-S448A channels significantly attenuate the inhibitory effect of PGE2 on GSIS, further supporting the critical role of Kv2.2 phosphorylation at S448. These data have been added to the revised Figure 7C.

      Reviewer #2 (Public Review):

      The authors identified new target elements for prostaglandin E2 (PGE2) through which insulin release can be regulated in pancreatic beta cells under physiological conditions. In vitro extracellular exposure to PGE2 could directly and dose-dependently inhibit the potassium channel Kv2.2. In vitro pharmacology revealed that this inhibition occurs through the EP2/4 receptors, which activate protein kinase A (PKA). By screening specific sites of the Kv2.2 channel, the target phosphorylation site (S448) for PKA regulation was found. The physiological relevance of the described signaling cascade was investigated and confirmed in vivo, using a Kv2.2 knockdown mouse model.

      The strength of this manuscript is the novelty of the (EP2/4-PKA-Kv2.2 channel) molecular pathway described and the comprehensive methodological toolkit the authors have relied upon.

      The introduction is detailed and contains all the information necessary to place the claims in context. Although the dataset is comprehensive and a logical lead is consistently built, there is one important point to consider: to clarify that the described signaling pathway is characteristic of normal physiological conditions and thus differs from pathological changes. It would be useful to carry out basic experiments in a diabetes model (regardless of whether this is in mice or rats).

      Thank you for your positive evaluation and insightful comment. We have clarified in the Discussion section that our findings pertain specifically to physiological conditions. We acknowledge the importance of investigating the signaling pathway in a pathological context and plan to conduct experiments using a diabetes model in future studies to explore how this pathway may differ under such conditions.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Figure 3A-C: PKA activation regulates different functional aspects in beta cells and HEK293T cells. It is well known that PKA activation enhances insulin secretion in beta cells, therefore the mechanisms that allow the same pathway at the same time to inhibit GSIS are not clear and should be addressed by experiments in beta cells.

      Thank you for your insightful comment. Specificity and versatility in cAMP-PKA signaling are governed by the spatial localization and temporal dynamics of the signal. In beta cells, glucose-induced PKA activation is highly localized (Tengholm and Gylfe, 2017). As a result, while some PKA pathways promote insulin secretion, others may inhibit it. For example, a global increase in cAMP, such as through treatment with Db-cAMP, can simultaneously activate both stimulatory and inhibitory PKA pathways, reflecting a more integrated, complex response. In previous studies, 1 mM Db-cAMP was shown to enhance GSIS in INS-1 cells (Dezaki et al., 2011). We observed that 1 mM Db-cAMP increased GSIS, but lower concentrations (10 mM) decreased GSIS (as shown in Author response image 1). These findings suggest that not all PKA signaling events increase GSIS. To further investigate the role of PGE2-induced PKA phosphorylation of Kv2.2 in the inhibition of GSIS, we overexpressed the S448A mutant of Kv2.2 in INS-1 (832/13) cells. Our results showed that the Kv2.2-S448A mutant significantly attenuated the inhibitory effect of PGE2 on GSIS. These new data have been incorporated into the revised Figure 7C.

      Author response image 1.

      Effect of Db-cAMP on GSIS in INS-1 cells. Statistics for the effect of different concentrations of Db-cAMP on GSIS in INS-1(832/13) cells. One-way ANOVA with Bonferroni post hoc test. *p < 0.05; ***p < 0.001; ****p < 0.0001; n.s., not significant.

      (2) Figure 3G: One would expect that the phospho-mimetic mutation, S448D, will have an opposite effect to S448A and a similar effect as PGE2 or PKA activator in Figure 3B. There is no explanation by the authors for having the same effect in S448A and S448D.

      Thank you for your thoughtful comment. Indeed, the S448D mutation exhibited a similar effect to PGE2 on Kv2.2 channels, as we observed significantly smaller currents compared to wild-type Kv2.2 (Figure 3F). The S448D mutation mimics the phosphorylated state of S448, and since PGE2 regulates Kv2.2 channels by phosphorylating this residue, it has no further effect on the S448D mutant (Figure 3G). In contrast, the S448A mutation prevents phosphorylation at this site, which explains why PGE2 has no effect on the currents of S448A mutant Kv2.2 channels (Figure 3H). These results confirm that PGE2 modulates Kv2.2 channels specifically through phosphorylation of S448, as evidenced by the lack of effect on both the S448A and S448D mutants.

      (3) Figure 4E: Since both PGE2 and Kv2.2 KD inhibit the activity of the channel, it doesn't definitively prove whether PGE2 acts through Kv2.2 in INS-1 cells. A complementary experiment should be done in which overactivation of Kv2.2 rescues the effect of PGE2. For example, with the S448A form of the channel.

      We appreciate your comment and valuable suggestion. Knockdown of Kv2.2 abrogated the inhibitory effect of PGE2 on I<sub>K</sub> currents in INS-1 cells (Figure 4E and F), which strongly indicates that PGE2 acts through Kv2.2. While we agree that the suggested complementary experiment with Kv2.2 overactivation (e.g., using the S448A mutant) could provide additional insights, we believe the current data sufficiently support our conclusion, as the knockdown of Kv2.2 eliminates the observed PGE2 effect, providing direct evidence of the channel's involvement.

      (4) Figure 5C: This result requires further explanation. If PGE2 downregulates Kv2.2 activity and has an inhibitory effect on GSIS, why does Kv2.2 KD have the opposite effect?

      The knockdown of Kv2.2 (Fig. 5C) reduced action potential (AP) firing rates compared to the scramble control (Fig. 5B), which is expected because Kv2.2 is critical for maintaining AP firing. When Kv2.2 is knocked down, the reduced AP firing diminishes the system’s responsiveness to further modulation by PGE2. This is because PGE2 exerts its effects primarily through Kv2.2 channels. Therefore, in the Kv2.2 knockdown condition, PGE2 does not exert an additional inhibitory effect on AP firing rates, as the channels critical for its action are already impaired.

      (5) Figure 5D - The EP1-EP4 receptor antibodies should be validated at least in INS-1(832/13) cells using knockdowns.

      Thank you for your suggestion. We have validated the EP1-EP4 receptor antibodies in INS-1(832/13) cells using knockdown experiments. The validation results, including confirmation of specificity and knockdown efficiency, are provided in Supplemental Figure S2.

      (6) Figure 7B - These experiments don't necessarily prove that PGE2 acts directly through Kv2.2 inhibition. Using the S448A mutation in these experiments could prove this point.

      Thank you for this valuable suggestion. We have now overexpressed the S448A mutant Kv2.2 channels in INS-1(832/13) cells, and the results demonstrate that Kv2.2-S448A channels significantly reduce the inhibitory effect of PGE2 on GSIS. These new data have been incorporated into the revised Figure 7C.

      Reviewer #2 (Recommendations For The Authors):

      (1) Deficiencies and inaccuracies in the description of the methods (animal numbers, name of vendors, abbreviations) and the typos in the figures (axis label) require correction.

      Thank you for pointing this out. We have carefully reviewed the manuscript and the figures, making the necessary corrections to address the deficiencies in the methods section and the typos in the figure axis labels.

      (2) Reducing the number of figures (Figures 7/C-E: knockout mouse line test and Figure1/HEK cell experiments could be part of supplementary) and paragraphs would make the manuscript more compact and powerful. It would also ease its reading for non-experts.

      Thank you for your suggestion. We have moved Figures 7C-E to the supplementary data (Supplemental Figure S1) to streamline the main manuscript.

      (3) Multiple immunostainings for EP receptors in insulinoma cells or pancreatic islets would be representative.

      Due to the rabbit-derived nature of the antibodies (EP1, EP2, EP4), performing multiple immunostainings on the same samples is not feasible due to potential cross-reactivity. However, the immunohistochemistry images demonstrate that each antibody labels more than 90% of the cells, indicating that β-cell express different subtypes of EP receptors simultaneously.

      (4) The antagonists chosen (AH6809, AH23848) are non-specific. Experiments should be re-run (at least some) under more stringent conditions.

      Thank you for your suggestion. AH6809 and AH23848 are well-documented, widely used antagonists in the literature. To further strengthen our findings, we have included additional, widely-used antagonists: the EP2-specific antagonist TG4155 and the EP4-specific antagonist GW627368. The results obtained with these new antagonists were consistent with those observed using AH6809 and AH23848. These updated data are now included in the revised Figure 4I and 4J.

      (5) It would be very helpful to indeed emphasise that this work is for physiological conditions and that it is (or is not) modified in diabetes. Maybe even irrelevant for diabetes (?). This needs to be clarified and supported by data even if one could assume the authors intend to have a follow-up entirely dedicated to pathological changes, perhaps.

      Thank you for this insightful comment. We have clarified in the Discussion that our findings are specific to physiological conditions. To address this point, we have added the following statement:

      "Importantly, our findings pertain to physiological conditions. While we demonstrate the inhibitory effects of PGE2 on Kv2.2 channels in normal b-cells, the role of this pathway under diabetic conditions remains to be investigated and will be the focus of future studies."

      Dezaki K, Damdindorj B, Sone H, Dyachok O, Tengholm A, Gylfe E, Kurashina T, Yoshida M, Kakei M, Yada T (2011) Ghrelin attenuates cAMP-PKA signaling to evoke insulinostatic cascade in islet beta-cells. Diabetes 60:2315-2324.

      Tengholm A, Gylfe E (2017) cAMP signalling in insulin and glucagon secretion. Diabetes Obes Metab 19 Suppl 1:42-53.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study attempts to resolve an apparent paradox of rapid evolutionary rates of multi-copy gene systems by using a theoretical model that integrates two classic population models. While the conceptual framework is intuitive and thus useful, the specific model is perplexing and difficult to penetrate for non-specialists. The data analysis of rRNA genes provides inadequate support for the conclusions due to a lack of consideration of technical challenges, mutation rate variation, and the relationship between molecular processes and model parameters.

      Overall Responses:

      Since the eLife assessment succinctly captures the key points of the reviews, the reply here can be seen as the overall responses to the summed criticisms. We believe that the overview should be sufficient to address the main concerns, but further details can be found in the point-by-point responses below. The overview covers the same grounds as the provisional responses (see the end of this rebuttal) but is organized more systematically in response to the reviews. The criticisms together fall into four broad areas. 

      First, the lack of engagement with the literature, particularly concerning Cannings models and non-diffusive limits. This is the main rebuttal of the companion paper (eLife-RP-RA-2024-99990). The literature in question is all in the WF framework and with modifications, in particular, with the introduction of V(K). Nevertheless, all WF models are based on population sampling. The Haldane model is an entirely different model of genetic drift, based on gene transmission. Most importantly, the WF models and the Haldane model differ in the ability to handle the four paradoxes presented in the two papers. These paradoxes are all incompatible with the WF models.

      Second, the poor presentation of the model that makes the analyses and results difficult to interpret. In retrospect, we fully agree and thank all the reviewers for pointing them out. Indeed, we have unnecessarily complicated the model. Even the key concept that defines the paradox, which is the effective copy number of rRNA genes, is difficult to comprehend. We have streamlined the presentation now. Briefly, the complexity arose from the general formulation permitting V(K) ≠ E(K) even for single copy genes. (It would serve the same purpose if we simply let V(K) = E(K) for single copy genes.) The sentences below, copied from the new abstract, should clarify the issue. The full text in the Results section has all the details.

      “On average, rDNAs have C ~ 150 - 300 copies per haploid in humans. While a neutral mutation of a single-copy gene would take 4N generations (N being the population size of an ideal population) to become fixed, the time should be 4NC* generations for rRNA genes (C* being the effective copy number). Note that C* >> 1, but C* < (or >) C would depend on the drift strength. Surprisingly, the observed fixation time in mouse and human is < 4N, implying the paradox of C* < 1.”

      Third, the confusion about which rRNA gene is being compared with which homology, as there are hundreds of them. We should note that the effective copy number C* indicates that the rRNA gene arrays do not correspond with the “gene locus” concept. This is at the heart of the confusion we failed to remove clearly. We now use the term “pseudo-population” to clarify the nature of rDNA variation and evolution. The relevant passage is reproduced from the main text shown below.

      “The pseudo-population of ribosomal DNA copies within each individual

      While a human haploid with 200 rRNA genes may appear to have 200 loci, the concept of "gene loci" cannot be applied to the rRNA gene clusters. This is because DNA sequences can spread from one copy to others on the same chromosome via replication slippage. They can also spread among copies on different chromosomes via gene conversion and unequal crossovers (Nagylaki 1983; Ohta and Dover 1983; Stults, et al. 2008; Smirnov, et al. 2021). Replication slippage and unequal crossovers would also alter the copy number of rRNA genes. These mechanisms will be referred to collectively as the homogenization process. Copies of the cluster on the same chromosome are known to be nearly identical in sequences (Hori, et al. 2021; Nurk, et al. 2022). Previous research has also provided extensive evidence for genetic exchanges between chromosomes (Krystal, et al. 1981; Arnheim, et al. 1982; van Sluis, et al. 2019).

      In short, rRNA gene copies in an individual can be treated as a pseudo-population of gene copies. Such a pseudo-population is not Mendelian but its genetic drift can be analyzed using the branching process (see below). The pseudo-population corresponds to the "chromosome community" proposed recently (Guarracino, et al. 2023). As seen in Fig. 1C, the five short arms harbor a shared pool of rRNA genes that can be exchanged among them. Fig. 1D presents the possible molecular mechanisms of genetic drift within individuals whereby mutations may spread, segregate or disappear among copies. Hence, rRNA gene diversity or polymorphism refers to the variation across all rRNA copies, as these genes exist as paralogs rather than orthologs. This diversity can be assessed at both individual and population levels according to the multi-copy nature of rRNA genes.”

      Fourth, the lack of consideration of many technical challenges. We have responded to the criticisms point-by-point below. One of the main criticisms is about mutation rate differences between single-copy and rRNA genes. We did in fact alluded to the parity in mutation rate between them in the original text but should have presented this property more prominently as is done now. Below is copied from the revised text:

      “We now consider the evolution of rRNA genes between species by analyzing the rate of fixation (or near fixation) of mutations. Polymorphic variants are filtered out in the calculation. Note that Eq. (3) shows that the mutation rate, m, determines the long-term evolutionary rate, l. Since we will compare the l values between rRNA and single-copy genes, we have to compare their mutation rates first by analyzing their long-term evolution. As shown in Table S1, l falls in the range of 50-60 (differences per Kb) for single copy genes and 40 – 70 for the non-functional parts of rRNA genes. The data thus suggest that rRNA and single-copy genes are comparable in mutation rate. Differences between their l values will have to be explained by other means.”

      While the overview should address the key issues, we now present the point-by-point response below. 

      Public Reviews:

      Reviewer #1 (Public Review):

      The manuscript by Wang et al is, like its companion paper, very unusual in the opinion of this reviewer. It builds off of the companion theory paper's exploration of the "Wright-Fisher Haldane" model but applies it to the specific problem of diversity in ribosomal RNA arrays.

      The authors argue that polymorphism and divergence among rRNA arrays are inconsistent with neutral evolution, primarily stating that the amount of polymorphism suggests a high effective size and thus a slow fixation rate, while we, in fact, observe relatively fast fixation between species, even in putatively non-functional regions.

      They frame this as a paradox in need of solving, and invoke the WFH model.

      The same critiques apply to this paper as to the presentation of the WFH model and the lack of engagement with the literature, particularly concerning Cannings models and non-diffusive limits. However, I have additional concerns about this manuscript, which I found particularly difficult to follow.

      Response 1: We would like to emphasize that, despite the many modified WF models, there has not been a model for quantifying genetic drift in multi-copy gene systems, due to the complexity of two levels of genetic drift – within individuals as well as between individuals of the population. We will address this question in the revised manuscript (Ruan, et al. 2024) and have included a mention of it in the text as follows:

      “In the WF model, gene frequency is governed by 1/N (or 1/2_N_ in diploids) because K would follow the Poisson distribution whereby V(K) = E(K). As E(K) is generally ~1, V(K) would also be ~ 1. In this backdrop, many "modified WF" models have been developed(Der, et al. 2011), most of them permitting V(K) ≠ E(K) (Karlin and McGregor 1964; Chia and Watterson 1969; Cannings 1974). Nevertheless, paradoxes encountered by the standard WF model apply to these modified WF models as well because all WF models share the key feature of gene sampling (see below and (Ruan, et al. 2024)). ”

      My first, and most major, concern is that I can never tell when the authors are referring to diversity in a single copy of an rRNA gene compared to when they are discussing diversity across the entire array of rRNA genes. I admit that I am not at all an expert in studies of rRNA diversity, so perhaps this is a standard understanding in the field, but in order for this manuscript to be read and understood by a larger number of people, these issues must be clarified.

      Response 2: We appreciate the reviewer’s feedback and acknowledge that the distinction between the diversity of individual rRNA gene copies and the diversity across the entire array of rRNA genes may not have been clearly defined in the original manuscript. The diversity in our manuscript is referring to the genetic diversity of the population of rRNA genes in the cell. To address this concern, we have revised the relevant paragraph in the text:

      “Hence, rRNA gene diversity or polymorphism refer to the variation across all rRNA copies, as these genes exist as paralogs rather than orthologs. This diversity can be assessed at both individual and population levels according to the multi-copy nature of rRNA genes.”

      Additionally, we have updated the Methods section to include a detailed description of how diversity is measured as follows:

      “All mapping and analysis are performed among individual copies of rRNA genes.

      Each individual was considered as a psedo-population of rRNA genes and the diversity of rRNA genes was calculated using this psedo-population of rRNA genes.”

      The authors frame the number of rRNA genes as roughly equivalent to expanding the population size, but this seems to be wrong: the way that a mutation can spread among rRNA gene copies is fundamentally different than how mutations spread within a single copy gene. In particular, a mutation in a single copy gene can spread through vertical transmission, but a mutation spreading from one copy to another is fundamentally horizontal: it has to occur because some molecular mechanism, such as slippage, gene conversion, or recombination resulted in its spread to another copy. Moreover, by collapsing diversity across genes in an rRNA array, the authors are massively increasing the mutational target size.   

      For example, it's difficult for me to tell if the discussion of heterozygosity at rRNA genes in mice starting on line 277 is collapsed or not. The authors point out that Hs per kb is ~5x larger in rRNA than the rest of the genome, but I can't tell based on the authors' description if this is diversity per single copy locus or after collapsing loci together. If it's the first one, I have concerns about diversity estimation in highly repetitive regions that would need to be addressed, and if it's the second one, an elevated rate of polymorphism is not surprising, because the mutational target size is in fact significantly larger.

      Response 3: As addressed in previous Response2, the measurement of diversity or heterozygosity of rRNA genes is consistently done by combining copies, as there is no concept of single gene locus for rDNAs. We agree that by combining the diversity across multiple rRNA gene copies into one measurement, the mutational target size is effectively increased, leading to higher observed levels of diversity than one gene. This is in line with our text:

      “If we use the polymorphism data, it is as if rDNA array has a population size 5.2 times larger than single-copy genes. Although the actual copy number on each haploid is ~ 110, these copies do not segregate like single-copy genes and we should not expect N* to be 100 times larger than N. The HS results confirm the prediction that rRNA genes should be more polymorphic than single-copy genes.”

      Under this consensus, the reviewer points out that the having a large number of rRNA genes is not equivalent to having a larger population size, because the spreading of mutations among rDNA copies within a species involves two stages: within individual (horizontal transmission) and between individuals (vertical transmission). Let’s examine how the mutation spreading mechanisms influence the population size of rRNA genes.

      First, an increase in the copy number of rRNA genes dose increase the actual population size (CN) of rRNA genes. If reviewer is referring to the effective population size of rRNA genes in the context of diversity (N* = CN/V*(K)), then an increase in C would also increase N*. In addition, the linkage among copies would reduce the drift effect, leading to increase diversity. Conversely, homogenization mechanism, like gene conversion and unequal crossing-over would reduce genetic variations between copies and increase V*(K), leading to lower diversity. Therefore, the C* =C/V*(K) in mice is about 5 times larger for rRNA genes than the rest of the genome (which mainly single-copy genes), even though the actual copy number is about 110, indicating a high homogenization rate.

      Even if these issues were sorted out, I'm not sure that the authors framing, in terms of variance in reproductive success is a useful way to understand what is going on in rRNA arrays. The authors explicitly highlight homogenizing forces such as gene conversion and replication slippage but then seem to just want to incorporate those as accounting for variance in reproductive success. However, don't we usually want to dissect these things in terms of their underlying mechanism? Why build a model based on variance in reproductive success when you could instead explicitly model these homogenizing processes? That seems more informative about the mechanism, and it would also serve significantly better as a null model, since the parameters would be able to be related to in vitro or in vivo measurements of the rates of slippage, gene conversion, etc.

      In the end, I find the paper in its current state somewhat difficult to review in more detail, because I have a hard time understanding some of the more technical aspects of the manuscript while so confused about high-level features of the manuscript. I think that a revision would need to be substantially clarified in the ways I highlighted above.

      Response 4: We appreciate your perspective on modeling the homogenizing processes of rRNA gene arrays.

      We employ the WFH model to track the drift effect of the multi-copy gene system. In the context of the Haldane model, the term K is often referred to as reproductive success, but it might be more accurate to interpret it as “transmission rate” in this study. As stated in the caption of Figure 1D, two new mutations can have very large differences in individual output (K) when transmitted to the next generation through homogenization process.

      Regarding why we did not explicitly model different mechanisms of homogenization, previous elegant models of multigene families have involved mechanisms like unequal crossing over(Smith 1974a; Ohta 1976; Smith 1976) or gene conversion (Nagylaki 1983; Ohta 1985) for concerted evolution, or using conversion to approximate the joint effect of conversion and crossing over (Ohta and Dover 1984). However, even when simplifying the gene conversion mechanism, modeling remains challenging due to controversial assumptions, such as uniform homogenization rate across all gene members (Dover 1982; Ohta and Dover 1984). No models can fully capture the extreme complexity of factors, while these unbiased mechanisms are all genetic drift forces that contribute to changes in mutant transmission. Therefore, we opted for a more simplified and collective approach using V*(K) to see the overall strength of genetic drift.

      We have discussed the reason for using V*(K) to collectively represent the homogenization effect in Discussion. As stated in our manuscript:

      “There have been many rigorous analyses that confront the homogenizing mechanisms directly. These studies (Smith 1974b; Ohta 1976; Dover 1982; Nagylaki 1983; Ohta and Dover 1983) modeled gene conversion and unequal cross-over head on. Unfortunately, on top of the complexities of such models, the key parameter values are rarely obtainable. In the branching process, all these complexities are wrapped into V*(K) for formulating the evolutionary rate. In such a formulation, the collective strength of these various forces may indeed be measurable, as shown in this study.”

      Reviewer #2 (Public Review):

      Summary:

      Multi-copy gene systems are expected to evolve slower than single-copy gene systems because it takes longer for genetic variants to fix in the large number of gene copies in the entire population. Paradoxically, their evolution is often observed to be surprisingly fast. To explain this paradox, the authors hypothesize that the rapid evolution of multi-copy gene systems arises from stronger genetic drift driven by homogenizing forces within individuals, such as gene conversion, unequal crossover, and replication slippage. They formulate this idea by combining the advantages of two classic population genetic models -- adding the V(k) term (which is the variance in reproductive success) in the Haldane model to the Wright-Fisher model. Using this model, the authors derived the strength of genetic drift (i.e., reciprocal of the effective population size, Ne) for the multi-copy gene system and compared it to that of the single-copy system. The theory was then applied to empirical genetic polymorphism and divergence data in rodents and great apes, relying on comparison between rRNA genes and genome-wide patterns (which mostly are single-copy genes). Based on this analysis, the authors concluded that neutral genetic drift could explain the rRNA diversity and evolution patterns in mice but not in humans and chimpanzees, pointing to a positive selection of rRNA variants in great apes.

      Strengths:

      Overall, the new WFH model is an interesting idea. It is intuitive, efficient, and versatile in various scenarios, including the multi-copy gene system and other cases discussed in the companion paper by Ruan et al.

      Weaknesses:

      Despite being intuitive at a high level, the model is a little unclear, as several terms in the main text were not clearly defined and connections between model parameters and biological mechanisms are missing. Most importantly, the data analysis of rRNA genes is extremely over-simplified and does not adequately consider biological and technical factors that are not discussed in the model. Even if these factors are ignored, the authors' interpretation of several observations is unconvincing, as alternative scenarios can lead to similar patterns. Consequently, the conclusions regarding rRNA genes are poorly supported. Overall, I think this paper shines more in the model than the data analysis, and the modeling part would be better presented as a section of the companion theory paper rather than a stand-alone paper. My specific concerns are outlined below.

      Response 5: We appreciate the reviewer’s feedback and recognize the need for clearer definitions of key terms. We have made revisions to ensure that each term is properly defined upon its first use.

      Regarding the model’s simplicity, as in the Response4, our intention was to create a framework that captures the essence of how mutant copies spread by chance within a population, relying on the variance in transmission rates for each copy (V(K)). By doing so, we aimed to incorporate the various homogenization mechanisms that do not affect single-copy genes, highlighting the substantially stronger genetic drift observed in multi-copy systems compared to single-copy genes. We believe that simplifying the model was necessary to make it more accessible and practical for real-world data analysis and provides a useful approximation that can be applied broadly. It is clearly an underestimate the actual rate as some forces with canceling effects might not have been accounted for.

      (1) Unclear definition of terms

      Many of the terms in the model or the main text were not clearly defined the first time they occurred, which hindered understanding of the model and observations reported. To name a few:

      (i) In Eq(1), although C* is defined as the "effective copy number", it is unclear what it means in an empirical sense. For example, Ne could be interpreted as "an ideal WF population with this size would have the same level of genetic diversity as the population of interest" or "the reciprocal of strength of allele frequency change in a unit of time". A few factors were provided that could affect C*, but specifically, how do these factors impact C*? For example, does increased replication slippage increase or decrease C*? How about gene conversion or unequal cross-over? If we don't even have a qualitative understanding of how these processes influence C*, it is very hard to make interpretations based on inferred C*. How to interpret the claim on lines 240-241 (If the homogenization is powerful enough, rRNA genes would have C*<1)? Please also clarify what C* would be, in a single-copy gene system in diploid species.

      Response 6: We apology for the confusion caused by the lack of clear definitions in the initial manuscript. We recognize that this has led to misunderstandings regarding the concept we presented. Our aim was to demonstrate the concerted evolution in multi-copy gene systems, involving two levels of “effective copy number” relative to single-copy genes: first, homogenization within populations then divergence between species. We used C* and Ne* to try to designated the two levels driven by the same homogenization force, which complicated the evolutionary pattern.

      To address these issues, we have simplified the model and revised the abstract to prevent any misunderstandings:

      “On average, rDNAs have C ~ 150 - 300 copies per haploid in humans. While a neutral mutation of a single-copy gene would take 4_N_ (N being the population size) generations to become fixed, the time should be 4_NC* generations for rRNA genes where 1<< C* (C* being the effective copy number; C* < C or C* > C would depend on the drift strength). However, the observed fixation time in mouse and human is < 4_N, implying the paradox of C* < 1. Genetic drift that encompasses all random neutral evolutionary forces appears as much as 100 times stronger for rRNA genes as for single-copy genes, thus reducing C* to < 1.”

      Thus, it should be clear that the fixation time as well as the level of polymorphism represent the empirical measures of C*.We have also revised the relevant paragraph in the text to define C* and V*(K) and removed Eq. 2 for clarity:

      “Below, we compare the strength of genetic drift in rRNA genes vs. that of single-copy genes using the Haldane model (Ruan, et al. 2024). We shall use * to designate the equivalent symbols for rRNA genes; for example, E(K) vs. E*(K). Both are set to 1, such that the total number of copies in the long run remains constant.

      For simplicity, we let V(K) = 1 for single-copy genes. (If we permit V(K) ≠ 1, the analyses will involve the ratio of V*(K) and V(K) to reach the same conclusion but with unnecessary complexities.) For rRNA genes,  V*(K) ≥ 1 may generally be true because K for rDNA mutations are affected by a host of homogenization factors including replication slippage, unequal cross-over, gene conversion and other related mechanisms not operating on single copy genes. Hence,

      where C is the average number of rRNA genes in an individual and V*(K) reflects the homogenization process on rRNA genes (Fig. 1D). Thus,

      C* = C/V*(K)

      represents the effective copy number of rRNA genes in the population, determining the level of genetic diversity relative to single-copy genes. Since C is in the hundreds and V*(K) is expected to be > 1, the relationship of 1 << C* ≤ C is hypothesized. Fig. 1D is a simple illustration that the homogenizing process may enhance V*(K) substantially over the WF model.

      In short, genetic drift of rRNA genes would be equivalent to single copy genes in a population of size NC* (or N*). Since C* >> 1 is hypothesized, genetic drift for rRNA genes is expected to be slower than for single copy genes.”

      (ii) In Eq(1), what exactly is V*(K)? Variance in reproductive success across all gene copies in the population? What factors affect V*(K)? For the same population, what is the possible range of V*(K)/V(K)? Is it somewhat bounded because of biological constraints? Are V*(K) and C*(K) independent parameters, or does one affect the other, or are both affected by an overlapping set of factors?

      Response 7: - In Eq(1), what exactly is V*(K)?  In Eq(1), V*(K) refers to the variance in the number of progeny to whom the gene copy of interest is transmitted (K) over a specific time interval. When considering evolutionary divergence between species, V*(K) may correspond to the divergence time.

      - What factors affect V*(K)? For the same population, what is the possible range of V*(K)/V(K)? Is it somewhat bounded because of biological constraints?  “V*(K) for rRNA genes is likely to be much larger than V(K) for single-copy genes, because K for rRNA mutations may be affected by a host of homogenization factors including replication slippage, unequal cross-over, gene conversion and other related mechanisms not operating on single-copy genes. For simplicity, we let V(K) = 1 (as in a WF population) and V*(K) ≥ 1.” Thus, the V*(K)/V(K) = V*(K) can potentially reach values in the hundreds, and may even exceed C, resulting in C*(= C/V*(K)) values less than 1. Biological constraints that could limit this variance include the minimum copy number within individuals, sequence constraints in functional regions, and the susceptibility of chromosomes with large arrays to intrachromosomal crossover (which may lead to a reduction in copy number)(Eickbush and Eickbush 2007), potentially reducing the variability of K.

      - Are V*(K) and C*(K) independent parameters, or does one affect the other, or are both affected by an overlapping set of factors?  There is no C*(K), the C* is defined as follows in the text:

      “C* = C/V*(K) represents the effective copy number of rRNA genes, reflecting the level of genetic diversity relative to single-copy genes. Since C is in the hundreds and V*(K) is expected to be > 1, the relationship of 1 << C* ≤ C is hypothesized.” The factors influencing V*(K) directly affect C* due to this relationship.

      (iii) In the multi-copy gene system, how is fixation defined? A variant found at the same position in all copies of the rRNA genes in the entire population?

      Response 8: We appreciate the reviewer's suggestion and have now provided a clear definition of fixation in the context of multi-copy genes within the manuscript.

      “For rDNA mutations, fixation must occur in two stages – fixation within individuals and among individuals in the population. (Note that a new mutation can be fixed via homogenization, thus making rRNA gene copies in an individual a pseudo-population.)”

      The evolutionary dynamics of multi-copy genes differ from those of single-copy (Mendelian) genes, which mutate, segregate and evolve independently in the population. Fixation in multi-copy genes, such as rRNA genes, is influenced by their ability to transfer genetic information among their copies through nonreciprocal exchange mechanisms, like gene conversion and unequal crossover (Ohta and Dover 1984). These processes can cause fluctuations in the number of mutant copies within an individual's lifetime and facilitate the spread of a mutant allele across all copies even in non-homologous chromosomes. Over time, this can result in the mutant allele replacing all preexisting alleles throughout the population, leading to fixation (Ohta 1976) meaning that the same variant will eventually be present at the corresponding position in all copies of the rRNA genes across the entire population. Without such homogenization processes, fixation would be unlikely to be obtained in multi-copy genes.

      (iv) Lines 199-201, HI, Hs, and HT are not defined in the context of a multi-copy gene system. What are the empirical estimators?

      Response 9: We appreciate the reviewer's comment and would like to clarify the definitions and empirical estimators for within the context of a multi-copy gene system in the text:

      “A standard measure of genetic drift is the level of heterozygosity (H). At the mutation-selection equilibrium

      where μ is the mutation rate of the entire gene and Ne is the effective population size. In this study, Ne = N for single-copy gene and Ne = C*N for rRNA genes. The empirical measure of nucleotide diversity H is given by

      where L is the gene length (for each copy of rRNA gene, L ~ 43kb) and pi is the variant frequency at the i-th site.

      We calculate H of rRNA genes at three levels – within-individual, within-species and then, within total samples (HI, HS and HT, respectively). HS and HT are standard population genetic measures (Hartl, et al. 1997; Crow and Kimura 2009). In calculating HS, all sequences in the species are used, regardless of the source individuals. A similar procedure is applied to HT. The HI statistic is adopted for multi-copy gene systems for measuring within-individual polymorphism. Note that copies within each individual are treated as a pseudo-population (see Fig. 1 and text above). With multiple individuals, HI is averaged over them.”

      (v) Line 392-393, f and g are not clearly defined. What does "the proportion of AT-to-GC conversion" mean? What are the numerator and denominator of the fraction, respectively?

      Response 10: We appreciate the reviewer's comment and have revised the relevant text for clarity as well as improved the specific calculation methods for f and g in the Methods section.

      “We first designate the proportion of AT-to-GC conversion as f and the reciprocal, GC-to-AT, as g. Specifically, f represents the proportion of fixed mutations where an A or T nucleotide has been converted to a G or C nucleotide (see Methods). Given f ≠ g, this bias is true at the site level.”

      Methods:

      “Specifically, f represents the proportion of fixed mutations where an A or T nucleotide has been converted to a G or C nucleotide. The numerator for f is the number of fixed mutations from A-to-G, T-to-C, T-to-G, or A-to-C. The denominator is the total number of A or T sites in the rDNA sequence of the specie lineage.

      Similarly, g is defined as the proportion of fixed mutations where a G or C nucleotide has been converted to an A or T nucleotide. The numerator for g is the number of fixed mutations from G-to-A, C-to-T, C-to-A, or G-to-T. The denominator is the total number of G or C sites in the rDNA sequence of the specie lineage.

      The consensus rDNA sequences for the species lineage were generated by Samtools consensus (Danecek, et al. 2021) from the bam file after alignment. The following command was used:

      ‘samtools consensus -@ 20 -a -d 10 --show-ins no --show-del yes input_sorted.bam output.fa’.”

      (2) Technical concerns with rRNA gene data quality

      Given the highly repetitive nature and rapid evolution of rRNA genes, myriads of things could go wrong with read alignment and variant calling, raising great concerns regarding the data quality. The data source and methods used for calling variants were insufficiently described at places, further exacerbating the concern.

      (i) What are the accession numbers or sample IDs of the high-coverage WGS data of humans, chimpanzees, and gorillas from NCBI? How many individuals are in each species? These details are necessary to ensure reproducibility and correct interpretation of the results.

      Response 11: We apologize for not including the specific details of the sample information in the main text. All accession numbers and sample IDs for the WGS data used in this study, including mice, humans, chimpanzee, and gorilla, are already listed in Supplementary Tables S4-S5. We have revised the table captions and referenced them at the appropriate points in the Methods to ensure clarity.

      “The genome sequences of human (n = 8), chimpanzee (n = 1) and gorilla (n = 1) were sourced from National Center for Biotechnology Information (NCBI) (Supplementary Table 4). … Genomic sequences of mice (n = 13) were sourced from the Wellcome Sanger Institute’s Mouse Genome Project (MGP) (Keane, et al. 2011).

      The concern regarding the number of individuals needed to support the results will be addressed in Response 13.

      (ii) Sequencing reads from great apes and mice were mapped against the human and mouse rDNA reference sequences, respectively (lines 485-486). Given the rapid evolution of rRNA genes, even individuals within the same species differ in copy number and sequences of these genes. Alignment to a single reference genome would likely lead to incorrect and even failed alignment for some reads, resulting in genotyping errors. Differences in rDNA sequence, copy number, and structure are even greater between species, potentially leading to higher error rates in the called variants. Yet the authors provided no justification for the practice of aligning reads from multiple species to a single reference genome nor evidence that misalignment and incorrect variant calling are not major concerns for the downstream analysis.

      Response 12: While the copy number of rDNA varies in each individuals, the sequence identity among copies is typically very high (median identity of 98.7% (Nurk, et al. 2022)). Therefore, all rRNA genes were aligned against to the species-specific reference sequences, where the consensus nucleotide nearly accounts for >90% of the gene copies in the population. In minimize genotyping errors, our analysis focused exclusively on single nucleotide variants (SNVs) with only two alleles, discarding other mutation types.

      Regarding sequence divergence between species, which may have greater sequence variations, we excluded unmapped regions with high-quality reads coverage below 10. In calculation of substitution rate, we accounted for the mapping length (L), as shown in the column 3 in Table 3-5.

      We appreciate the reviewer’s comments and have provide details in the Methods.

      (vi) It is unclear how variant frequency within an individual was defined conceptually or computed from data (lines 499-501). The population-level variant frequency was calculated by averaging across individuals, but why was the averaging not weighted by the copy number of rRNA genes each individual carries? How many individuals are sampled for each species? Are the sample sizes sufficient to provide an accurate estimate of population frequencies?

      Response 13: Each individual was considered as a psedo-population of rRNA genes, varaint frequency within an individual was the proportions of mutant allele in this psedo-population. The calculation of varaint frequency is based on the number of supported reads of each individual.

      The reason for calculating population-level variant frequency by averaging across individuals is relevant in the calculation of FIS and FST. In calculating FST, the standard practice is to weigh each population equally. So, when we show FST in humans, we do not consider whether there are more Africans, Caucasians or Asians. There is a reason for not weighing them even though the population sizes could be orders of magnitude different, say, in the comparison between an ethnic minority and the main population. In the case of FIS, the issue is moot. Although copy number may range from 150 to 400 per haploid, most people have 300 – 500 copies with two haploids.

      As for the concern regarding the number the individuals needed to support of the results:

      Considering the nature of multi-copy genes, where gene members undergo continuous exchanges at a much slower rate compared to the rapid rate of random distribution of chromosomes at each generation of sexual reproduction, even a few variant copies that arise during an individual's lifetime would disperse into the gene pool in the next generation (Ohta and Dover 1984). Thus, there is minimal difference between individuals. Our analysis is also aligns with this theory, particularly in human population (FIS = 0.059), where each individual carries the majority of the population's genetic diversity. Therefore, even a single chimpanzee or gorilla individual caries sufficient diversity with its hundreds of gene copies to calculate divergence with humans.

      (vii) Fixed variants are operationally defined as those with a frequency>0.8 in one species. What is the justification for this choice of threshold? Without knowing the exact sample size of the various species, it's difficult to assess whether this threshold is appropriate.

      Response 14: First, the mutation frequency distribution is strongly bimodal (see Figure below) with a peak at zero and the other at 1. This high frequency peak starts to rise slowly at 0.8, similar to FST distribution in Figure 4C. That is why we use it as the cutoff although we would get similar results at the cutoff of 0.90 (see Table below). Second, the sample size for the calculation of mutant frequency is based on the number of reads which is usually in the tens of thousands. Third, it does not matter if the mutation frequency calculation is based on one individuals or multiple individuals because 95% of the genetic diversity of the population is captured by the gene pool within each individual.

      Author response image 1.

      Author response table 1.

      The A/T to G/C and G/C to A/T changes in apes and mouse.

      New mutants with a frequency >0.9 within an individual are considered as (nearly) fixed, except for humans, where the frequency was averaged over 8 individuals in the Table 2.

      The X-squared values for each species are as follows: 58.303 for human, 7.9292 for chimpanzee, and 0.85385 for M. m. domesticus.

      (viii) It is not explained exactly how FIS, FST, and divergence levels of rRNA genes were calculated from variant frequency at individual and species levels. Formulae need to be provided to explain the computation.

      Response 15: After we clearly defined the HI, HS, and HT in Response9, understanding FIS and F_ST_ becomes straightforward.

      “Given the three levels of heterozygosity, there are two levels of differentiation. First, FIS is the differentiation among individuals within the species, defined by

      FIS = [HS - HI]/HS  

      FIS is hence the proportion of genetic diversity in the species that is found only between individuals. We will later show FIS ~ 0.05 in human rDNA (Table 2), meaning 95% of rDNA diversity is found within individuals.

      Second, FST is the differentiation between species within the total species complex, defined as

      FST = [HT – HS]/HT 

      FST is the proportion of genetic diversity in the total data that is found only between species.”

      (3) Complete ignorance of the difference in mutation rate difference between rRNA genes and genome-wide average

      Nearly all data analysis in this paper relied on comparison between rRNA genes with the rest (presumably single-copy part) of the genome. However, mutation rate, a key parameter determining the diversity and divergence levels, was completely ignored in the comparison. It is well known that mutation rate differs tremendously along the genome, with both fine and large-scale variation. If the mutation rate of rRNA genes differs substantially from the genome average, it would invalidate almost all of the analysis results. Yet no discussion or justification was provided.

      Response 16: We appreciate the reviewer's observation regarding the potential impact of varying mutation rates across the genome. To address this concern, we compared the long-term substitution rates on rDNA and single-copy genes between human and rhesus macaque, which diverged approximately 25 million years ago. Our analysis (see Table S1 below) indicates that the substitution rate in rDNA is actually slower than the genome-wide average. This finding suggests that rRNA genes do not experience a higher mutation rate compared to single-copy genes, as stated in the text:

      “Note that Eq. (3) shows that the mutation rate, m, determines the long-term evolutionary rate, l. Since we will compare the l values between rRNA and single-copy genes, we have to compare their mutation rates first by analyzing their long-term evolution. As shown in Table S1, l falls in the range of 50-60 (differences per Kb) for single copy genes and 40 – 70 for the non-functional parts of rRNA genes. The data thus suggest that rRNA and single-copy genes are comparable in mutation rate. Differences between their l values will have to be explained by other means.”

      However, given the divergence time (Td) being equal to or smaller than Tf, even if the mutation rate per nucleotide is substantially higher in rRNA genes, these variants would not become fixed after the divergence of humans and chimpanzees without the help of strong homogenization forces. Thus, the presence of divergence sites (Table 5) still supports the conclusion that rRNA genes undergo much stronger genetic drift compared to single-copy genes.

      Related to mutation rate: given the hypermutability of CpG sites, it is surprising that the evolution/fixation rate of rRNA estimated with or without CpG sites is so close (2.24% vs 2.27%). Given the 10 - 20-fold higher mutation rate at CpG sites in the human genome, and 2% CpG density (which is probably an under-estimate for rDNA), we expect the former to be at least 20% higher than the latter.

      Response 17: While it is true that CpG sites exhibit a 10-20-fold higher mutation rate, the close evolution/fixation rates of rDNA with and without CpG sites (2.24% vs 2.27%) may be attributed to the fact that fixation rates during short-term evolutionary processes are less influenced by mutation rates alone. As observed in the Human-Macaque comparison in the table above, the substitution rate of rDNA in non-functional regions with CpG sites is 4.18%, while it is 3.35% without CpG sites, aligning with your expectation of 25% higher rates where CpG sites are involved.

      This discrepancy between the expected and observed fixation rates may be due to strong homogenization forces, which can rapidly fix or eliminate variants, thereby reducing the overall impact of higher mutation rates at CpG sites on the observed fixation rate. This suggests that the homogenization mechanisms play a more dominant role in the fixation process over short evolutionary timescales, mitigating the expected increase in fixation rates due to CpG hypermutability.

      Among the weaknesses above, concern (1) can be addressed with clarification, but concerns (2) and (3) invalidate almost all findings from the data analysis and cannot be easily alleviated with a complete revamp work.

      Recommendations for the authors:

      Reviewing Editor Comments:

      Both reviewers found the manuscript confusing and raised serious concerns. They pointed out a lack of engagement with previous literature on modeling and the presence of ill-defined terms within the model, which obscure understanding. They also noted a significant disconnection between the modeling approach and the biological processes involved. Additionally, the data analysis was deemed problematic due to the failure to consider essential biological and technical factors. One reviewer suggested that the modeling component would be more suitable as a section of the companion theory paper rather than a standalone paper. Please see their individual reviews for their overall assessment.

      Reviewer #2 (Recommendations For The Authors):

      Beyond my major concerns, I have numerous questions about the interpretation of various findings:

      Lines 62-63: Please explain under what circumstance Ne=N/V(K) is biologically nonsensical and why.

      Response 18: “Biologically non-sensical” is the term used in (Chen, et al. 2017). We now used the term “biologically untenable” but the message is the same. How does one get V(K) ≠ E(K) in the WF sampling? It is untenable under the WF structure. Kimura may be the first one to introduce V(K) ≠ E(K) into the WF model and subsequent papers use the same sort of modifications that are mathematically valid but biologically dubious. As explained extensively in the companion paper, the modifications add complexities but do not give the WF models powers to explain the paradoxes.

      Lines 231-234: The claim about a lower molecular evolution rate (lambda) is inaccurate - under neutrality, the molecular evolution rate is always the same as the mutation rate. It is true that when the species divergence Td is not much greater than fixation time Tf, the observed number of fixed differences would be substantially smaller than 2*mu*Td, but the lower divergence level does not mean that the molecular evolution is slower. In other words, in calculating the divergence level, it is the time term that needs to be adjusted rather than the molecular evolution rate.

      Response 19: Thanks, we agree that the original wording was not accurate. It is indeed the substitution rate rather than the molecular evolution rate that is affected when species divergence time Td is not much greater than the fixation time Tf. We have revised the relevant text in the manuscript to correct this and ensure clarity.

      Lines 277-279: Hs for rRNA is 5.2x fold than the genome average. This could be roughly translated as Ne*/Ne=5.2. According to Eq 2: (1/Ne*)/(1/Ne)= Vh/C*, it can be drived that mean Ne*/Ne=C*/Vh. Then why do the authors conclude "C*=N*/N~5.2" in line 278? Wouldn't it mean that C*/Vh is roughly 5.2?

      Response 20: We apologize for the confusion. To prevent misunderstandings, we have revised Equation 1 and deleted Equation 2 from the manuscript. Please refer to the Response6 for further details.

      Lines 291-292: What does "a major role of stage I evolution" mean? How does it lead to lower FIS?

      Response 21: We apologize for the lack of clarity in our original description, and we have revised the relevant content to make them more directly.

      “In this study, we focus on multi-copy gene systems, where the evolution takes place in two stages: both within (stage I) and between individuals (stage II).”

      FIS for rDNA among 8 human individuals is 0.059 (Table 2), much smaller than 0.142 in M. m. domesticus mice, indicating minimal genetic differences across human individuals and high level of genetic identity in rDNAs between homologous chromosomes among human population. … Correlation of polymorphic sites in IGS region is shown in Supplementary Fig. 1. The results suggest that the genetic drift due to the sampling of chromosomes during sexual reproduction (e.g., segregation and assortment) is augmented substantially by the effects of homogenization process within individual. Like those in mice, the pattern indicates that intra-species polymorphism is mainly preserved within individuals.”

      Line 297-300: why does the concentration at very allele frequency indicate rapid homogenization across copies? Suppose there is no inter-copy homogenization, and each copy evolves independently, wouldn't we still expect the SFS to be strongly skewed towards rare variants? It is completely unclear how homogenization processes are expected to affect the SFS.

      Response 22: We appreciate the reviewer’s insightful comments and apologize for any confusion in our original explanation. To clarify:

      If there is no inter-copy homogenization and each copy evolves independently, it would effectively result in an equivalent population size that is C times larger than that of single-copy genes. However, given the copies are distributed on five chromosomes, if the copies within a chromosome were fully linked, there would be no fixation at any sites. Considering the data presented in Table 4, where the substitution rate in rDNA is higher than in single-copy genes, this suggests that additional forces must be acting to homogenize the copies, even across non-homologous chromosomes.

      Regarding the specific data presented in the Figure 3, the allele frequency spectrum is based on human polymorphism sites and is a folded spectrum, as the ancestral state of the alleles was not determined. High levels of homogenization would typically push variant mutations toward the extremes of the SFS, leading to fewer intermediate-frequency alleles and reduced heterozygosity. The statement that "allele frequency spectrum is highly concentrated at very low frequency within individuals" was intended to emphasize the localized distribution of variants and the high identity at each site. However, we recognize that it does not accurately reflect the role of homogenization and this conclusion cannot be directly inferred from the figure as presented. Therefore, we have removed the sentence in the text.

      The evidence of gBGC in rRNA genes in great apes does not help explain the observed accelerated evolution of rDNA relative to the rest of the genome. Evidence of gBGC has been clearly demonstrated in a variety of species, including mice. It affects not only rRNA genes but also most parts of the genome, particularly regions with high recombination rates. In addition, gBGC increases the fixation probability of W>S mutations but suppresses the fixation of S>W mutations, so it is not obvious how gBGC will increase or decrease the molecular evolution rate overall.

      Response 23: We have thoroughly rewritten the last section of Results. The earlier writing has misplaced the emphasis, raising many questions (as stated above). To answer them, we would have to present a new set of equations thus adding unnecessary complexities to the paper. Here is the streamlined and more logical flow of the new section.

      First, Tables 4 and 5 have shown the accelerated evolution of the rRNA genes. We have now shown that rRNA genes do not have higher mutation rates. Below is copied from the revised text:

      “We now consider the evolution of rRNA genes between species by analyzing the rate of fixation (or near fixation) of mutations. Polymorphic variants are filtered out in the calculation. Note that Eq. (3) shows that the mutation rate, m, determines the long-term evolutionary rate, l. Since we will compare the l values between rRNA and single-copy genes, we have to compare their mutation rates first by analyzing their long-term evolution. As shown in Table S1 l falls in the range of 50-60 (differences per Kb) for single copy genes and 40 – 70 for the non-functional parts of rRNA genes. The data thus suggest that rRNA and single-copy genes are comparable in mutation rate. Differences between their l values will have to be explained by other means.”

      Second, we have shown that the accelerated evolution in mice is likely due to genetic drift, resulting in faster fixation of neutral variants. We also show that this is unlikely to be true in humans and chimpanzees; hence selection is the only possible explanation. The section below is copied from the revised text. It shows the different patterns of gene conversions between mice and apes, in agreement with the results of Tables 4 and 5. In essence, it shows that the GC ratio in apes is shifting to a new equilibrium, which is equivalent to a new adaptive peak. Selection is driving the rDNA genes to move to the new adaptive peak.

      Revision - “Thus, the much accelerated evolution of rRNA genes between humans and chimpanzees cannot be entirely attributed to genetic drift. In the next and last section, we will test if selection is operating on rRNA genes by examining the pattern of gene conversion. 

      3) Positive selection for rRNA mutations in apes, but not in mice – Evidence from gene conversion patterns

      For gene conversion, we examine the patterns of AT-to-GC vs. GC-to-AT changes. While it has been reported that gene conversion would favor AT-to-GC over GC-to-AT conversion (Jeffreys and Neumann 2002; Meunier and Duret 2004) at the site level, we are interested at the gene level by summing up all conversions across sites. We designate the proportion of AT-to-GC conversion as f and the reciprocal, GC-to-AT, as g. Both f and g represent the proportion of fixed mutations between species (see Methods). So defined, f and g are influenced by the molecular mechanisms as well as natural selection. The latter may favor a higher or lower GC ratio at the genic level between species. As the selective pressure is distributed over the length of the gene, each site may experience rather weak pressure.

      Let p be the proportion of AT sites and q be the proportion of GC sites in the gene. The flux of AT-to-GC would be pf and the flux in reverse, GC-to-AT, would be qg. At equilibrium, pf = qg. Given f and g, the ratio of p and q would eventually reach p/q \= g/f. We now determine if the fluxes are in equilibrium (pf =qg). If they are not, the genic GC ratio is likely under selection and is moving to a different equilibrium.

      In these genic analyses, we first analyze the human lineage (Brown and Jiricny 1989; Galtier and Duret 2007). Using chimpanzees and gorillas as the outgroups, we identified the derived variants that became nearly fixed in humans with frequency > 0.8 (Table 6). The chi-square test shows that the GC variants had a significantly higher fixation probability compared to AT. In addition, this pattern is also found in chimpanzees (p < 0.001). In M. m. domesticus (Table 6), the chi-square test reveals no difference in the fixation probability between GC and AT (p = 0.957). Further details can be found in Supplementary Figure 2. Overall, a higher fixation probability of the GC variants is found in human and chimpanzee, whereas this bias is not observed in mice.

      Tables 6-7 here

      Based on Table 6, we could calculate the value of p, q, f and g (see Table 7). Shown in the last row of Table 7, the (pf)/(qg) ratio is much larger than 1 in both the human and chimpanzee lineages. Notably, the ratio in mouse is not significantly different from 1. Combining Tables 4 and 7, we conclude that the slight acceleration of fixation in mice can be accounted for by genetic drift, due to gene conversion among rRNA gene copies. In contrast, the different fluxes corroborate the interpretations of Table 5 that selection is operating in both humans and chimpanzees.”

      References

      Arnheim N, Treco D, Taylor B, Eicher EM. 1982. Distribution of ribosomal gene length variants among mouse chromosomes. Proc Natl Acad Sci U S A 79:4677-4680.

      Brown T, Jiricny J. 1989. Repair of base-base mismatches in simian and human cells. Genome / National Research Council Canada = Génome / Conseil national de recherches Canada 31:578-583.

      Cannings C. 1974. The latent roots of certain Markov chains arising in genetics: A new approach, I. Haploid models. Advances in Applied Probability 6:260-290.

      Chen Y, Tong D, Wu CI. 2017. A New Formulation of Random Genetic Drift and Its Application to the Evolution of Cell Populations. Mol Biol Evol 34:2057-2064.

      Chia AB, Watterson GA. 1969. Demographic effects on the rate of genetic evolution I. constant size populations with two genotypes. Journal of Applied Probability 6:231-248.

      Crow JF, Kimura M. 2009. An Introduction to Population Genetics Theory: Blackburn Press.

      Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, et al. 2021. Twelve years of SAMtools and BCFtools. Gigascience 10.

      Datson NA, Morsink MC, Atanasova S, Armstrong VW, Zischler H, Schlumbohm C, Dutilh BE, Huynen MA, Waegele B, Ruepp A, et al. 2007. Development of the first marmoset-specific DNA microarray (EUMAMA): a new genetic tool for large-scale expression profiling in a non-human primate. Bmc Genomics 8:190.

      Der R, Epstein CL, Plotkin JB. 2011. Generalized population models and the nature of genetic drift. Theoretical Population Biology 80:80-99.

      Dover G. 1982. Molecular drive: a cohesive mode of species evolution. Nature 299:111-117.

      Eickbush TH, Eickbush DG. 2007. Finely orchestrated movements: evolution of the ribosomal RNA genes. Genetics 175:477-485.

      Galtier N, Duret L. 2007. Adaptation or biased gene conversion? Extending the null hypothesis of molecular evolution. Trends in Genetics 23:273-277.

      Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, Remington KA, Strausberg RL, Venter JC, Wilson RK, et al. 2007. Evolutionary and Biomedical Insights from the Rhesus Macaque Genome. Science 316:222-234.

      Guarracino A, Buonaiuto S, de Lima LG, Potapova T, Rhie A, Koren S, Rubinstein B, Fischer C, Abel HJ, Antonacci-Fulton LL, et al. 2023. Recombination between heterologous human acrocentric chromosomes. Nature 617:335-343.

      Hartl DL, Clark AG, Clark AG. 1997. Principles of population genetics: Sinauer associates Sunderland.

      Hori Y, Shimamoto A, Kobayashi T. 2021. The human ribosomal DNA array is composed of highly homogenized tandem clusters. Genome Res 31:1971-1982.

      Jeffreys AJ, Neumann R. 2002. Reciprocal crossover asymmetry and meiotic drive in a human recombination hot spot. Nat Genet 31:267-271.

      Karlin S, McGregor J. 1964. Direct Product Branching Processes and Related Markov Chains. Proceedings of the National Academy of Sciences 51:598-602.

      Keane TM, Goodstadt L, Danecek P, White MA, Wong K, Yalcin B, Heger A, Agam A, Slater G, Goodson M, et al. 2011. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477:289-294.

      Krystal M, D'Eustachio P, Ruddle FH, Arnheim N. 1981. Human nucleolus organizers on nonhomologous chromosomes can share the same ribosomal gene variants. Proceedings of the National Academy of Sciences of the United States of America 78:5744-5748.

      Meunier J, Duret L. 2004. Recombination drives the evolution of GC-content in the human genome. Molecular Biology and Evolution 21:984-990.

      Nagylaki T. 1983. Evolution of a large population under gene conversion. Proc Natl Acad Sci U S A 80:5941-5945.

      Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, et al. 2022. The complete sequence of a human genome. Science 376:44-53.

      Ohta T. 1985. A model of duplicative transposition and gene conversion for repetitive DNA families. Genetics 110:513-524.

      Ohta T. 1976. Simple model for treating evolution of multigene families. Nature 263:74-76.

      Ohta T, Dover GA. 1984. The Cohesive Population Genetics of Molecular Drive. Genetics 108:501-521.

      Ohta T, Dover GA. 1983. Population genetics of multigene families that are dispersed into two or more chromosomes. Proc Natl Acad Sci U S A 80:4079-4083.

      Ruan Y, Wang X, Hou M, Diao W, Xu S, Wen H, Wu C-I. 2024. Resolving Paradoxes in Molecular Evolution: The Integrated WF-Haldane (WFH) Model of Genetic Drift. bioRxiv:2024.2002.2019.581083.

      Smirnov E, Chmúrčiaková N, Liška F, Bažantová P, Cmarko D. 2021. Variability of Human rDNA. Cells 10.

      Smith GP. 1976. Evolution of Repeated DNA Sequences by Unequal Crossover. Science 191:528-535.

      Smith GP. 1974a. Unequal crossover and the evolution of multigene families. Cold Spring Harbor symposia on quantitative biology 38:507-513.

      Smith GP. 1974b. Unequal Crossover and the Evolution of Multigene Families.  38:507-513.

      Stults DM, Killen MW, Pierce HH, Pierce AJ. 2008. Genomic architecture and inheritance of human ribosomal RNA gene clusters. Genome Res 18:13-18.

      van Sluis M, Gailín M, McCarter JGW, Mangan H, Grob A, McStay B. 2019. Human NORs, comprising rDNA arrays and functionally conserved distal elements, are located within dynamic chromosomal regions. Genes Dev 33:1688-1701.

      Wall JD, Frisse LA, Hudson RR, Di Rienzo A. 2003. Comparative linkage-disequilibrium analysis of the beta-globin hotspot in primates. Am J Hum Genet 73:1330-1340.

    1. Author Response

      The following is the authors’ response to the current reviews.

      Responses to the reviewers

      We thank the editor and reviewers for their insightful feedback and valuable suggestions on our revised manuscript. In this reply, we provided further clarifications and made changes accordingly. Reviewers’ comments are in bold, and our responses are immediately below. Changes in the main text are presented in italics, accompanied by the specific line numbers in the revised manuscript where these changes can be found. Below, we respond to each reviewer’s comments in turn.

      Reviewer #1 (Public Review):

      Ps observed 24 objects and were asked which afforded particular actions (14 action types). Affordances for each object were represented by a 14-item vector, values reflecting the percentage of Ps who agreed on a particular action being afforded by the object. An affordance similarity matrix was generated which reflected similarity in affordances between pairs of objects. Two clusters emerged, reflecting correlations between affordance ratings in objects smaller than body size and larger than body size. These clusters did not correlate themselves. There was a trough in similarity ratings between objects ~105 cm and ~130 cm, arguably reflecting the body size boundary. The authors subsequently provide some evidence that this clear demarcation is not simply an incidental reflection of body size, but likely causally related. This evidence comes in the flavour of requiring Ps to imagine themselves as small as a cat or as large as an elephant and showing a predicted shift in the affordance boundary. The manuscript further demonstrates that ChatGPT (theoretically interesting because it's trained on language alone without sensorimotor information; trained now on words rather than images) showed a similar boundary.

      The authors also conducted a small MRI study task where Ps decide whether a probe action was affordable (graspable?) and created a congruency factor according to the answer (yes/no). There was an effect of congruency in posterior fusiform and superior parietal lobule for objects within body size range, but not outside. No effects in LOC or M1.

      The major strength of this manuscript in my opinion is the methodological novelty. I felt the correlation matrices were a clever method for demonstrating these demarcations, the imagination manipulation was also exciting, and the ChatGPT analysis provided excellent food for thought. These findings are important for our understanding of the interactions between action and perception, and hence for researchers from a range of domains of cognitive neuroscience.

      The major element that limits conclusions is that an MRI study with 12 P in this context can really only provide pilot data. Certainly the effects are not strong enough for 12 P to generate much confidence. The others of my concerns have been addressed in the revision.

      Reviewer #1 (Recommendations For The Authors):

      I think that the authors need to mention in the abstract that the MRI study constitutes a small pilot.

      Response: We appreciate the reviewer’s positive evaluation and constructive suggestions. In response to the concern about the limited number of participants in the fMRI study, we fully acknowledge the implications this has on the generalizability and robustness of our findings related to the congruency effect. To clarity, we have explicitly stated its preliminary nature of the MRI study in the abstract [line 22]: “A subsequent fMRI experiment offered preliminary evidence of affordance processing exclusively for objects within the body size range, but not for those beyond.”

      Reviewer #2 (Public Review):

      Summary

      In this work, the authors seek to test a version of an old idea, which is that our perception of the world and our understanding of the objects in it are deeply influenced by the nature of our bodies and the kinds of behaviours and actions that those objects afford. The studies presented here muster three kinds of evidence for a discontinuity in the encoding of objects, with a mental "border" between objects roughly of human body scale or smaller, which tend to relate to similar kinds of actions that are yet distinct from the kinds of actions implied by human-or-larger scale objects. This is demonstrated through observers' judgments of the kinds of actions different objects afford; through similar questioning of AI large-language models (LLMs); and through a neuroimaging study examining how brain regions implicated in object understanding make distinctions between kinds of objects at human and larger-than-human scales.

      Strengths 

      The authors address questions of longstanding interest in the cognitive neurosciences -- namely how we encode and interact with the many diverse kinds of objects we see and use in daily life. A key strength of the work lies in the application of multiple approaches. Examining the correlations among kinds of objects, with respect to their suitability for different action kinds, is novel, as are the complementary tests of judgments made by LLMs. The authors include a clever manipulation in which participants are asked to judge action-object pairs, having first adopted the imagined size of either a cat or an elephant, showing that the discontinuity in similarity judgments effectively moved to a new boundary closer to the imagined scale than the veridical human scale. The dynamic nature of the discontinuity hints that action affordances may be computed dynamically, "on the fly", during actual action behaviours with objects in the real world.

      Weaknesses 

      A limitation of the tests of LLMs may be that it is not always known what kinds of training material was used to build these models, leading to a possible "black box" problem. Further, presuming that those models are largely trained on previous human-written material, it may not necessarily be theoretically telling that the "judgments" of these models about action-object pairs shows human-like discontinuities. Indeed, verbal descriptions of actions are very likely to mainly refer to typical human behaviour, and so the finding that these models demonstrate an affordance discontinuity may simply reflect those statistics, rather than providing independent evidence for affordance boundaries.

      The relatively small sample size of the brain imaging experiment, and some design features (such as the task participants performed, and the relatively narrow range of objects tested) provide some limits on the extent to which it can be taken as support for the authors' claims.

      Response: We thank the reviewer for the positive evaluation and the constructive comments. We agree that how LLMs work is a “black box”, and thus it is speculative to assume them to possess any human-like ability, because, as the reviewer pointed out, “these models demonstrate an affordance discontinuity may simply reflect those statistics.” Indeed, our manuscript has expressed a similar idea [line 338]: “We speculated that ChatGPT models may have formed the affordance boundary through a human prism ingrained within its linguistic training corpus.” That is, our intention was not to suggest that such information could replace sensorimotor-based interaction or achieve human-level capability, but rather to highlight that embodied interaction is necessary. Additionally, the scope of the present study does not extend to elucidating the mechanisms behind LLMs’ resemblance of affordance boundary, whether through statistical learning or actual comprehension. To clarify this point, in the revised manuscript, we have clarified that the mechanisms underlying the observed affordance boundary in LLMs may be different from human cognitive processes, and advocated future studies to explore this possibility [line 415]: “Nevertheless, caution should be taken when interpreting the capability of LLMs like ChatGPT, which are often considered “black boxes.” That is, our observation indicates that certain sensorimotor information is embedded within human language materials presumably through linguistic statistics, but it is not sufficient to assert that LLMs have developed a human-like ability to represent affordances. Furthermore, such information alone may be insufficient for LLMs to mimic the characteristics of the affordance perception in biological intelligence. Future studies are needed to elucidate such limitation.”

      Regarding the concern about the models’ results not “providing independent evidence for affordance boundaries”, our objective in employing LLMs was to explore if an affordance boundary could emerge from conceptual knowledge without direct sensorimotor experience, rather than to validate the existence of the affordance boundary per se.

      As for the concern about the limitations imposed by the small sample size and certain design features of our brain imaging experiment, please see our reply to Reviewer #1.

      Reviewer #3 (Public Review):

      Summary:

      Feng et al. test the hypothesis that human body size constrains the perception of object affordances, whereby only objects that are smaller than the body size will be perceived as useful and manipulable parts of the environment, whereas larger objects will be perceived as "less interesting components."

      To test this idea, the study employs a multi-method approach consisting of three parts:

      In the first part, human observers classify a set of 24 objects that vary systematically in size (e.g., ball, piano, airplane) based on 14 different affordances (e.g., sit, throw, grasp). Based on the average agreement of ratings across participants, the authors compute the similarity of affordance profiles between all object pairs. They report evidence for two homogenous object clusters that are separated based on their size with the boundary between clusters roughly coinciding with the average human body size. In follow-up experiments, the authors show that this boundary is larger/smaller in separate groups of participants who are instructed to imagine themselves as an elephant/cat.

      In the second part, the authors ask different large language models (LLMs) to provide ratings for the same set of objects and affordances and conduct equivalent analyses on the obtained data. Some, but not all, of the models produce patterns of ratings that appear to show similar boundary effects, though less pronounced and at a different boundary size than in humans.

      In the third part, the authors conduct an fMRI experiment. Human observers are presented with four different objects of different sizes and asked if these objects afford a small set of specific actions. Affordances are either congruent or incongruent with objects. Contrasting brain activity on incongruent trials against brain activity on congruent trials yields significant effects in regions within the ventral and dorsal visual stream, but only for small objects and not for large objects.

      The authors interpret their findings as support for their hypothesis that human body size constrains object perception. They further conclude that this effect is cognitively penetrable, and only partly relies on sensorimotor interaction with the environment (and partly on linguistic abilities).

      Strengths:

      The authors examine an interesting and relevant question and articulate a plausible (though somewhat underspecified) hypothesis that certainly seems worth testing. Providing more detailed insights into how object affordances shape perception would be highly desirable. Their method of analyzing similarity ratings between sets of objects seems useful and the multi-method approach is original and interesting.

      Weaknesses:

      The study presents several shortcomings that clearly weaken the link between the obtained evidence and the drawn conclusions. Below I outline my concerns in no particular order:

      (1) It is not entirely clear to me what the authors are proposing and to what extent the conducted work actually speaks to this. For example, in the introduction, the authors write that they seek to test if body size serves not merely as a reference for object manipulation but also "plays a pivotal role in shaping the representation of objects." This motivation seems rather vague motivation and it is not clear to me how it could be falsified.

      Overall, the lack of theoretical precision makes it difficult to judge the appropriateness of the approaches and the persuasiveness of the obtained results. I would strongly suggest clarifying the theoretical rationale and explaining in more detail how the chosen experiments allow them to test falsifiable predictions.

      (2) The authors used only a very small set of objects and affordances in their study and they do not describe in sufficient detail how these stimuli were selected. This renders the results rather exploratory and clearly limits their potential to discover general principles of human perception. Much larger sets of objects and affordances and explicit data-driven approaches for their selection would provide a more convincing approach and allow the authors to rule out that their results are just a consequence of the selected set of objects and actions.

      (3) Relatedly, the authors could be more thorough in ruling out potential alternative explanations. Object size likely correlates with other variables that could shape human similarity judgments and the estimated boundary is quite broad (depending on the method, either between 80 and 150 cm or between 105 to 130 cm). More precise estimates of the boundary and more rigorous tests of alternative explanations would add a lot to strengthen the authors' interpretation.

      (4) While I appreciate the manipulation of imagined body size, as a clever way to solidify the link between body size and affordance perception, I find it unfortunate that it is implemented in a between-subjects design, as this clearly leaves open the possibility of pre-existing differences between groups. I certainly disagree with the authors' statement that their findings suggest "a causal link between body size and affordance perception."

      (5) The use of LLMs in the current study is not clearly motivated and I find it hard to understand what exactly the authors are trying to test through their inclusion. As it currently stands, I find it hard to discern how the presence of perceptual boundaries in LLMs could constitute evidence for affordance-based perception.

      (6) Along the same lines, the fMRI study also provides little evidence to support the authors' claims. The use of congruency effects as a way of probing affordance perception is not well motivated. Importantly (and related to comment 2 above), the very small set of objects and affordances in this experiment heavily complicates any conclusions about object size being the crucial variable determining the occurrence of congruency effects.

      Overall, I consider the main conclusions of the paper to be far beyond the reported data. Articulating a clearer theoretical framework with more specific hypotheses as well as conducting more principled analyses on more comprehensive data sets could help the authors obtain stronger tests of their ideas.

      Response: We appreciate the insightful inquiries regarding our manuscript. Below, we explained the theoretical motivation and rationale of each part of our experiments.

      In response to the reviewer’s insights, we have modified the expression “plays a pivotal role in shaping the representation of objects” in the revised manuscript and have restated the general question of our study in the introduction. Our motivation is on the long-lasting debate over the representation versus direct perception of affordance, specifically examining the “representationalization” of affordance. That is, we tested whether object affordance simply covaried directly with continuous constraints such as object size, a perspective aligned with the representation-free (direct perception) view, or whether affordance became representationalized, adhering to the representation-based view, constrained by body size. Such representationalization would generate a categorization between objects that are affordable and the environment that exceeds affordance.

      To test these hypotheses, we first delineated the affordance of various objects. We agree with the reviewer that in this step a broader selection of objects and actions could mitigate the risk of our results being influenced by the specific selection of objects and actions. However, our results are unlikely to be biased, because our selection was guided by two key criteria, rather than being arbitrary. First, the objects were selected from the dataset in Konkle and Oliva's study (2011), which systematically investigated object size’ impact on object recognition, thus providing a well-calibrated range of sizes (i.e., from 14 cm to 7,618 cm) reflective of real-world objects. Second, the selected actions covered a wide range of daily humans-objects/environments interactions, from single-point movements (e.g., hand, foot) to whole-body movements (e.g., lying, standing) based on the kinetics human action video dataset (Kay et al., 2017). Thus, this set of objects and actions is a representative sampling of typical human experiences.

      Upon demonstrating a trough in perceived affordance similarity, we recognized the location of the affordance boundary coincidentally fell within the range of human body size. We agree with the reviewer that this observation of the coincidence between body size and the location of boundary alone is not sufficient for a mechanistic explanation, because variables co-varying with object sizes might also generate this coincidence. The identification of a more precise location for the boundary unlikely rules out alternative explanations of this kind. To establish a causal link between body size and the affordance boundary, we opted for a direct manipulation of body sizes through imagination, while keeping all other variables constant across conditions. This approach allowed us to examine whether and how the affordance boundary shifts in response to body size changes.

      Regarding the between-subjects design of the imagination experiment, we wish to clarify that this design aimed to prevent carryover effects. Although a within-subjects design indeed is more sensitive in detecting manipulation effects by accounting for subject variability, it risks contamination across conditions. Specifically, transitioning immediately between different imagined body sizes poses a challenge, and sequential participation could induce undesirable response strategies, such as deliberately altering responses to the same objects in different conditions. The between-subjects design, which susceptible to participant variability (e.g., “pre-existing differences between groups” suggested by the reviewer), avoids such contamination. In addition, we employed random assignment of participants to different conditions (cat-size versus elephant-size).

      The body imagination experiment provided causal evidence of an embodied discontinuity, suggesting the boundary is tied to the agent’s motor capacity, rather than amodal sources. The LLMs experiment then sought to test a prediction from the embodied theories of cognition: the supramodality of object perception. Especially, we asked whether the embodied discontinuity is supramodally accessible, using LLMs to assess whether affordance perception discretization is supramodally accessible beyond the sensorimotor domain through linguistic understanding. From this perspective, our LLM experiment was employed not to affirm affordance-based perception but to examine and support a prediction by the embodied theories of cognition.

      Finally, our preliminary fMRI study aimed to conceptually replicate the perceptual discontinuity and explore it neural correlates using a subset of objects and actions from the behaviour experiments. This approach was chosen to achieve stable neural responses and enhance study power, employing the congruent effect (congruent - incongruent) as a metric for affordance processing (e.g., Kourtis et al., 2018), which reflects facilitated responses when congruent with objects’ affordances (e.g., Ellis & Tucker, 2000). Nevertheless, we recognize the limitation of a relatively small sample sizes, for details please see our reply to the reviewer #1.

      In summary, our findings contribute to the discourse on computationalism’s representation concept and influence of these representations, post-discretization, on processes beyond the sensorimotor domain. We hope that these additional explanations and revisions effectively address the concerns raised and demonstrate our commitment to enhancing the quality of our work in light of your valuable feedback. By acknowledging these limitations and directions for future research, we hope to further the discourse on affordance perception and embodied cognition.

      References

      Ellis, R., & Tucker, M. (2000). Micro‐affordance: The potentiation of components of action by seen objects. British Journal of Psychology, 91(4), 451-471.

      Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., ... & Zisserman, A. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

      Konkle, T., & Oliva, A. (2011). Canonical visual size for real-world objects. Journal of Experimental Psychology: human perception and performance, 37(1), 23.

      Kourtis, D., Vandemaele, P., & Vingerhoets, G. (2018). Concurrent cortical representations of function-and size-related object affordances: an fMRI study. Cognitive, Affective, & Behavioral Neuroscience, 18, 1221-1232.


      The following is the authors’ response to the original reviews.

      Responses to the reviewers

      We deeply appreciate the reviewers’ comments. In response to the concerns raised, we have revised the manuscript accordingly. Below we address each of the reviewers’ comments in turn. Reviewers’ comments are in bold, and our responses are immediately below. Changes in the main text are presented in italics, followed by corresponding page and line numbers in the revised manuscript. We also highlighted tracks of change in the revised manuscript.

      Reviewer #1 (Public Review):

      (1) The main behavioural work appears well-powered (>500 Ps). This sample reduces to 100 for the imagination study, after removing Ps whose imagined heights fell within the human range (100-200 cm). Why 100-200 cm? 100 cm is pretty short for an adult. Removing 80% of data feels like conclusions from the imagination study should be made with caution.

      R1: Sorry for the confusion. We did not remove 80% of the participants; instead, a separate sample of participants was recruited in the imagination experiment. The size of this sample (100 participants) was indeed smaller than the first experiment (528 participants), because the first experiment was set for exploratory purposes and was designed to be over-powered. Besides, inspection of the data of the first sample showed that the affordance pattern became stable after the first 50 participants. We explained this consideration in the revised manuscript:

      (p 21, ln 490) “…, another one hundred and thirty-nine participants from the same population were recruited from the same platform. We chose a smaller sample size for the imagination experiment compared to that for the object-action relation judgement task, because inspection of the data of the first sample showed that the affordance pattern became stable after the first 50 participants.”

      The average adult human height ranges from 140-170 cm for women and 150180 cm for men (NCD-RisC, 2016). Accordingly, the criterion of 100-200 cm covered this range and was set to ensure that participants unambiguously imagined a body schema different from that of human, as the tallest domestic cat below 100 cm according to the Guinness World Records and an elephant above 200 cm according to Crawley et al. (2017). We clarified these considerations in the revised manuscript:

      (p 21, ln 494) “To maximize the validity of the manipulation, data from participants whose imagined height fell within the average human size range (100cm - 200cm) were excluded from further analysis. Consequently, 100 participants (49 males, aged from 17 to 39 years, mean age = 23.2 years) remained in the analysis. This exclusion criterion was broader than the standard adult human height range of 140cm to 180cm (NCD-RisC, 2016). This approach ensured that our analysis focused on participants who unambiguously imagined a body schema different from humans, yet within the known height range of cats and elephants.”

      In addition, we also reanalysed the data with a more conservative criterion of 140cm to 180cm, and the results remained.

      (2) There are only 12 Ps in the MRI study, which I think should mean the null effects are not interpreted. I would not interpret these data as demonstrating a difference between SPL and LOC/M1, but rather that some analyses happened to fall over the significance threshold and others did not.

      R2: We would like to clarify that the null hypothesis of this fMRI study is the lack of two-way interaction between object size and object-action congruency, which was rejected by the observed significant interaction. That is, the interpretation of the present study did not rely on accepting any null effect.

      Having said this, we admit that the fMRI experiment is exploratory and the sample size is small (12 participants), which might lead to low power in estimating the affordance effect. In the revision, we acknowledge this issue explicitly:

      (p 16, ln 354) “…, supporting the idea that affordance is typically represented only for objects within the body size range. While it is acknowledged that the sample size of the fMRI study was small (12 participants), necessitating cautious interpretation of its results, the observed neural-level affordance discontinuity is notable. That is, qualitative differences in neural activity between objects within the affordance boundary and those beyond replicated our behavioral findings. This convergent evidence reinforced our claim that objects were discretized into two broad categories along the continuous size axis, with affordance only being manifested for objects within the boundary.”

      (3) I found the MRI ROI selection and definition a little arbitrary and not really justified, which rendered me even more cautious of the results. Why these particular sensory and motor regions? Why M1 and not PMC or SMA? Why SPL and not other parietal regions? Relatedly, ROIs were defined by thresholding pF and LOC at "around 70%" and SPL and M1 "around 80%", and it is unclear how and why these (different) thresholds were determined.

      R3: Our selection of these specific sensory and motor regions was based on prior literature reporting their distinct contribution to affordance perception (e.g., Borghi, 2005; Sakreida et al., 2016). The pFs was chosen as a representative region of the ventral visual stream, involved in object identification and classification, and the SPL was chosen as a representative region of the dorsal visual stream, involved in object perception and manipulation. The primary motor cortex (M1) has also been reported involved in affordance processing (e.g., McDannald et al., 2018), and we chose this region to probe the affordance congruency effect in the motor execution stage of the sense-think-act pathway. We did not choose the premotor cortex (PMC) and the supplementary motor area (SMA) because they were proposedly also involved in processes beyond motor execution (e.g., Hertrich et al., 2016; Kantak et al., 2012), and if any effect was observed, one cannot exclusively attribute the effect to motor execution. As for the parietal regions, our choice of the SPL not IPL/IPS is based on the meta-analysis of affordance processing areas where only the SPL shows consistent activation for both stable and variable affordances (Sakreida et al., 2016). We chose the SPL to capture effects on either type of affordances. In revision, we explained these considerations in the revised manuscript:

      (p 14, ln 280) “In addition to the pFs and SPL, we also examined the congruency effect in the lateral occipital cortex (LO), which is involved in object representation (e.g., Grill-Spector et al., 2000; Konkle & Caramazza, 2013) and provides inputs to both the pFs and SPL (Hebart et al., 2018). Meanwhile, the primary motor cortex (M1), which receives inputs from the dorsal stream (Vainio & Ellis, 2020), is involved in affordance processing (e.g., McDannald et al., 2018) and action executions (Binkofski et al., 2002).”

      (p 29, ln 684) “We chose the pFs, LO, SPL, and M1 as ROIs based on existing literature highlighting their distinct contributions to affordance perception (Borghi, 2005; Sakreida et al., 2016).”

      Regarding ROI thresholding, we apologize for the lack of clarity in reporting the thresholds in the original manuscript. The thresholds were different between ventral regions (from Zhen et al., 2015) and dorsal regions (from Fan et al., 2016) because they are from two different atlases. The former was constructed by probability maps of task-state fMRI activity during localizer contrast with stationary images and the latter by a parcellation of the brain's functional connectivity; therefore, the numerical values in these two atlases are not comparable. To extract ROIs with comparable sizes, we selected a threshold of 55% for the pFs, 90% for the LO, 78% for the SPL, and 94% for the M1 in the original manuscript.

      To rule out the possibility that the results were distorted by the specific choice of thresholds, we re-ran the analysis with a threshold 80% for all ROIs (resulting in 456 voxels in the lpFs, 427 voxels in the rpFs, 1667 voxels in the lLO, 999 voxels in the rLO, 661 voxels in the lSPL, 310 voxels in the rSPL, 231 voxels in the lM1, and 327 voxels in the rM1) with the 2-by-2 repeated-measures ANOVA. Our results remained the same qualitatively. A significant interaction between object type and congruency was observed in the pFs (F(1,11) = 24.87, p <.001, 𝜂2=.69) and SPL (F(1,11) = 14.62, p =.003, 𝜂2=.57). The simple effect analysis revealed the congruency effect solely for objects within body size range (pFs: p =.003; SPL: p <.001), not for objects beyond (ps >.30). For the M1 and LO, neither significant main effects (ps >.11) nor interactions were found (ps >.20).

      We clarified our choice of thresholds in the methods section in the revised manuscript:

      (p 29, ln 686) “Eight ROIs depicted in Fig. 3b were constructed based on the overlap between the whole-brain map activated by both objects within and beyond and corresponding functional atlases (the pFs and LO from Zhen et al., 2015; the SPL and M1 from Fan et al., 2016). To achieve ROIs of similar sizes, we applied varying thresholds to each cortical area: for the pFs and LO, the atlases were thresholded at 55% and 90%, resulting in 266 voxels in the lpFs, 427 in the rpFs, 254 in the lLO and 347 in the rLO; for the SPL and M1, the atlases were thresholded at 78% and 94%, resulting in 661 voxels in the lSPL, 455 in the rSPL, 378 in the lM1, and 449 in the rM1. In the subsequent analysis, homologous areas spanning both cortical hemispheres were merged.”

      (4) Discussion and theoretical implications. The authors discuss that the MRI results are consistent with the idea we only represent affordances within body size range. But the interpretation of the behavioural correlation matrices was that there was this similarity also for objects larger than body size, but forming a distinct cluster. I therefore found the interpretation of the MRI data inconsistent with the behavioural findings.

      R4: We speculated that the similarity in action perception among objects beyond the body size range may be due to these objects being similarly conceptualized as ‘environment’, in contrast to the objects within the body size range, which are categorized differently, namely as the ‘objects for the animal.’ Accordingly, in cortical regions involved in object processing, objects conceptualized as ‘environment’ unlikely showed the congruency effect, distinct from objects within the body size range. We have explained this point in the revised manuscript:

      (p 17, ln 370) “…which resonates the embodied influence on the formation of abstract concepts (e.g., Barsalou, 1999; Lakoff & Johnson, 1980) of objects and environment. Consistently, our fMRI data did not show the congruency effect for objects beyond the body size range, distinct from objects within this range, suggesting a categorization influenced by objects’ relative size to the human body.”

      (5) In the discussion, the authors outline how this work is consistent with the idea that conceptual and linguistic knowledge is grounded in sensorimotor systems. But then reference Barsalou. My understanding of Barsalou is the proposition of a connectionist architecture for conceptual representation. I did not think sensorimotor representation was privileged, but rather that all information communicates with all other to constitute a concept.

      R5: We are sorry for the confusion. We do not intend to argue that the sensorimotor representation is privileged. Instead, we would like to simply emphasize their engagement in concept. According to our understanding, Barsalou’s Perceptual Symbol Theory proposes that grounded concepts include sensorimotor information, and conceptual knowledge is grounded in the same neural system that supports action (Barsalou, 1999). This is consistent with our proposal that the affordance boundary locked to an animal’s sensorimotor capacity might give rise to a conceptual-ish representation of object-ness specific to the very animal. We have clarified this point in the introduction and discussion on the conceptual knowledge and sensorimotor information:

      In the introduction (p 2, ln 59) “…, and the body may serve as a metric that facilitates meaningful engagement with the environment by differentiating objects that are accessible for interactions from those not. Further, grounded cognition theory (see Barsalou, 2008 for a review) suggests that the outputs of such differentiation might transcend sensorimotor processes and integrate into supramodal concepts and language. From this perspective, we proposed two hypotheses...”

      In the discussion (p 18, ln 392) “Indeed, it has been proposed that conceptual knowledge is grounded in the same neural system that supports action (Barsalou, 1999; Glenberg et al., 2013; Wilson & Golonka, 2013), thereby suggesting that sensorimotor information, along with other modal inputs, may be embedded in language (e.g., Casasanto, 2011; Glenberg & Gallese, 2012; Stanfield & Zwaan, 2001), as the grounded theory proposed (see Barsalou, 2008 for a review).”

      (6) More generally, I believe that the impact and implications of this study would be clearer for the reader if the authors could properly entertain an alternative concerning how objects may be represented. Of course, the authors were going to demonstrate that objects more similar in size afforded more similar actions. It was impossible that Ps would ever have responded that aeroplanes afford grasping and balls afford sitting, for instance. What do the authors now believe about object representation that they did not believe before they conducted the study? Which accounts of object representation are now less likely?

      R6: We thank the reviewer for this suggestion. The theoretical motivation of the present study is to explore whether, for continuous action-related physical features (such as object size relative to the agents), affordance perception introduces discontinuity and qualitative dissociation, i.e., to allow the sensorimotor input to be assigned into discrete states/kinds, as representations envisioned by the computationalists; alternatively, whether the activity may directly mirror the input, free from discretization/categorization/abstraction, as proposed by the Replacement proposal of some embodied theories on cognition.

      By addressing this debate, we hoped to shed light on the nature of representation in, and resulted from, the vision-for-action processing. Our finding of affordance discontinuity suggests that sensorimotor input undergoes discretization implied in the computationalism idea of representation. Further, not contradictory to the claims of the embodied theories, these representations do shape processes out of the sensorimotor domain, but after discretization.

      We have now explained our hypotheses and alternatives explicitly in the revised introduction and discussion:

      In the introduction (p 2, ln 45) “However, the question of how object perception is influenced by the relative size of objects in relation to the human body remains open. Specifically, it is unclear whether this relative size simply acts as a continuous variable for locomotion reference, or if it affects differentiating and organizing object representation based on their ensued affordances.”

      In the discussion (p 14, ln 295) “One long-lasting debate on affordance centers on the distinction between representational and direct perception of affordance. An outstanding theme shared by many embodied theories of cognition is the replacement hypothesis (e.g., Van Gelder, 1998), which challenges the necessity of representation as posited by computationalism’s cognitive theories (e.g., Fodor, 1975). This hypothesis suggests that input is discretized/categorized and subjected to abstraction or symbolization, creating discrete stand-ins for the input (e.g., representations/states). Such representationalization would lead to a categorization between the affordable (the objects) and those beyond affordance (the environment), in contrast to the perspective offered by embodied theories. The present study probed this ‘representationalization’ of affordance by examining whether affordance perception introduces discontinuity and qualitative dissociation in response to continuous action-related physical features (such as object size relative to the agents), which allows sensorimotor input to be assigned into discrete states/kinds, in line with the representation-based view under the constraints of body size. Alternatively, it assessed whether activity directly mirrors the input, free from discretization/categorization/abstraction, in line with the representation-free view.

      First, our study found evidence demonstrating discretization in affordance perception. Then, through the body imagination experiment, we provided causal evidence suggesting that this discretization originates from sensorimotor interactions with objects rather than amodal sources, such as abstract object concepts independent of agent motor capability. Finally, we demonstrated the supramodality of this embodied discontinuity by leveraging the recent advances in AI. We showed that the discretization in affordance perception is supramodally accessible to disembodied agents such as large language models (LLMs), which lack sensorimotor input but can access linguistic materials built upon discretized representations. These results collectively suggest that sensorimotor input undergoes discretization, as implied in the computationalism’s idea of representation. Note that, these results are not contradictory to the claim of the embodied theories, as these representations do shape processes beyond the sensorimotor domain but after discretization.

      This observed boundary in affordance perception extends the understanding of the discontinuity in perception in response to the continuity of physical inputs (Harnad, 1987; Young et al., 1997).”

      Reviewer #1 (Recommendations For The Authors):

      a) I would recommend providing further justification for why 100-200 cm were used as the cut-offs reflecting acceptable imagined body size. Were these decisions preregistered anywhere? If so, please state.

      Ra: Please see R1.

      b) I would encourage the authors to call the MRI a small pilot study throughout, including in the abstract.

      Rb: We completely agree and have indicated the preliminary nature of this study in the revised version:

      (p 11, ln 236) “To test this speculation, we ran an fMRI experiment with a small number of participants to preliminarily investigate the neural basis of the affordance boundary in the brain by measuring neural activity in the dorsal and ventral visual streams when participants were instructed to evaluate whether an action was affordable by an object (Fig. 3a).”

      c) Please provide much further justification of ROI selection, why these thresholds were chosen, and therefore why they are different across regions.

      Rc: Please see R3.

      d) Further elucidation in the discussion would help the reader interpret the MRI data, which should always be interpreted also in light of the behavioural findings.

      Rd: Please see R4.

      e) The authors may wish to outline precisely what they claim concerning the nature of conceptual/linguistic representation. Is sensorimotor information privileged or just part of the distributed representation of concepts?

      Re: This is a great point. For details of corresponding revision, please see R5.

      f) There are some nods to alternative manners in which we plausibly represent objects (e.g. about what the imagination study tells us) but I think this theoretical progression should be more prominent.

      Rf: We thank the reviewer for this suggestion. For details of corresponding revision, please see R6.

      Reviewer #2 (Public Review):

      (1) A limitation of the tests of LLMs may be that it is not always known what kinds of training material was used to build these models, leading to a possible "black box" problem. Further, presuming that those models are largely trained on previous human-written material, it may not necessarily be theoretically telling that the "judgments" of these models about action-object pairs show human-like discontinuities. Indeed, verbal descriptions of actions are very likely to mainly refer to typical human behaviour, and so the finding that these models demonstrate an affordance discontinuity may simply reflect those statistics, rather than evidence that affordance boundaries can arise independently even without "organism-environment interactions" as the authors claim here.

      R1: We agree that how LLMs work is a “black box”, and thus it is speculative to assume them to possess any human-like ability, because, as the reviewer pointed out, “these models demonstrate an affordance discontinuity may simply reflect those statistics.” Indeed, our manuscript has expressed a similar idea: “We speculated that ChatGPT models may have formed the affordance boundary through a human prism ingrained within its linguistic training corpus. (p 16 ln 338)”. That is, we did not intend to claim that such information is sufficient to replace sensorimotor-based interaction, or to restore human-level capability, for which we indeed speculated that embodied interaction is necessary. In the revised manuscript, we have clarified our stand that the mechanism generating the observed affordance boundary in LLMs might be different from that in human cognition, and urged future studies to explore this possibility:

      (p 18, ln 413) “…, as well as alignment methods used in fine-tuning the model (Ouyang et al., 2022). Nevertheless, caution should be taken when interpreting the capabilities of LLMs like ChatGPT, which are often considered “black boxes.” That is, our observation indicates that some degree of sensorimotor information is embedded within human language materials presumably through linguistic statistics, but it is not sufficient to assert that LLMs have developed a human-like ability to represent affordances. Furthermore, such information alone may be insufficient for LLMs to mimic the characteristics of the affordance perception in biological intelligence. Future studies are needed to elucidate such limitation.”

      Indeed, because of this potential dissociation, our LLM study might bear novel implications for the development of AI agents. We elaborated on them in the revised discussion on LLMs:

      (p 19, ln 427) “…, represents a crucial human cognitive achievement that remains elusive for AI systems. Traditional AI (i.e., task-specific AI) has been confined with narrowly defined tasks, with substantial limitations in adaptability and autonomy. Accordingly, these systems have served primarily as tools for humans to achieve specific outcomes, rather than as autonomous agents capable of independently formulating goals and translating them into actionable plans. In recent years, significant efforts have been directed towards evolving traditional AI into more agent-like entities, especially in domains like navigation, object manipulation, and other interactions with the physical world. Despite these advancements, the capabilities of AI still fall behind human-level intelligence. On the other hand, embodied cognition theories suggest that sensorimotor interactions with the environment are foundational for various cognitive domains. From this point of view, endowing AI with human-level abilities in physical agent-environment interactions might provide an unreplaceable missing piece for achieving Artificial General Intelligence (AGI). This development would significantly facilitate AI’s role in robotics, particularly in actions essential for survival and goal accomplishment, a promising direction for the next breakthrough in AI (Gupta et al., 2021; Smith & Gasser, 2005).

      However, equipping a disembodied AI with the ability for embodied interaction planning within a specific environment remains a complex challenge. By testing the potential representationalization of action possibilities (affordances) in both humans and LLMs, the present study suggests a new approach to enhancing AI’s interaction ability with the environment. For instance, our finding of supramodal affordance representation may indicate a possible pathway for disembodied LLMs to engage in embodied physical interactions with their surroundings. From an optimistic view, these results suggest that LLM-based agents, if appropriately designed, may leverage affordance representations embedded in language to interact with the physical world. Indeed, by clarifying and aligning such representations with the physical constitutes of LLM-based agents, and even by explicitly constructing an agent-specific object space, we may foster the sensorimotor interaction abilities of LLM-based agents. This progression could lead to achieving animal-level interaction abilities with the world, potentially sparking new developments in the field of embodied cognition theories.”

      (2) The authors include a clever manipulation in which participants are asked to judge action-object pairs, having first adopted the imagined size of either a cat or an elephant, showing that the discontinuity in similarity judgments effectively moved to a new boundary closer to the imagined scale than the veridical human scale. The dynamic nature of the discontinuity suggests a different interpretation of the authors' main findings. It may be that action affordance is not a dimension that stably characterises the long-term representation of object kinds, as suggested by the authors' interpretation of their brain findings, for example. Rather these may be computed more dynamically, "on the fly" in response to direct questions (as here) or perhaps during actual action behaviours with objects in the real world.

      R2: We thank the reviewer for pointing out the dynamic nature of affordance perception in our study. This feature indeed reinforced our attribution of the boundary into an affordance-based process instead of a conceptual or semantic process, the latter of which would predict the action possibilities being a fixed belief about the objects, instead of being dynamically determined according to the feature of the agent-object dyads. In addition, this dynamic does not contradict with our interpretation of the observed boundary in affordance perception. With this observation, we speculated that continuous input was abstracted or representationalized into discontinued categories, and the boundary between these categories was drawn according to the motor capacity of the agent. The finding of the boundary adapting to manipulation on body schema suggests that the abstraction/representationalization dynamically updates according to the current belief of motor capacity and body schema of the animal. In addition, we agree that future studies are needed to examine the dynamics of the abstraction/representationalization of affordance, probably by investigating the evolvement of affordance representation during ongoing actual interactions with novel objects or manipulated motor capability. These points are now addressed in the revision:

      (p 17, ln 380) “Therefore, this finding suggests that the affordance boundary is cognitively penetrable, arguing against the directness of affordance perception (e.g., Gibson, 1979; Greeno, 1994; Prindle et al., 1980) or the exclusive sensorimotor origin of affordances (e.g., Gallagher, 2017; Thompson, 2010; Hutto & Myin, 2012; Chemero, 2013). Further, this finding that the boundary adapted to manipulation on body schema suggests that the abstraction/representationalization may be dynamically updated in response to the current motor capacity and body schema of the agent, suggesting that the affordance-based process is probably determined dynamically by the nature of the agent-object dyads, rather than being a fixed belief about objects. Future studies could explore the dynamics of affordance representationalization, probably by investigating how affordance representations evolve during active interactions with novel objects or under conditions of altered motor capabilities. Finally, our findings also suggest that disembodied conceptual knowledge pertinent to action likely modulates affordance perception.”

      Reviewer #2 (Recommendations For The Authors):

      a) As described, I think the authors could improve their discussion of the LLM work and consider more deeply possible different interpretations of their findings with those models. Are they really providing an independent data point about how objects may be represented, or instead is this a different, indirect way of asking humans the same questions (given the way in which these models are trained)?

      Ra: Please see R1.

      b) Some of the decisions behind the design of the fMRI experiment, and some of the logic of its interpretation, could be made clearer. Why those four objects per se? What kinds of confounds, such as familiarity, or the range of possible relevant actions per object, might need to be considered? Is there the possibility that relative performance on the in-scanner behavioural task may be in part responsible for the findings? Why were those specific regions of interest chosen and not others? The authors find that the dorsal and ventral regions make a univariate distinction between congruent and incongruent trials, but only for human-scale objects, but it was not clear from the framework that the authors adopted why that distinction should go in that direction (e.g. congruent > incongruent) nor why there shouldn't also be a distinction for the "beyond" objects? Finally, might some of these brain questions better be approached with an RSA or similar approach, as that would seem to better map onto the behavioural studies?

      Rb: We thank the reviewer for the detailed suggestions.

      Regarding the fMRI study, we have provided further justification on its rationale in the revised manuscript:

      (p 11, ln 231) “The distinct categories of reported affordances demarcated by the boundary imply that the objects on either side of the boundary may be represented differently in the brain. We thus speculated that the observed behavioral discontinuity is likely underpinned by distinct neural activities, which give rise to these discrete ‘representations’ separated by the boundary.”

      The objects used in the fMRI study were selected by taking into account the objective of the fMRI study, which was to provide the neural basis for the affordance discontinuity found in behaviour experiments. In other words, the fMRI study is not an exploratory experiment, but a validation experiment. To this end, we deliberately selected a small range of common objects to ensure that participants were sufficiently familiar with them, as confirmed through their oral reports. Furthermore, to ensure a fair comparison between the two categories of objects in terms of action possibility range, we predetermined an equal number of congruent and incongruent actions for each category. This arrangement was intended to eliminate any bias that might arise from different amount of action choices associated with each category. Therefore, the present object and action sets in the fMRI study, which were based on the behavior experiments, are sufficient for its purpose.

      Regarding the possibility that the performance of the in-scanner behavioural task may be in part responsible for the findings, we analysed participants’ performance. Not surprisingly, participants demonstrated high consistency and accuracy in their responses:

      𝑀𝑒𝑎𝑛𝐶𝑜𝑛𝑔𝑟𝑢𝑒𝑛𝑡_𝑂𝑏𝑗𝑒𝑐𝑡𝑊𝑖𝑡ℎ𝑖𝑛 = 0.991, SD = 0.018;

      𝑀𝑒𝑎𝑛𝐼𝑛𝑐𝑜𝑛𝑔𝑟𝑢𝑒𝑛𝑡_𝑂𝑏𝑗𝑒𝑐𝑡𝑊𝑖𝑡ℎ𝑖𝑛 = 0.996, SD = 0.007;

      𝑀𝑒𝑎𝑛𝐶𝑜𝑛𝑔𝑟𝑢𝑒𝑛𝑡_𝑂𝑏𝑗𝑒𝑐𝑡𝐵𝑒𝑦𝑜𝑛𝑑 = 0.996, SD = 0.004;

      𝑀𝑒𝑎𝑛𝐼𝑛𝑐𝑜𝑛𝑔𝑟𝑢𝑒𝑛𝑡𝑂𝑏𝑗𝑒𝑐𝑡𝐵𝑒𝑦𝑜𝑛𝑑 = 0.998, SD = 0.002

      in all conditions, suggesting constant active engagement with the task. Thus, the inscanner behaviour unlikely resulted in the lack of congruency effect for the ‘beyond’ objects observed in the brain.

      Regarding the selection of ROIs, our decision to focus on these specific sensory and motor regions was based on existing literature highlighting their distinct contribution to affordance perception (Borghi, 2005; Sakreida et al., 2016). The pFs was chosen for its role in object identification and classification, while the SPL was chosen for its involvement in object manipulation. Additionally, the primary motor cortex (M1) is known to be engaged in affordance processing (e.g., McDannald et al., 2018), which was included to investigate the affordance congruency effect during the motor execution stage of the sense-think-act pathway. These considerations are detailed in the revised manuscript:

      (p 14, ln 280) “In addition to the pFs and SPL, we also examined the congruency effect in the lateral occipital cortex (LO), which is involved in object representation (e.g., Grill-Spector et al., 2000; Konkle & Caramazza, 2013) and provides inputs to both the pFs and SPL (Hebart et al., 2018). Meanwhile, the primary motor cortex (M1), which receives inputs from the dorsal stream (Vainio & Ellis, 2020), is involved in affordance processing (e.g., McDannald et al., 2018) and action executions (Binkofski et al., 2002).”

      (p 29, ln 684) “We chose the pFs, LO, SPL, and M1 as ROIs based on existing literature highlighting their distinct contributions to affordance perception (Borghi, 2005; Sakreida et al., 2016).”

      Regarding the congruency effect, in our study, we followed the established fMRI research paradigm of employing the congruent effect as a measure of affordance processing (e.g., Kourtis et al., 2018), and the rationale behind the directionality of the distinction in our framework (congruent > incongruent) is grounded in the concept of affordance, in which the mere perception of a graspable object facilitates motor responses that are congruent with certain qualities of the object (e.g., Ellis & Tucker, 2000). From the interaction of congruency by object type, we observed only congruency effect for objects within rather than objects beyond. We speculate that the objects beyond the affordance boundary is generally beyond the motor capacities of the very animal, being too large for the animal to manipulate, thus no congruency effect was found. We have added these clarifications in the revised manuscript:

      (p 11, ln 244) “The congruency effect, derived from the contrast of Congruent versus Incongruent conditions, is a well-established measure of affordance processing (e.g., Kourtis et al., 2018).”

      (p 16, ln 340) “In contrast, objects larger than that range typically surpass the animal’s motor capabilities, rendering them too cumbersome for effective manipulation. Consequently, these larger objects are less likely to be considered as typical targets for manipulation by the animal, as opposed to the smaller objects. That is, they are perceived not as the “objects” in the animal’s eye, but as part of the background environment, due to their impracticality for direct interactions.”

      Regarding the RSA analysis, we agree with the reviewer that RSA may offer a more direct comparison with similarities among objects. However, our primary objective in this fMRI study was to explore the neural basis of the affordance boundary observed in the behavioural study, rather than explaining the similarities in neural responses between different objects. For this reason, we did not conduct RSA analysis.

      c) Page 4 Re statistical evaluation of the discontinuity in judgments, the authors might consider a Bayesian approach, which would be stronger than using "all ps > 0.05" to argue that within-boundary similarities are consistent and high.

      Rc: We thank the reviewer for the suggestion on the Bayesian approach for significance tests, which has been now added in the revised manuscript:

      In the results (p 4, ln 105) “This trough suggested an affordance boundary between size rank 4 and 5, while affordance similarities between neighboring ranks remained high (rs > 0.45) and did not significantly differ from each other (ps > 0.05, all 𝐵𝐹10 < 10) on either side of the boundary (Fig. 1d, left panel, green lines).”

      In the methods (p 25, ln 597) “Pearson and Filon’s (1898) Z, implemented in R package “cocor” (Diedenhofen & Musch, 2015) was used to evaluate the significance of these similarities (alpha level = .05, one-tail test). For significance tests, Bayesian statistical analyses were conducted using the web version of the “bayesplay” R package (Colling, 2021). Specifically, the data (likelihood) model was specified as a normal distribution, where the correlation coefficients were transformed to Fisher’s z. The null hypothesis was specified as a standard normal distribution centred at zero. Conversely, the alternative hypothesis was specified as a normal distribution centred at 2. Bayes factors (BF10) were calculated and interpreted using the classification scheme suggested by Wagenmakers et al. (2011), wherein a Bayes factor greater than 10 is considered strong evidence for accepting H1 over H0.”

      d) Page 4 One question I had about the big objects is whether their internal similarity and dissimilarity to smaller objects, might largely arise if most of the answers about actions for those larger objects are just "no"? This depends on the set of possible actions that were considered: the authors chose 14 from a previous study but did not describe these further or consider possible strengths/limitations of this selection. This is a very important point that needs addressing - to what extent are these findings "fragile" in that they relate only to that specific selection of 14 action kinds?

      Rd: The action judgements for objects beyond body size were not mostly “no”; in fact, there was no significant difference between average action possibilities related to objects beyond (25%) and within (26%). Rather, the dissimilarity between objects within and those beyond likely arose from the difference in most-plausible action set they related. For example, the top three actions related to objects within are “grasp”, “hold” and “throw”, while those related to objects beyond are “sit”, “lift” and “stand”, as stated in our original manuscript: “A further analysis on the affordances separated by the boundary revealed that objects within human body size range were primarily subjected to hand-related actions such as grasping, holding and throwing. These affordances typically involve object manipulation with humans’ effectors. In contrast, objects beyond the size range of human body predominantly afforded actions such as sitting and standing, which typically require locomotion or posture change of the whole body around or within the objects (p 11 ln 229)”.

      Regarding the validity of action selection, the selection of the objects and affordances in this study was guided by two key criteria. First, the objects were selected from the dataset published in Konkle and Oliva's study (2011), which systematically investigates the effect of object size on object recognition. Therefore, the range of object sizes, from 14 cm to 7,618 cm, is well-calibrated and represents a typical array of object sizes found in the real world. Second, the actions were selected to cover a wide range of daily humans-objects/environments interactions, from singlepoint movements (e.g., hand, foot) to whole-body movements (e.g., lying, standing), based on the kinetics human action video dataset (Kay et al., 2017). Thus, this set of objects and actions is a sufficiently representative of typic human experiences. In revision, we have clarified these two criteria in the methods section:

      (p 22, ln 517) “The full list of objects, their diagonal size, and size rankings were provided in Supplementary Table S6. The objects were selected from the dataset in Konkle and Oliva’s study (2011) to cover typic object sizes in the world (ranging from 14 cm to 7,618 cm), and actions related to these objects were selected to span a spectrum of daily humans-objects/environments interactions, from single-point movements (e.g., hand, foot) to whole-body movements (e.g., lying, standing), based on the Kinetics Human Action Video Dataset (Kay et al., 2017).”

      Having said this, we agree with reviewer that a larger set of objects and actions will facilitate finer localization of the representational discontinuity, which can be addressed in future studies

      (p 16, ln 344): “…, due to their impracticality for direct interactions. Future studies should incorporate a broader range of objects and a more comprehensive set of affordances for finer delineation of the representational discontinuity between objects and the environment.”

      e) Page 12 "no region showed the congruency effect for objects beyond the body size" in a whole brain analysis. What about a similar analysis for the humanscale objects? We must also keep in mind that with N=12 there may be relatively little power to detect such effects at the random-effects level, so this null finding may not be very informative.

      Re: We thank the reviewer for this advice. The whole brain analysis on the congruency effect for human-scale objects (objects within) has now been included in the supplementary materials (please see Author response figure 1d (New Supplementary Fig. S4d) and Author response table 1 (New Supplementary Table S5) below).

      Author response image 1.

      Significant brain activations of different contrasts in the whole-brain level analysis. a, the effect of object type, positive values (warm color) indicated higher activation for objects within than objects beyond and negative values (cold color) indicated the opposite. b, the effect of congruency, positive values indicated higher activation in congruent than incongruent condition. c, the effect of interaction between object type and congruency, positive values indicated the larger congruency effect for objects within than beyond. d, the congruency effect for objects within. All contrasts were corrected with cluster-level correction at p < .05. The detailed cluster-level results for each contrast map can be found in Supplementary Table S2 to S5.

      Author response table 1.

      Cortical regions showing significant congruency effect (congruent versus incongruent) for objects within, whole-brain analysis (R = right hemisphere, L = left hemisphere; Z > 2.3, p = 0.05, cluster corrected)

      Regarding the power of the fMRI study, we would like to clarify that, the critical test of this fMRI study is the two-way interaction of congruency effect by object size instead of the (null) congruency effect for the object beyond. Having said this, we agree that the sample size is small which might lead to lack of power in the fMRI study. In the revision we have now acknowledged this issue explicitly:

      (p 16, ln 354) “…supporting the idea that affordance is typically represented only for objects within the body size range. While it is acknowledged that the sample size of the fMRI study was small (12 participants), necessitating cautious interpretation of its results, the observed neural-level affordance discontinuity is notable. That is, qualitative differences in neural activity between objects within the affordance boundary and those beyond replicated our behavior findings. This convergent evidence reinforced our claim that objects were discretized into two broad categories along the continuous size axis, with affordance only being manifested for objects within the boundary.”

      f) Page 14 [the fMRI findings] "suggest that affordance perception likely requires perceptual processing and is not necessarily reflected in motor execution". This seems a large leap to make from a relatively basic experiment that tests only a small set of (arbitrarily chosen) objects and actions. It's important to keep in mind too that none of the studies here actually asked participants to interact with objects; that objects were shown as 2D images; and that the differences between real-world sizes of objects were greatly condensed by the way they are scaled for presentation on a computer screen (and such scaling is probably greater for the larger-than-human objects).

      Rf: The action-congruency judgement task is widely used in the studies of affordance processing (e.g., Kourtis et al., 2018; Peelen & Caramazza, 2012), so does the practice of not including actual interaction with the objects and using 2D instead of 3D objects (e.g., Peelen & Caramazza, 2012; Matić et al., 2020). However, we are aware that alternative practice exists in the field and we agree that it would be interesting for future studies to test whether actual interactions and 3D objects presentation may bring any change on the affordance boundary observed in our study.

      Our inference “affordance perception likely requires perceptual processing and is not necessarily reflected in motor execution” was based on the fMRI finding that the congruency effect only in cortical regions proposedly engaged in perceptual processing, but not in the M1 which is associated with motor execution. This significant two-way interaction pointed to a possibility that affordance processing may not necessarily manifest in motor execution.

      We acknowledge the scaling issue inherent in all laboratory experiments, but we doubt that it significantly influenced our results. In fact, it is a common practice in studies on object size to present objects of different physical sizes as constantly sized images on a screen (e.g., Konkle & Oliva, 2012; Huang et al., 2022). Moreover, scaling does not change the smoothness of object sizes, whereas the affordance boundary represents a singularity point that disrupts this smoothness. Finally, regarding the limited variety of objects and actions, please see Rd.

      g) Page 15 Why are larger objects "less interesting"? They have important implications for navigation, for example?

      Rg: We are sorry for the confusion. Our intention was to express that objects beyond the affordance boundary are generally beyond motor capacities of the animal in question. As such, compared to smaller objects within the environment, these larger objects may not typically be considered as potential targets for manipulation. We have now corrected the wording in the revised text:

      (p 16, ln 340) “In contrast, objects larger than that range typically surpass the animal’s motor capabilities, rendering them too cumbersome for effective manipulation. Consequently, these larger objects are less likely to be considered as typical targets for manipulation by the animal, as opposed to smaller objects in the environment. That is, they are perceived not as the “objects” in the animal’s eye, but as part of the background environment, due to their impracticality for direct interactions.”

      h) Page 15 At several places I wondered whether the authors were arguing against a straw man. E.g. "existing psychological studies...define objects in a disembodied manner..." but no citations are given on this point, nor do the authors describe previous theoretical positions that would make a strong counter-claim to the one advocated here.

      Rh: We are sorry for not presenting our argument clearly. Previous studies often define the object space based on object features alone, such as absolute size or function, without reference to the knowledge and the abilities of the agent (e.g., de Beeck et al., 2008; Konkle & Oliva, 2011). This perspective overlooks the importance of the features of the animal-object pairs. Gibson (1979) highlighted that an object’s affordance, which includes all action possibilities it offers to an animal, is determined by the object’s size relative to the animal’s size, rather than its real-world size. Under this embodied view, we argue that the object space is better defined by the features of the agent-object system, and this is the primary assumption and motivation of the present study. We have now clarified this point and added the references in the revision:

      (p 2, ln 35) “A contemporary interpretation of this statement is the embodied theory of cognition (e.g., Chemero, 2013; Gallagher, 2017; Gibbs, 2005; Wilson, 2002; Varela et al., 2017), which, diverging from the belief that size and shape are inherent object features (e.g., de Beeck et al., 2008; Konkle & Oliva, 2011), posits that human body scale (e.g., size) constrains the perception of objects and the generation of motor responses.”

      (p 17, ln 365) “Existing psychological studies, especially in the field of vision, define objects in a disembodied manner, primarily relying on their physical properties such as shape (e.g., de Beeck et al., 2008) and absolute size (e.g., Konkle & Oliva, 2011).”

      Reviewer #3 (Public Review):

      (1) Even after several readings, it is not entirely clear to me what the authors are proposing and to what extent the conducted work actually speaks to this. In the introduction, the authors write that they seek to test if body size serves not merely as a reference for object manipulation but also "plays a pivotal role in shaping the representation of objects." This motivation seems rather vague motivation and it is not clear to me how it could be falsified.

      Similarly, in the discussion, the authors write that large objects do not receive "proper affordance representation," and are "not the range of objects with which the animal is intrinsically inclined to interact, but probably considered a less interesting component of the environment." This statement seems similarly vague and completely beyond the collected data, which did not assess object discriminability or motivational values.

      Overall, the lack of theoretical precision makes it difficult to judge the appropriateness of the approaches and the persuasiveness of the obtained results. This is partly due to the fact that the authors do not spell out all of their theoretical assumptions in the introduction but insert new "speculations" to motivate the corresponding parts of the results section. I would strongly suggest clarifying the theoretical rationale and explaining in more detail how the chosen experiments allow them to test falsifiable predictions.

      R1: We are sorry for the confusion about the theoretical motivation and rationale. Our motivation is on the long-lasting debate regarding the representation versus direct perception of affordance. That is, we tested whether object affordance would simply covary with its continuous constraints such as object size, in line with the representation-free view, or, whether affordance would be ‘representationalized’, in line with the representation-based view, under the constrain of body size. In revision, we have clarified the motivation and its relation to our approach:

      In the introduction (p 2, ln 45): “However, the question of how object perception is influenced by the relative size of objects in relation to the human body remains open. Specifically, it is unclear whether this relative size simply acts as a continuous variable for locomotion reference, or if it affects differentiating and organizing object representations based on their ensued affordances.”

      In the discussion (p 14, ln 295): “One long-lasting debate on affordance centers on the distinction between representational and direct perception of affordance. An outstanding theme shared by many embodied theories of cognition is the replacement hypothesis (e.g., Van Gelder, 1998), which challenges the necessity of representation as posited by computationalism’s cognitive theories (e.g., Fodor, 1975). This hypothesis suggests that input is discretized/categorized and subjected to abstraction or symbolization, creating discrete stand-ins for the input (e.g., representations/states). Such representationalization would lead to a categorization between the affordable (the objects) and those beyond affordance (the environment). Accordingly, computational theories propose the emergence of affordance perception, in contrast to the perspective offered by embodied theories. The present study probed this ‘representationalization’ of affordance by examining whether affordance perception introduces discontinuity and qualitative dissociation in response to continuous action-related physical features (such as object size relative to the agents), which allows sensorimotor input to be assigned into discrete states/kinds, in line with the representation-based view under the constraints of body size. Alternatively, it assessed whether activity directly mirrors the input, free from discretization/categorization/abstraction, in line with the representation-free view.

      First, our study found evidence demonstrating discretization in affordance perception. Then, through the body imagination experiment, we provided causal evidence suggesting that this discretization originates from sensorimotor interactions with objects rather than amodal sources, such as abstract object concepts independent of agent motor capability. Finally, we demonstrated the supramodality of this embodied discontinuity by leveraging the recent advances in AI. We showed that the discretization in affordance perception is supramodally accessible to disembodied agents such as large language models (LLMs), which lack sensorimotor input but can access linguistic materials built upon discretized representations. These results collectively suggest that sensorimotor input undergoes discretization, as implied in the computationalism’s idea of representation. Note that, these results are not contradictory to the claim of the embodied theories, as these representations do shape processes beyond the sensorimotor domain but after discretization.

      The observed boundary in affordance perception extends the understanding of the discontinuity in perception in response to the continuity of physical inputs (Harnad, 1987; Young et al., 1997).”

      We are also sorry for the confusion about the expression “proper affordance representation”. We intended to express that the neural responses to objects beyond the boundary in the whole brain failed to reflect affordance congruency, and therefore did not show evidence of affordance processing. We have clarified this expression in the revised manuscript:

      (p 12, ln 265) “Taken together, the affordance boundary not only separated the objects into two categories based on their relative size to human body, but also delineated the range of objects that evoked neural representations associated with affordance processing.”

      Finally, we agree with the reviewer that the expressions, such as “not…inclined to interact” and “probably considered a less interesting component of the environment”, may be misleading. Rather, we intended to express that the objects beyond the affordance boundary is generally beyond the motor capacities of the very animal, being too large for the very animal to manipulated, as comparing to the smaller objects in the environment, may not be a typical target object for manipulation for the animal. We have revised these expressions in the manuscript and clarified their speculative nature:

      (p 16, ln 340) “In contrast, objects larger than that range typically surpass the animal’s motor capabilities, rendering them too cumbersome for effective manipulation. Consequently, these larger objects are less likely to be considered as typical targets for manipulation by the animal, as opposed to the smaller objects. That is, they are perceived not as the “objects” in the animal’s eye, but as part of the background environment, due to their impracticality for direct interactions.”

      (2) The authors used only a very small set of objects and affordances in their study and they do not describe in sufficient detail how these stimuli were selected. This renders the results rather exploratory and clearly limits their potential to discover general principles of human perception. Much larger sets of objects and affordances and explicit data-driven approaches for their selection would provide a far more convincing approach and allow the authors to rule out that their results are just a consequence of the selected set of objects and actions.

      R2: The selection of the objects and affordances in this study was guided by two key criteria. First, the objects were selected from the dataset published in Konkle and Oliva's study (2011), which systematically investigates the effect of object size on object recognition. Therefore, the range of object sizes, from 14 cm to 7,618 cm, is well-calibrated and represents a typical array of object sizes found in the real world. Second, the actions were selected to cover a wide range of daily humans objects/environments interactions, from single-point movements (e.g., hand, foot) to whole-body movements (e.g., lying, standing), based on the kinetics human action video dataset (Kay et al., 2017). Thus, this set of objects and actions is a sufficiently representative of typic human experiences. In revision, we have clarified these two criteria in the methods section:

      (p 22, ln 517) “The full list of objects, their diagonal sizes, and size rankings were provided in Supplementary Table S6. The objects were selected from the dataset in Konkle and Oliva’s study (2011) to cover typic object sizes in the world (ranging from 14 cm to 7,618 cm), and actions related to these objects were selected to span a spectrum of daily humans-objects/environments interactions, from single-point movements (e.g., hand, foot) to whole-body movements (e.g., lying, standing), based on the Kinetics Human Action Video Dataset (Kay et al., 2017).”

      Having said this, we agree with reviewer that a larger set of objects and actions will facilitate finer localization of the representational discontinuity, which can be addressed in future studies

      (p 16, ln 344): “…, due to their impracticality for direct interactions. Future studies should incorporate a broader range of objects and a more comprehensive set of affordances for finer delineation of the representational discontinuity between objects and the environment.”

      (3) Relatedly, the authors could be more thorough in ruling out potential alternative explanations. Object size likely correlates with other variables that could shape human similarity judgments and the estimated boundary is quite broad (depending on the method, either between 80 and 150 cm or between 105 to 130 cm). More precise estimates of the boundary and more rigorous tests of alternative explanations would add a lot to strengthen the authors' interpretation.

      R3: We agree with the reviewer that correlation analyses alone cannot rule out alternative explanations, as any variable co-varying with object sizes might also affect affordance perception. Therefore, our study experimentally manipulated the imagined body sizes, while keeping other variable constant across conditions. This approach provided evidence of a causal connection between body size and affordance perception, effectively ruling out alternative explanations. In revision, the rationale of experimentally manipulation of imagined body sizes has been clarified

      (p 7, ln 152): “One may argue that the location of the affordance boundary coincidentally fell within the range of human body size, rather than being directly influenced by it. To rule out this possibility, we directly manipulated participants’ body schema, referring to an experiential and dynamic functioning of the living body within its environment (Merleau-Ponty & Smith, 1962). This allowed us to examine whether the affordance boundary would shift in response to changes in the imagined body size. This experimental approach was able to establish a causal link between body size and affordance boundary, as other potential factors remained constant. Specifically, we instructed a new group of participants to imagine themselves as small as a cat (typical diagonal size: 77cm, size rank 4, referred to as the “cat condition”), and another new group to envision themselves as large as an elephant (typical diagonal size: 577 cm, size rank 7, referred to as the “elephant condition”) throughout the task (Fig. 2a).”

      Meanwhile, with correlational analysis, precise location of the boundary cannot help ruling out alternative explanation. However, we agree that future studies are needed to incorporate a broader range of objects and a more comprehensive set of affordances. For details, please see R2.

      (4) Even though the division of the set of objects into two homogenous clusters appears defensible, based on visual inspection of the results, the authors should consider using more formal analysis to justify their interpretation of the data. A variety of metrics exist for cluster analysis (e.g., variation of information, silhouette values) and solutions are typically justified by convergent evidence across different metrics. I would recommend the authors consider using a more formal approach to their cluster definition using some of those metrics.

      R4: We thank the reviewer for the suggestion. We performed three analyses on this point, all of which consistently indicated the division of objects into two distinct groups along the object size axis.

      First, a hierarchical clustering analysis of the heatmaps revealed a two-maincluster structure, which is now detailed in the revised methods section (p 25, ln 589) “A hierarchical clustering analysis was performed, employing the seaborn clustermap method with Euclidean distance and Complete linkage (Waskom, 2021).”

      Second, the similarity in affordances between neighbouring size ranks revealed the same two-main-cluster structure. In this analysis, each object was assigned a realworld size rank, and then Pearson’s correlation was calculated as the affordance similarity index for each pair of neighbouring size ranks to assess how similar the perceived affordances were between these ranks. Our results showed a clear trough in affordance similarity, with the lowest point approaching zero, while affordance similarities between neighbouring ranks on either side of the boundary remained high, confirming the observation that objects formed two groups based on affordance similarity.

      Finally, we analysed silhouette values for this clustering analysis, where 𝑎𝑖 represents the mean intra-cluster distance, and 𝑏𝑖 represents the mean nearest-cluster distance for each data point i. The silhouette coefficient is calculated as (Rousseeuw, 1987):

      The silhouette analysis revealed that the maximum silhouette value coefficient corresponded to a cluster number of two, further confirming the two-cluster structure (please see Author response table 2 below).

      Author response table 2.

      The silhouette values of a k-means clustering when k (number of clusters) = 2 to 10

      (5) While I appreciate the manipulation of imagined body size, as a way to solidify the link between body size and affordance perception, I find it unfortunate that this is implemented in a between-subjects design, as this clearly leaves open the possibility of pre-existing differences between groups. I certainly disagree with the authors' statement that their findings suggest "a causal link between body size and affordance perception."

      R5: The between-subjects design in the imagination experiment was employed to prevent contamination between conditions. Specifically, after imagining oneself as a particular size, it can be challenging to immediately transition to envisioning a different body size. In addition, participating sequentially participate in two conditions that only differ in imagined body sizes may lead to undesirable response strategies, such as deliberately altering responses to the same objects in the different conditions. The reason of employing the between-subjects design is now clarified in the revised text (p 7, ln 161): “A between-subject design was adopted to minimize contamination between conditions. This manipulation was effective, as evidenced by the participants’ reported imagined heights in the cat condition being 42 cm (SD = 25.6) and 450 cm (SD = 426.8) in the elephant condition on average, respectively, when debriefed at the end of the task.”

      Further, to address the concern that “pre-existing differences between groups” would generate this very result, we adhered to standard protocols such as random assignment of participants to different conditions (cat-size versus elephant-size). Moreover, experimentally manipulating one variable (i.e., body schema) to observe its effect on another variable (i.e., affordance boundary) is the standard method for establishing causal relationships between variables. We could not think of other better ways for this objective.

      (6) The use of LLMs in the current study is not clearly motivated and I find it hard to understand what exactly the authors are trying to test through their inclusion. As noted above, I think that the authors should discuss the putative roles of conceptual knowledge, language, and sensorimotor experience already in the introduction to avoid ambiguity about the derived predictions and the chosen methodology. As it currently stands, I find it hard to discern how the presence of perceptual boundaries in LLMs could constitute evidence for affordance-based perception.

      R6: The motivation of LLMs is to test the supramodality of this embodied discontinuity found in behavioral experiments: whether this discontinuity is accessible beyond the sensorimotor domain. To do this, we leveraged the recent advance in AI and tested whether the discretization observed in affordance perception is supramodally accessible to disembodied agents which lack access to sensorimotor input but only have access to the linguistic materials built upon discretized representations, such as large language models (LLM). The theoretical motivation and rationale regarding the LLM study are now included in the introduction and discussion:

      In the introduction (p 2, ln 59) “…, and the body may serve as a metric that facilitates meaningful engagement with the environment by differentiating objects that are accessible for interactions from those not. Further, grounded cognition theory (see Barsalou, 2008 for a review) suggests that the outputs of such differentiation might transcend sensorimotor processes and integrate into supramodal concepts and language. From this perspective, we proposed two hypotheses...”

      In the introduction (p 3, ln 70) “Notably, the affordance boundary varied in response to the imagined body sizes and showed supramodality. It could also be attained solely through language, as evidenced by the large language model (LLM), ChatGPT (OpenAI, 2022).”

      For details in the discussion, please see R1.

      (7) Along the same lines, the fMRI study also provides very limited evidence to support the authors' claims. The use of congruency effects as a way of probing affordance perception is not well motivated. What exactly can we infer from the fact a region may be more active when an object is paired with an activity that the object doesn't afford? The claim that "only the affordances of objects within the range of body size were represented in the brain" certainly seems far beyond the data.

      R7: In our study, we followed the established fMRI research paradigm of employing the congruent effect as a measure of affordance processing (e.g., Kourtis et al., 2018). The choice of this paradigm has now been clarified in the revised manuscript (p 11, ln 244): “The congruency effect, derived from the contrast of Congruent versus Incongruent conditions, is a well-established measure of affordance processing (e.g., Kourtis et al., 2018).”

      The statement that “only the affordances of objects within the range of body size were represented in the brain” is based on the observed interaction of congruency by object size. In the revised text, we have weakened this statement to better align with the direct implications of the interaction effect (p 1 ln 22): “A subsequent fMRI experiment revealed evidence of affordance processing exclusively for objects within the body size range, but not for those beyond. This suggests that only objects capable of being manipulated are the objects capable of offering affordance in the eyes of an organism.”

      (8) Importantly (related to my comments under 2) above), the very small set of objects and affordances in this experiment heavily complicates any conclusions about object size being the crucial variable determining the occurrence of congruency effects.

      R8: The objective of the fMRI study was to provide the neural basis for the affordance discontinuity found in behaviour experiments. In other words, the fMRI study is not an exploratory experiment, and therefore, the present object and action sets, which are based on the behaviour experiments, are sufficient.

      (9) I would also suggest providing a more comprehensive illustration of the results (including the effects of CONGRUENCY, OBJECT SIZE, and their interaction at the whole-brain level).

      R9: We agree and in revision, we have now included these analyses in the supplementary material (p 30, ln 711): “For the whole-brain analyses on the congruency effect, the object size effect, and their interaction, see Supplementary Fig. S4 and Table S2 to S5.” Please see Author response image 2 (New Supplementary Fig. S4) and Author responses tables 3 to 5 (New Supplementary Table S2 to S4) below.

      Author response image 2.

      Significant brain activations of different contrasts in the whole-brain level analysis. a, the effect of object type, positive values (warm color) indicated higher activation for objects within than objects beyond and negative values (cold color) indicated the opposite. b, the effect of congruency, positive values indicated higher activation in congruent than incongruent condition. c, the effect of interaction between object type and congruency, positive values indicated the larger congruency effect for objects within than beyond. d, the congruency effect for objects within. All contrasts were corrected with cluster-level correction at p < .05. The detailed cluster-level results for each contrast map can be found in Supplementary Table S2 to S5.

      Author response table 3.

      Cortical regions reaching significance in the contrasts of (A) objects within versus object beyond and (B) objects beyond versus objects within, whole-brain analysis (R = right hemisphere, L = left hemisphere; Z > 2.3, p = 0.05, cluster corrected).

      Author response table 4.

      Cortical regions reaching significance in contrasts of (A) congruent versus incongruent and (B) incongruent versus congruent, whole-brain analysis (R = right hemisphere, L = left hemisphere; Z > 2.3, p = 0.05, cluster corrected).

      Author response table 5.

      Review Table 5 (New Supplementary Table S4). Cortical regions showing significant interaction between object type and congruency, whole-brain analysis (OW = Objects within, OB = Objects beyond; R = right hemisphere, L = left hemisphere; Z > 2.3, p = 0.05, cluster corrected)

      Reviewer #3 (Recommendations For The Authors):

      a. >a) Clarify all theoretical assumptions already within the introduction and specify how the predictions are tested (and how they could be falsified).

      Ra: Please see R1.

      b. >b) Explain how the chosen experimental approach relates to the theoretical questions under investigation (e.g., it is not clear to me how affordance similarity ratings can inform inference about which part of the environment is perceived as more or less manipulable).

      Rb: We thank the reviewer for the suggestion, and the theoretical motivation and rationale are now clarified. For details, please see R1.

      c. >c) Include a much larger set of objects and affordances in the behavioural experiments (that is more generalizable and also permits a more precise estimation of the boundary), and use a more rigorous methodology to justify a particular cluster solution.

      Rc: Please see R2 for the limited variance of objects and actions, and R4 for more analyses on the boundary.

      d. >d) Clearly motivate what the use of LLMs can contribute to the study of affordance perception.

      Rd: Please see R6.

      e) Clearly motivate why congruency effects are thought to index "affordance representation in the brain" Re: Please see R7.

      e) Include a much larger set of objects and affordances in the fMRI study.

      Re: Please see R7.

      f) Consider toning down the main conclusions based on the limitations outlined above.

      Rf: We have toned down the main conclusions accordingly.

      We are profoundly grateful for the insightful comments and suggestions provided by the three reviewers, which have greatly improved the quality of this manuscript.   References

      Barsalou, L. W. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22(4), 637-660.

      de Beeck, H. P. O., Torfs, K., & Wagemans, J. (2008). Perceived shape similarity among unfamiliar objects and the organization of the human object vision pathway. Journal of Neuroscience, 28(40), 10111-10123.

      Borghi, A. M. (2005). Object concepts and action. Grounding cognition: The role of perception and action in memory, language, and thinking, 8-34.

      Colling, L.J. (2021). ljcolling/go-bayesfactor: (Version v0.9.0).Zenodo. doi: 10.5281/zenodo.4642331

      Crawley, J. A. H., Mumby, H. S., Chapman, S. N., Lahdenperä, M., Mar, K. U., Htut, W., ... & Lummaa, V. (2017). Is bigger better? The relationship between size and reproduction in female Asian elephants. Journal of Evolutionary Biology, 30(10), 1836-1845.

      Ellis, R., & Tucker, M. (2000). Micro‐affordance: The potentiation of components of action by seen objects. British Journal of Psychology, 91(4), 451-471.

      Fan, L., Li, H., Zhuo, J., Zhang, Y., Wang, J., Chen, L., ... & Jiang, T. (2016). The human brainnetome atlas: a new brain atlas based on connectional architecture. Cerebral Cortex, 26(8), 3508-3526.

      Fodor, J. A. (1975). The Language of Thought (Vol. 5). Harvard University Press.

      Gibson, J. J. (1979). The ecological approach to visual perception: Classic edition.

      Hertrich, I., Dietrich, S., & Ackermann, H. (2016). The role of the supplementary motor area for speech and language processing. Neuroscience & Biobehavioral Reviews, 68, 602-610.

      Huang, T., Song, Y., & Liu, J. (2022). Real-world size of objects serves as an axis of object space. Communications Biology, 5(1), 1-12.

      Kantak, S. S., Stinear, J. W., Buch, E. R., & Cohen, L. G. (2012). Rewiring the brain: potential role of the premotor cortex in motor control, learning, and recovery of function following brain injury. Neurorehabilitation and Neural Repair, 26(3), 282-292.

      Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., ... & Zisserman, A. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

      Konkle, T., & Oliva, A. (2011). Canonical visual size for real-world objects. Journal of Experimental Psychology: human perception and performance, 37(1), 23.

      Kourtis, D., Vandemaele, P., & Vingerhoets, G. (2018). Concurrent cortical representations of function-and size-related object affordances: an fMRI study. Cognitive, Affective, & Behavioral Neuroscience, 18, 1221-1232.

      Matić, K., de Beeck, H. O., & Bracci, S. (2020). It's not all about looks: The role of object shape in parietal representations of manual tools. Cortex, 133, 358-370.

      McDannald, D. W., Mansour, M., Rydalch, G., & Bolton, D. A. (2018). Motor affordance for grasping a safety handle. Neuroscience Letters, 683, 131-137.

      NCD Risk Factor Collaboration (NCD-RisC). (2016). A century of trends in adult human height. Elife, 5, e13410.

      Peelen, M. V., & Caramazza, A. (2012). Conceptual object representations in human anterior temporal cortex. Journal of Neuroscience, 32(45), 15728-15736.

      Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65.

      Sakreida, K., Effnert, I., Thill, S., Menz, M. M., Jirak, D., Eickhoff, C. R., ... & Binkofski, F. (2016). Affordance processing in segregated parieto-frontal dorsal stream sub-pathways. Neuroscience & Biobehavioral Reviews, 69, 89-112.

      Van Gelder, T. (1998). The dynamical hypothesis in cognitive science. Behavioral and Brain Sciences, 21(5), 615-628.

      Wagenmakers, E.-J., Wetzels, R., Borsboom, D. & van der Maas, H. L. J. Why psychologists must change the way they analyze their data: the case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426–432.

      Zhen, Z., Yang, Z., Huang, L., Kong, X. Z., Wang, X., Dang, X., ... & Liu, J. (2015). Quantifying interindividual variability and asymmetry of face-selective regions: a probabilistic functional atlas. NeuroImage, 113, 13-25.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We sincerely thank the reviewers for their in-depth consideration of our manuscript and their helpful reviews. Their efforts have made the paper much better. We have responded to each point. The previously provided public responses have been updated they are included after the private response for convenience.

      Reviewer #1 (Recommendations For The Authors):

      1. In general, the manuscript will benefit from copy editing and proof reading. Some obvious edits;

      2. Page 6 line 140. Do the authors mean Cholera toxin B?

      Response: We corrected this error and went through the entire paper carefully correcting for grammar and increased clarity.

      • Page 8 line 173. Methylbetacyclodextrin is misspelled.

      Response: Yes, corrected.

      • Figure 4c is missing representative traces for electrophysiology data.

      • Figure 4. Please check labeling ordering in figure legend as it does not match the panels in the figure.

      Thank you for the correction and we apologize for the confusion in figure 4. We uploaded an incomplete figure legend, and the old panel ‘e’ was not from an experiment that was still in the figure. It was removed and the figure legends are now corrected.

      • Please mention the statistical analysis used in all figure legends.

      Response: Thank you for pointing out this omission, statistics have been added.

      • Although the schematics in each figure helps guide readers, they are very inconsistent and sometimes confusing. For example, in Figure 5 the gating model is far-reaching without conclusive evidence, whereas in Figure 6 it is over simplified and unclear what the image is truly representing (granted that the downstream signaling mechanism and channel is not known).

      Response: Figure 5d is the summary figure for the entire paper. We have made this clearer in the figure legend and we deleted the title above the figure that gave the appearance that the panel relates to swell only. It is the proposed model based on what we show in the paper and what is known about the activation mechanism of TREK-1.

      Figure 6 is supposed to be simple. It is to help the reader understand that when PA is low mechanical sensitivity is high. Without the graphic, previous reviewers got confused about threshold going down and mechanosensitivity going up and how the levels of PA relate. Low PA= high sensitivity. We’ve added a downstream effector to the right side of the panel to avoid any biased to a putative downstream channel effector. The purpose of the experiment is to show PLD has a mechanosensitive phenotype in vivo.

      Reviewer #2 (Recommendations For The Authors):

      This manuscript outlines some really interesting findings demonstrating a mechanism by which mechanically driven alterations in molecular distributions can influence a) the activity of the PLD2 molecule and subsequently b) the activation of TREK-1 when mechanical inputs are applied to a cell or cell membrane.

      The results presented here suggest that this redistribution of molecules represents a modulatory mechanism that alters either the amplitude or the sensitivity of TREK-1 mediated currents evoked by membrane stretch. While the authors do present values for the pressure required to activate 50% of channels (P50), the data presented provides incomplete evidence to conclude a shift in threshold of the currents, given that many of the current traces provided in the supplemental material do not saturate within the stimulus range, thus limiting the application of a Boltzmann fit to determine the P50. I suggest adding additional context to enable readers to better assess the limitations of this use of the Boltzmann fit to generate a P50, or alternately repeating the experiments to apply stimuli up to lytic pressures to saturate the mechanically evoked currents, enabling use of the Boltzmann function to fit the data.

      Response: We thank the reviewer for pointing this out. We agree the currents did not reach saturation. Hence the term P50 could be misleading, so we have removed it from the paper. We now say “half maximal” current measured from non-saturating pressures of 0-60 mmHg. We also deleted the xPLD data in supplemental figure 3C since there is insufficient current to realistically estimate a half maximal response.

      In my opinion, the conclusions presented in this manuscript would be strengthened by an assessment of the amount of TREK-1 in the plasma membrane pre and post application of shear. While the authors do present imaging data in the supplementary materials, these data are insufficiently precise to comment on expression levels in the membrane. To strengthen this conclusion the authors could conduct cell surface biotinylation assays, as a more sensitive and quantitative measure of membrane localisation of the proteins of interest.

      1. Response: as mentioned previously, we do not have an antibody to the extracellular domain. Nonetheless to better address this concern we directly compared the levels of TREK-1, PIP2, and GM1; in xPLD2, mPLD2, enPLD2 with and without shear. The results are in supplemental figure 2. PLD2 is known to increase endocytosis1 and xPLD2 is known to block both agonist induced and constitutive endocytosis of µ-opioid receptor2. The receptor is trapped on the surface. This is true of many proteins including Rho3, ARF4, and ACE21 among others. In agreement with this mechanism, in Figure S2C,G we show that TREK increases with xPLD and the localization can clearly be seen at the plasma membrane just like in all of the other publications with xPLD overexpression. xPLD2 would be expected to inhibit the basal current but we presume the increased expression likely has compensated and there is sufficient PA and PG from other sources to allow for the basal current. It is in this state that we then conduct our ephys and monitor with a millisecond time resolution and see no activation. We are deriving conclusion from a very clear response—Figure 1b shows almost no current, even at 1-10 ms after applying pressure. There is little pressure current when we know the channel is present and capable of conducting ion (Figure 1d red bar). After shear there is a strong decrease in TREK-1 currents on the membrane in the presence of xPLD2. But it is not less than TREK-1 expression with mPLD2. And since mouse PLD2 has the highest basal current and pressure activation current. The amount of TREK-1 present is sufficient to conduct large current. To have almost no detective current would require at least a 10 fold reduction compared to mPLD2 levels before we would lack the sensitivity to see a channel open. Lasty endocytosis typically in on the order of seconds to minutes, no milliseconds.

      2. We have shown an addition 2 independent ways that TREK-1 is on the membrane during our stretch experiments. Figure 1d shows the current immediately prior to applying pressure for wt TREK-1. When catalytically dead PLD is present (xPLD2) there is almost normal basal current. The channel is clearly present. And then in figure 1a we show within a millisecond there is no pressure current. As a control we added a functionally dead TREK-1 truncation (xTREK). Compared to xPLD2 there is clearly normal basal current. If this is not strong evidence the channel was available on the surface for mechanical activation please help us understand why. And if you think within 2.1 ms 100% of the channel is gone by endocytosis please provide some evidence that this is possible so we can reconsider.

      3. We have TIRF super resolution imaging with ~20 nm x-y resolution and ~ 100nm z resolution and Figure 2b clearly shows the channel on the membrane. When we apply pressure in 1b, the channel is present.

      4. Lastly, In our previous studies we showed activation of PLD2 by anesthetics was responsible for all of TREK-1’s anesthetic sensitivity and this was through PLD2 binding to the C-terminus of TREK-15. We showed this was the case by transferring anesthetic sensitivity to an anesthetic insensitive homolog TRAAK. This established conclusively the basic premise of our mechanism. Here we show the same C-terminal region and PLD2 are responsible for the mechanical current observed by TREK-1. TRAAK is already mechanosensitive so the same chimera will not work for our purposes here. But anesthetic activation and mechanical activation are dramatically different stimuli, and the fact that the role of PLD is robustly observed in both should be considered.

      The authors discuss that the endogenous levels of TREK-1 and PLD2 are "well correlated: in C2C12 cells, that TREK-1 displayed little pair correlation with GM1 and that a "small amount of TREK-1 trafficked to PIP2". As such, these data suggest that the data outlined for HEK293T cells may be hampered by artefacts arising from overexpression. Can TREK-1 currents be activated by membrane stretch in these cells C2C12 cells and are they negatively impacted by the presence of xPLD2? Answering this question would provide more insight into the proposed mechanism of action of PLD2 outlined by the authors in this manuscript. If no differences are noted, the model would be called into question. It could be that there are additional cell-specific factors that further regulate this process.

      Response: The low pair correlation of TREK-1 and GM1 in C2C12 cells was due to insufficient levels of cholesterol in the cell membrane to allow for robust domain formation. In Figure 4b we loaded C2C12 cells with cholesterol using the endogenous cholesterol transport protein apoE and serum (an endogenous source of cholesterol). As can be seen in Fig. 4b, the pair correlation dramatically increased (purple line). This was also true in neuronal cells (N2a) (Fig 4d, purple bar). And shear (3 dynes/cm2) caused the TREK-1 that was in the GM1 domains to leave (red bar) reversing the effect of high cholesterol. This demonstrates our proposed mechanism is working as we expect with endogenously expressed proteins.

      There are many channels in C2C12 cells, it would be difficult to isolate TREK-1 currents, which is why we replicated the entire system (ephys and dSTORM) in HEK cells. Note, in figure 4c we also show that adding cholesterol inhibits TREK-1 whole cell currents in HEK293cells.

      As mentioned in the public review, the behavioural experiments in D. melanogaster can not solely be attributed to a change in threshold. While there may be a change in the threshold to drive a different behaviour, the writing is insufficiently precise to make clear that conclusions cannot be drawn from these experiments regarding the functional underpinnings of this outcome. Are there changes in resting membrane potential in the mutant flys? Alterations in Nav activity? Without controlling for these alternate explanations it is difficult to see what this last piece of data adds to the manuscript, particularly given the lack of TREK-1 in this organism. At the very least, some editing of the text to more clearly indicate that these data can only be used to draw conclusions on the change in threshold for driving the behaviour not the change in threshold of the actual mechanotransduction event (i.e. conversion of the mechanical stimulus into an electrochemical signal).

      Response: We agree; features other than PLDs direct mechanosensitivity are likely contributing. This was shown in figure 6g left side. We have an arrow going to ion channel and to other downstream effectors. We’ve added the putative alteration to downstream effectors to the right side of the panel. This should make it clear that we no more speculate the involvement of a channel than any of the other many potential downstream effectors. As mentioned above, the figure helps the reader coordinate low PA with increased mechanosensitivity. Without the graphic reviewers got confused that PA increased the threshold which corresponds to a decreased sensitivity to pain. Nonetheless we removed our conclusion about fly thresholds from the abstract and made clearer in the main text the lack of mechanism downstream of PLD in flies including endocytosis. Supplemental Figure S2H also helps emphasize this. .

      Nav channels are interesting, and since PLD contribute to endocytosis and Nav channels are also regulated by endocytosis there is likely a PLD specific effect using Nav channels. There are many ways PA likely regulates mechanosensitive thresholds, but we feel Nav is beyond the scope of our paper. Someone else will need to do those studies. We have amended a paragraph in the conclusion which clearly states we do not know the specific mechanism at work here with the suggestions for future research to discover the role of lipid and lipid-modifying enzymes in mechanosensitive neurons.

      There may be fundamental flaws in how the statistics have been conducted. The methods section indicates that all statistical testing was performed with a Student's t-test. A visual scan of many of the data sets in the figures suggests that they are not normally distributed, thus a parametric test such as a Student's t-test is not valid. The authors should assess if each data set is normally distributed, and if not, a non-parametric statistical test should be applied. I recommend assessing the robustness of the statistical analyses and adjusting as necessary.

      Response: We thank the reviewer for pointing this out, indeed there is some asymmetry in Figure 6C-d. The p values with Mann Whitney were slightly improved p=0.016 and p=0.0022 for 6c and 6d respectively. For reference, the students t-test had slightly worse statistics p=0.040 and p=0.0023. The score remained the same 1 and 2 stars respectively.

      The references provided for the statement regarding cascade activation of the TRPs are incredibly out of date. While it is clear that TRPV4 can be activated by a second messenger cascade downstream of osmotic swelling of cells, TRPV4 has also been shown to be activated by mechanical inputs at the cell-substrate interface, even when the second messenger cascade is inhibited. Recommend updating the references to reflect more current understanding of channel activation.

      Response: We thank the reviewer for pointing this out. We have updated the references and changed the comment to “can be” instead of “are”. The reference is more general to multiple ion channel types including KCNQ4. This should avoid any perceived conflict with the cellsubstrate interface mechanism which we very much agree is a correct mechanism for TRP channels.

      Minor comments re text editing etc:

      The central messages of the manuscript would benefit from extensive work to increase the precision of the writing of the manuscript and the presentation of data in the figures, such textual changes alone would help address a number of the concerns outlined in this review, by clarifying some ambiguities. There are numerous errors throughout, ranging from grammatical issues, ambiguities with definitions, lack of scale bars in images, lack of labels on graph axes, lack of clarity due to the mode of presentation of sample numbers (it would be far more precise to indicate specific numbers for each sample rather than a range, which is ambiguous and confusing), unnecessary and repeat information in the methods section. Below are some examples but this list is not exhaustive.

      Response: Thank you, reviewer # 1 also had many of these concerns. We have gone through the entire paper and improved the precision of the writing of the manuscript. We have also added the missing error bar to Figure 6. And axis labels have been added to the inset images. The redundancy in cell culture methods has been removed. Where a range is small and there are lots of values, the exact number of ‘n’ are graphically displayed in the dot plot for each condition.

      Text:

      I recommend considering how to discuss the various aspects of channel activation. A convention in the field is to use mechanical activation or mechanical gating to describe that process where the mechanical stimulus is directly coupled to the channel gating mechanism. This would be the case for the activation of TREK-1 by membrane stretch alone. The increase in activation by PLD2 activity then reflects a modulation of the mechanical activation of the channel, because the relevant gating stimulus is PA, rather than force/stretch. The sum of these events could be described as shear-evoked or mechanically-evoked, TREK-1 mediated currents (thus making it clear that the mechanical stimulus initiates the relevant cascade, but the gating stimulus may be other than direct mechanical input.) Given the interesting and compelling data offered in this manuscript regarding the sensitisation of TREK-1 dependent mechanicallyevoked currents by PLD2, an increase in the precision of the language would help convey the central message of this work.

      Response; We agree there needs to be convention. We have taken the suggestion of mechanically evoked and we suggest the following definitions:

      1. Mechanical activation of PLD2: direct force on the lipids releasing PLD2 from nonactivating lipids.

      2. Mechanical activation/gating of TREK1: direct force from lipids from either tension or hydrophobic mismatch that opens the channel.

      3. Mechanically evoked: a mechanical event that leads to a downstream effect. The effect is mechanically “evoked”.

      4. Spatial patterning/biochemistry: nanoscopic changes in the association of a protein with a nanoscopic lipid cluster or compartment.

      An example of where discussion of mechanical activation is ambiguous in the text is found at line 109: "channel could be mechanically activated by a movement from GM1 to PIP2 lipids." In this case, the sentence could be suggesting that the movement between lipids provides the mechanical input that activates the channel, which is not what the data suggest.

      Response: Were possible we have replaced “movement” with “spatial patterning” and “association” and “dissociation” from specific lipid compartment. This better reflects the data we have in this paper. However, we do think that a movement mechanically activates the channel, GM1 lipids are thick and PIP2 lipids are thin, so movement between the lipids could activate the channel through direct lipid interaction. We will address this aspect in a future paper.

      Inconsistencies with usage:

      • TREK1 versus TREK-1

      Response: corrected to TREK-1

      • mPLD2 versus PLD2

      Response: where PLD2 represents mouse this has been corrected.

      • K758R versus xPLD2

      Response: we replaced K758R in the methods with xPLD2.

      • HEK293T versus HEK293t Response: we have changed all instances to read HEK293T.

      • Drosophila melanogaster and D. melanogaster used inconsistently and in many places incorrectly

      Response: we have read all to read the common name Drosophila.

      Line 173: misspelled methylbetacyclodextrin

      Response corrected

      Line 174: degree symbol missing

      Response corrected

      Line 287: "the decrease in cholesterol likely evolved to further decrease the palmate order in the palmitate binding site"... no evidence, no support for this statement, falsely attributes intention to evolutionary processes .

      Response: we have removed the reference to evolution at the request of the reviewer, it is not necessary. But we do wish to note that to our knowledge, all biological function is scientifically attributed to evolution. The fact that cholesterol decreases in response to shear is evidence alone that the cell evolved to do it.

      Line 307: grammatical error

      Response: the redundant Lipid removed.

      Line 319: overinterpreted - how is the mechanosensitivy of GPCRs explained by this translocation?

      Response: all G-alpha subunits of the GPCR complex are palmitoylated. We showed PLD (which has the same lipidation) is mechanically activated. If the palmitate site is disrupted for PLD2, then it is likely disrupted for every G-alpha subunit as well.

      Line 582: what is the wild type referred to here?

      Response: human full length with a GFP tag.

      Methods:

      • Sincere apologies if I missed something but I do not recall seeing any experiments using purified TREK-1 or flux assays. These details should be removed from the methods section

      Response: Removed.

      • There is significant duplication of detail across the methods (three separate instances of electrophysiology details) these could definitely be consolidated.

      Response: Duplicates removed.

      Figures:

      • Figure 2- b box doesn't correspond to inset. Bottom panel should provide overview image for the cell that was assessed with shear. In bottom panel, circle outlines an empty space.

      Response: We have widened the box slightly to correspond so the non shear box corresponds to the middle panel. We have also added the picture for the whole cell to Fig S2g and outlined the zoom shown in the bottom panel of Fig 2b as requested. The figure is of the top of a cell. We also added the whole cell image of a second sheared cell.

      Author response image 1.

      • Figure 3 b+c: inset graph lacking axis labels

      Response; the inset y axis is the same as the main axis. We added “pair corr. (5nM)” and a description in the figure legend to make this clearer. The purpose of the inset is to show statistical significance at a single point. The contrast has been maximized but without zooming in points can be difficult to see.

      • Figure 5: replicate numbers missing and individual data points lacking in panels b + c, no labels of curve in b + c, insets, unclear what (5 nm) refers to in insets.

      Response: Thank you for pointing out these errors. The N values have been added. Similar to figure 3, the inset is a bar graph of the pair correlation data at 5 nm. A better explanation of the data has been added to the figure legend.

      • Figure 6: no scale bar, no clear membrane localization evident from images presented, panel g offers virtually nothing in terms of insight

      Response: We have added scale bars to figure 6b. Figure 6g is intentionally simplistic, we found that correlating decreased threshold with increased pain was confusing. A previous reviewer claimed our data was inconsistent. The graphic avoids this confusion. We also added negative effects of low PA on downstream effects to the right panel. This helps graphically show we don’t know the downstream effects.

      Reviewer #3 (Recommendations For The Authors):

      Minor suggestions:

      1. line 162, change 'heat' to 'temperature'.

      Response: changed.

      1. in figure 1, it would be helpful to keep the unit for current density consistent among different panels. 1e is a bit confusing: isn't the point of Figure 1 that most of TREK1 activation is not caused by direct force-sensing?

      Response: Yes, the point of figure 1 is to show that in a biological membrane over expressed TREK-1 is a downstream effector of PLD2 mechanosensation which is indirect. We agree the figure legend in the previous version of the paper is very confusing.

      There is almost no PLD2 independent current in our over expressed system, which is represented by no ions in the conduction pathway of the channel despite there being tension on the membrane.

      Purified TREK-1 is only mechanosensitive in a few select lipids, primarily crude Soy PC. It was always assumed that HEK293 and Cos cells had the correct lipids since over expressed TREK-1 responded to mechanical force in these lipids. But that does not appear to be correct, or at least only a small amount of TREK-1 is in the mechanosensitive lipids. Figure 1e graphically shows this. The arrows indicate tension, but the channel isn’t open with xPLD2 present. We added a few sentences to the discussion to further clarify.

      Panels c has different units because the area of the tip was measured whereas in d the resistance of the tip was measured. They are different ways for normalizing for small differences in tip size.

      1. line 178, ~45 of what?

      Response: Cells were fixed for ~30 sec.

      1. line 219 should be Figure 4f?

      Response: thank you, yes Figure 4f.

      Previous public reviews with minor updates.

      Reviewer #1 (Public Review):

      Force sensing and gating mechanisms of the mechanically activated ion channels is an area of broad interest in the field of mechanotransduction. These channels perform important biological functions by converting mechanical force into electrical signals. To understand their underlying physiological processes, it is important to determine gating mechanisms, especially those mediated by lipids. The authors in this manuscript describe a mechanism for mechanically induced activation of TREK-1 (TWIK-related K+ channel. They propose that force induced disruption of ganglioside (GM1) and cholesterol causes relocation of TREK-1 associated with phospholipase D2 (PLD2) to 4,5-bisphosphate (PIP2) clusters, where PLD2 catalytic activity produces phosphatidic acid that can activate the channel. To test their hypothesis, they use dSTORM to measure TREK-1 and PLD2 colocalization with either GM1 or PIP2. They find that shear stress decreases TREK-1/PLD2 colocalization with GM1 and relocates to cluster with PIP2. These movements are affected by TREK-1 C-terminal or PLD2 mutations suggesting that the interaction is important for channel re-location. The authors then draw a correlation to cholesterol suggesting that TREK-1 movement is cholesterol dependent. It is important to note that this is not the only method of channel activation and that one not involving PLD2 also exists. Overall, the authors conclude that force is sensed by ordered lipids and PLD2 associates with TREK-1 to selectively gate the channel. Although the proposed mechanism is solid, some concerns remain.

      1) Most conclusions in the paper heavily depend on the dSTORM data. But the images provided lack resolution. This makes it difficult for the readers to assess the representative images.

      Response: The images were provided are at 300 dpi. Perhaps the reviewer is referring to contrast in Figure 2? We are happy to increase the contrast or resolution.

      As a side note, we feel the main conclusion of the paper, mechanical activation of TREK-1 through PLD2, depended primarily on the electrophysiology in Figure 1b-c, not the dSTORM. But both complement each other.

      2) The experiments in Figure 6 are a bit puzzling. The entire premise of the paper is to establish gating mechanism of TREK-1 mediated by PLD2; however, the motivation behind using flies, which do not express TREK-1 is puzzling.

      Response: The fly experiment shows that PLD mechanosensitivity is more evolutionarily conserved than TREK-1 mechanosensitivity. We have added this observation to the paper.

      -Figure 6B, the image is too blown out and looks over saturated. Unclear whether the resolution in subcellular localization is obvious or not.

      Response: Figure 6B is a confocal image, it is not dSTORM. There is no dSTORM in Figure 6. We have added the error bars to make this more obvious. For reference, only a few cells would fit in the field of view with dSTORM.

      -Figure 6C-D, the differences in activity threshold is 1 or less than 1g. Is this physiologically relevant? How does this compare to other conditions in flies that can affect mechanosensitivity, for example?

      Response: Yes, 1g is physiologically relevant. It is almost the force needed to wake a fly from sleep (1.2-3.2g). See ref 33. Murphy Nature Pro. 2017.

      3) 70mOsm is a high degree of osmotic stress. How confident are the authors that a cell health is maintained under this condition and b. this does indeed induce membrane stretch? For example, does this stimulation activate TREK-1?

      Response: Yes, osmotic swell activates TREK1. This was shown in ref 19 (Patel et al 1998). We agree the 70 mOsm is a high degree of stress. This needs to be stated better in the paper.

      Reviewer #2 (Public Review):

      This manuscript by Petersen and colleagues investigates the mechanistic underpinnings of activation of the ion channel TREK-1 by mechanical inputs (fluid shear or membrane stretch) applied to cells. Using a combination of super-resolution microticopy, pair correlation analysis and electrophysiology, the authors show that the application of shear to a cell can lead to changes in the distribution of TREK-1 and the enzyme PhospholipaseD2 (PLD2), relative to lipid domains defined by either GM1 or PIP2. The activation of TREK-1 by mechanical stimuli was shown to be sensi>zed by the presence of PLD2, but not a catalytically dead xPLD2 mutant. In addition, the activity of PLD2 is increased when the molecule is more associated with PIP2, rather than GM1 defined lipid domains. The presented data do not exclude direct mechanical activation of TREK-1, rather suggest a modulation of TREK-1 activity, increasing sensitivity to mechanical inputs, through an inherent mechanosensitivity of PLD2 activity. The authors additionally claim that PLD2 can regulate transduction thresholds in vivo using Drosophila melanogaster behavioural assays. However, this section of the manuscript overstates the experimental findings, given that it is unclear how the disruption of PLD2 is leading to behavioural changes, given the lack of a TREK-1 homologue in this organism and the lack of supporting data on molecular function in the relevant cells.

      Response: We agree, the downstream effectors of PLD2 mechanosensitivity are not known in the fly. Other anionic lipids have been shown to mediate pain see ref 46 and 47. We do not wish to make any claim beyond PLD2 being an in vivo contributor to a fly’s response to mechanical force. We have removed the speculative conclusions about fly thresholds from the abstract.

      That said we do believe we have established a molecular function at the cellular level. We showed PLD is robustly mechanically activated in a cultured fly cell line (BG2-c2) Figure 6a of the manuscript. And our previous publication established mechanosensation of PLD (Petersen et. al. Nature Com 2016) through mechanical disruption of the lipids. At a minimum, the experiments show PLDs mechanosensitivity is evolutionarily better conserved across species than TREK1.

      This work will be of interest to the growing community of scientists investigating the myriad mechanisms that can tune mechanical sensitivity of cells, providing valuable insight into the role of functional PLD2 in sensi>zing TREK-1 activation in response to mechanical inputs, in some cellular systems.

      The authors convincingly demonstrate that, post application of shear, an alteration in the distribution of TREK-1 and mPLD2 (in HEK293T cells) from being correlated with GM1 defined domains (no shear) to increased correlation with PIP2 defined membrane domains (post shear). These data were generated using super-resolution microticopy to visualise, at sub diffraction resolution, the localisation of labelled protein, compared to labelled lipids. The use of super-resolution imaging enabled the authors to visualise changes in cluster association that would not have been achievable with diffraction limited microticopy. However, the conclusion that this change in association reflects TREK-1 leaving one cluster and moving to another overinterprets these data, as the data were generated from sta>c measurements of fixed cells, rather than dynamic measurements capturing molecular movements.

      When assessing molecular distribution of endogenous TREK-1 and PLD2, these molecules are described as "well correlated: in C2C12 cells" however it is challenging to assess what "well correlated" means, precisely in this context. This limitation is compounded by the conclusion that TREK-1 displayed little pair correlation with GM1 and the authors describe a "small amount of TREK-1 trafficked to PIP2". As such, these data may suggest that the findings outlined for HEK293T cells may be influenced by artefacts arising from overexpression.

      The changes in TREK-1 sensitivity to mechanical activation could also reflect changes in the amount of TREK-1 in the plasma membrane. The authors suggest that the presence of a leak currently accounts for the presence of TREK-1 in the plasma membrane, however they do not account for whether there are significant changes in the membrane localisation of the channel in the presence of mPLD2 versus xPLD2. The supplementary data provide some images of fluorescently labelled TREK-1 in cells, and the authors state that truncating the c-terminus has no effect on expression at the plasma membrane, however these data provide inadequate support for this conclusion. In addition, the data reporting the P50 should be noted with caution, given the lack of saturation of the current in response to the stimulus range.

      Response: We thank the reviewer for his/her concern about expression levels. We did test TREK-1 expression. mPLD decreases TREK-1 expression ~two-fold (see Author response image 2 below). We did not include the mPLD data since TREK-1 was mechanically activated with mPLD. For expression to account for the loss of TREK-1 stretch current (Figure 1b), xPLD would need to block surface expression of TREK-1 prior to stretch. The opposite was true, xPLD2 increased TREK-1 expression (see Figure S2c). Furthermore, we tested the leak current of TREK-1 at 0 mV and 0 mmHg of stretch. Basal leak current was no different with xPLD2 compared to endogenous PLD (Figure 1d; red vs grey bars respectively) suggesting TREK-1 is in the membrane and active when xPLD2 is present. If anything, the magnitude of the effect with xPLD would be larger if the expression levels were equal.

      Author response image 2.

      TREK expression at the plasma membrane. TREK-1 Fluorescence was measured by GFP at points along the plasma membrane. Over expression of mouse PLD2 (mPLD) decrease the amount of full-length TREK-1 (FL TREK) on the surface more than 2-fold compared to endogenously expressed PLD (enPLD) or truncated TREK (TREKtrunc) which is missing the PLD binding site in the C-terminus. Over expression of mPLD had no effect on TREKtrunc.

      Finally, by manipulating PLD2 in D. melanogaster, the authors show changes in behaviour when larvae are exposed to either mechanical or electrical inputs. The depletion of PLD2 is concluded to lead to a reduction in activation thresholds and to suggest an in vivo role for PA lipid signaling in setting thresholds for both mechanosensitivity and pain. However, while the data provided demonstrate convincing changes in behaviour and these changes could be explained by changes in transduction thresholds, these data only provide weak support for this specific conclusion. As the authors note, there is no TREK-1 in D. melanogaster, as such the reported findings could be accounted for by other explanations, not least including potential alterations in the activation threshold of Nav channels required for action potential generation. To conclude that the outcomes were in fact mediated by changes in mechanotransduction, the authors would need to demonstrate changes in receptor potential generation, rather than deriving conclusions from changes in behaviour that could arise from alterations in resting membrane potential, receptor potential generation or the activity of the voltage gated channels required for action potential generation.

      Response: We are willing to restrict the conclusion about the fly behavior as the reviewers see fit. We have shown PLD is mechanosensitivity in a fly cell line, and when we knock out PLD from a fly, the animal exhibits a mechanosensation phenotype. We tried to make it clear in the figure and in the text that we have no evidence of a particular mechanism downstream of PLD mechanosensation.

      This work provides further evidence of the astounding flexibility of mechanical sensing in cells. By outlining how mechanical activation of TREK-1 can be sensitised by mechanical regulation of PLD2 activity, the authors highlight a mechanism by which TREK-1 sensitivity could be regulated under distinct physiological conditions.

      Reviewer #3 (Public Review):

      The manuscript "Mechanical activation of TWIK-related potassium channel by nanoscopic movement and second messenger signaling" presents a new mechanism for the activation of TREK-1 channel. The mechanism suggests that TREK1 is activated by phosphatidic acids that are produced via a mechanosensitive motion of PLD2 to PIP2-enriched domains. Overall, I found the topic interesting, but several typos and unclarities reduced the readability of the manuscript. Additionally, I have several major concerns on the interpretation of the results. Therefore, the proposed mechanism is not fully supported by the presented data. Lastly, the mechanism is based on several previous studies from the Hansen lab, however, the novelty of the current manuscript is not clearly stated. For example, in the 2nd result section, the authors stated, "fluid shear causes PLD2 to move from cholesterol dependent GM1 clusters to PIP2 clusters and this activated the enzyme". However, this is also presented as a new finding in section 3 "Mechanism of PLD2 activation by shear."

      For PLD2 dependent TREK-1 activation. Overall, I found the results compelling. However, two key results are missing.

      1. Does HEK cells have endogenous PLD2? If so, it's hard to claim that the authors can measure PLD2-independent TREK1 activation.

      Response: yes, there is endogenous PLD (enPLD). We calculated the relative expression of xPLD2 vs enPLD. xPLD2 is >10x more abundant (Fig. S3d of Pavel et al PNAS 2020, ref 14 of the current manuscript). Hence, as with anesthetic sensitivity, we expect the xPLD to out compete the endogenous PLD, which is what we see. We added the following sentence and reference : “The xPLD2 expression is >10x the endogenous PLD2 (enPLD2) and out computes the TREK-1 binding site for PLD25.”

      1. Does the plasma membrane trafficking of TREK1 remain the same under different conditions (PLD2 overexpression, truncation)? From Figure S2, the truncated TREK1 seem to have very poor trafficking. The change of trafficking could significantly contribute to the interpretation of the data in Figure 1.

      Response: If the PLD2 binding site is removed (TREK-1trunc), yes, the trafficking to the plasma membrane is unaffected by the expression of xPLD and mPLD (Author response image 2 above). For full length TREK1 (FL-TREK-1), co-expression of mPLD decreases TREK expression (Author response image 2) and coexpression with xPLD increases TREK expression (Figure S2f). This is exactly opposite of what one would expect if surface expression accounted for the change in pressure currents. Hence, we conclude surface expression does not account for loss of TREK-1 mechanosensitivity with xPLD2. A few sentences was added to the discussion. We also performed dSTORM on the TREKtruncated using EGFP. TREK-truncated goes to PIP2 (see figure 2 of 6)

      Author response image 3.

      To better compare the levels of TREK-1 before and after shear, we added a supplemental figure S2f where the protein was compared simultaneously in all conditions. 15 min of shear significantly decreased TREK-1 except with mPLD2 where the levels before shear were already lowest of all the expression levels tested.

      For shear-induced movement of TREK1 between nanodomains. The section is convincing, however I'm not an expert on super-resolution imaging. Also, it would be helpful to clarify whether the shear stress was maintained during fixation. If not, what is the >me gap between reduced shear and the fixed state. lastly, it's unclear why shear flow changes the level of TREK1 and PIP2.

      Response: Shear was maintained during the fixing. xPLD2 blocks endocytosis, presumably endocytosis and or release of other lipid modifying enzymes affect the system. The change in TREK-1 levels appears to be directly through an interaction with PLD as TREK trunc is not affected by over expression of xPLD or mPLD.

      For the mechanism of PLD2 activation by shear. I found this section not convincing. Therefore, the question of how does PLD2 sense mechanical force on the membrane is not fully addressed. Par>cularly, it's hard to imagine an acute 25% decrease cholesterol level by shear - where did the cholesterol go? Details on the measurements of free cholesterol level is unclear and additional/alternative experiments are needed to prove the reduction in cholesterol by shear.

      Response: The question “how does PLD2 sense mechanical force on the membrane” we addressed and published in Nature Comm. In 2016. The title of that paper is “Kinetic disruption of lipid rafts is a mechanosensor for phospholipase D” see ref 13 Petersen et. al. PLD is a soluble protein associated to the membrane through palmitoylation. There is no transmembrane domain, which narrows the possible mechanism of its mechanosensation to disruption.

      The Nature Comm. reviewer identified as “an expert in PLD signaling” wrote the following of our data and the proposed mechanism:

      “This is a provocative report that identi0ies several unique properties of phospholipase D2 (PLD2). It explains in a novel way some long established observations including that the enzyme is largely regulated by substrate presentation which 0its nicely with the authors model of segregation of the two lipid raft domains (cholesterol ordered vs PIP2 containing). Although PLD has previously been reported to be involved in mechanosensory transduction processes (as cited by the authors) this is the 0irst such report associating the enzyme with this type of signaling... It presents a novel model that is internally consistent with previous literature as well as the data shown in this manuscript. It suggests a new role for PLD2 as a force transduction tied to the physical structure of lipid rafts and uses parallel methods of disrup0on to test the predic0ons of their model.”

      Regarding cholesterol. We use a fluorescent cholesterol oxidase assay which we described in the methods. This is an appropriate assay for determining cholesterol levels in a cell which we use routinely. We have published in multiple journals using this method, see references 28, 30, 31. Working out the metabolic fate of cholesterol after sheer is indeed interesting but well beyond the scope of this paper. Furthermore, we indirectly confirmed our finding using dSTORM cluster analysis (Figure 3d-e). The cluster analysis shows a decrease in GM1 cluster size consistent with our previous experiments where we chemically depleted cholesterol and saw a similar decrease in cluster size (see ref 13). All the data are internally consistent, and the cholesterol assay is properly done. We see no reason to reject the data.

      Importantly, there is no direct evidence for "shear thinning" of the membrane and the authors should avoid claiming shear thinning in the abstract and summary of the manuscript.

      Response: We previously established a kinetic model for PLD2 activation see ref 13 (Petersen et al Nature Comm 2016). In that publication we discussed both entropy and heat as mechanisms of disruption. Here we controlled for heat which narrowed that model to entropy (i.e., shear thinning) (see Figure 3c). We provide an overall justification below. But this is a small refinement of our previous paper, and we prefer not to complicate the current paper. We believe the proper rheological term is shear thinning. The following justification, which is largely adapted from ref 13, could be added to the supplement if the reviewer wishes.

      Justification: To establish shear thinning in a biological membrane, we initially used a soluble enzyme that has no transmembrane domain, phospholipase D2 (PLD2). PLD2 is a soluble enzyme and associated with the membrane by palmitate, a saturated 16 carbon lipid attached to the enzyme. In the absence of a transmembrane domain, mechanisms of mechanosensation involving hydrophobic mismatch, tension, midplane bending, and curvature can largely be excluded. Rather the mechanism appears to be a change in fluidity (i.e., kinetic in nature). GM1 domains are ordered, and the palmate forms van der Waals bonds with the GM1 lipids. The bonds must be broken for PLD to no longer associate with GM1 lipids. We established this in our 2016 paper, ref 13. In that paper we called it a kinetic effect, however we did not experimentally distinguish enthalpy (heat) vs. entropy (order). Heat is Newtonian and entropy (i.e., shear thinning) is non-Newtonian. In the current study we paid closer attention to the heat and ruled it out (see Figure 3c and methods). We could propose a mechanism based on kinetic disruption, but we know the disruption is not due to melting of the lipids (enthalpy), which leaves shear thinning (entropy) as the plausible mechanism.

      The authors should also be aware that hypotonic shock is a very dirty assay for stretching the cell membrane. Ouen, there is only a transient increase in membrane tension, accompanied by many biochemical changes in the cells (including acidification, changes of concentration etc). Therefore, I would not consider this as definitive proof that PLD2 can be activated by stretching membrane.

      Response: Comment noted. We trust the reviewer is correct. In 1998 osmotic shock was used to activate the channel. We only intended to show that the system is consistent with previous electrophysiologic experiments.

      References cited:

      1 Du G, Huang P, Liang BT, Frohman MA. Phospholipase D2 localizes to the plasma membrane and regulates angiotensin II receptor endocytosis. Mol Biol Cell 2004;15:1024–30. htps://doi.org/10.1091/mbc.E03-09-0673.

      2 Koch T, Wu DF, Yang LQ, Brandenburg LO, Höllt V. Role of phospholipase D2 in the agonist-induced and constistutive endocytosis of G-protein coupled receptors. J Neurochem 2006;97:365–72. htps://doi.org/10.1111/j.1471-4159.2006.03736.x.

      3 Wheeler DS, Underhill SM, Stolz DB, Murdoch GH, Thiels E, Romero G, et al. Amphetamine activates Rho GTPase signaling to mediate dopamine transporter internalization and acute behavioral effects of amphetamine. Proc Natl Acad Sci U S A 2015;112:E7138–47. htps://doi.org/10.1073/pnas.1511670112.

      4 Rankovic M, Jacob L, Rankovic V, Brandenburg L-OO, Schröder H, Höllt V, et al. ADP-ribosylation factor 6 regulates mu-opioid receptor trafficking and signaling via activation of phospholipase D2. Cell Signal 2009;21:1784–93. htps://doi.org/10.1016/j.cellsig.2009.07.014.

      5 Pavel MA, Petersen EN, Wang H, Lerner RA, Hansen SB. Studies on the mechanism of general anesthesia. Proc Natl Acad Sci U S A 2020;117:13757–66. htps://doi.org/10.1073/pnas.2004259117.

      6 Call IM, Bois JL, Hansen SB. Super-resolution imaging of potassium channels with genetically encoded EGFP. BioRxiv 2023. htps://doi.org/10.1101/2023.10.13.561998.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public reviews:

      Reviewer #1:

      This work by Leclercq and colleagues performed metabolomics on biospecimens collected from 96 patients diagnosed with several types of alcohol use disorders (AUD). The authors discovered strong alterations in circulating glycerophospholipids, bile acids, and some gut microbe-derived metabolites in AUD patients compared to controls. An exciting part of this work is that metabolomics was also performed in frontal cortex of post-mortem brains and cerebrospinal fluid of heavy alcohol users, and some of the same metabolites were seen to be altered in the central nervous system. This is an important study that will form the basis for hypothesis generation around diet-microbe-host interactions in alcohol use disorder. The work is done in a highly rigorous manner, and the rigorously collected human samples are a clear strength of this work. Overall, many new insights may be gained by this work, and it is poised to have a high impact on the field.

      Strengths:

      (1) The rigorously collected patient-derived samples.

      (2) There is high rigor in the metabolomics investigation.

      (3) Statistical analyses are well-described and strong.

      (4) An evident strength is the careful control of taking blood samples at the same time of the day to avoid alterations in meal- and circadian-related fluctuations in metabolites.

      Weaknesses:

      (1) Some validation in animal models of ethanol exposure compared to pair-fed controls would help strengthen causal relationships between metabolites and alterations in the CNS.

      (2) The classification of "heavy alcohol users" based on autopsy reports may not be that accurate.

      (3) The fact that most people with alcohol use disorder choose to drink over eating food, there needs to be some more discussion around how dietary intake (secondary to heavy drinking) most likely has a significant impact on the metabolome.<br />

      We thank this reviewer for his/her encouraging comments and for highlighting the fact that this study is important in the field to generate hypotheses around diet-microbe-host interactions in alcohol use disorder.

      Concerning weakness #1: Regarding the validation in animal models of ethanol exposure, we were very careful in our discussion to avoid pretending that the study allowed to test causality of the factors. This was certainly not the objective of the present study. The testing of causality would indeed probably necessitate animal models but these models could only test the effects of one single metabolite at a time and could not at the same time capture the complexity of the changes occurring in AUD patients. The testing of metabolites would be a totally different topic. Hence, we do not feel comfortable in conducting rodent experiments for several reasons. First, AUD is a very complex pathology with physiological and psychological/psychiatric alterations that are obviously difficult to reproduce in animal models. Secondly, as mentioned by the reviewer, AUD pathology spontaneously leads to nutritional deficits, including significant reductions in carbohydrates, lipids, proteins and fiber intakes. We have recently published a paper in which we carefully conducted detailed dietary anamneses and described the changes in food habits in AUD patients (Amadieu et al., 2021). As explained below, some blood metabolites that are significantly correlated with depression, anxiety and craving belong to the xanthine family and are namely theobromine, theophylline, and paraxanthine, which derived from metabolism of coffee, tea or chocolate (which are not part of the normal diet of mice or rats).Therefore, conducting an experiment in animal model of ethanol exposure compared to pair-fed controls will omit the important impact of nutrition in blood metabolomics and consequently won’t mimic the human AUD pathology. In addition, if we take into consideration the European Directive 2010/63/EU (on the protection of animals used for scientific purposes) which aims at Reducing (Refining, Replacing) the number of animals used in experiment, it is extremely difficult to justify, at the ethical point of view, the need to reproduce human results in an animal model that won’t be able to mimic the nutritional, physiological and psychological alterations of alcohol use disorder.

      Concerning weakness #2: The classification of subjects to the group who have a history of heavy alcohol use was not solely on autopsy record, but was also based on medical history i.e. diagnosis of alcohol-related diseases: ICD-10 codes F10.X, G31.2, G62.1, G72.1, I42.6, K70.0-K70.4, K70.9, and K86.0, or signs of heavy alcohol use in the clinical or laboratory findings, e.g., increased levels of gamma-glutamyl transferase, mean corpuscular volume, carbohydrate-deficient transferrin, as stated in the methods section of the manuscript. In Finland, the medical records from the whole life of the subjects are available. We consider that getting diagnosis of alcohol-related disease is clear sign of history of heavy alcohol use.

      Concerning weakness#3:  As explained above, we do agree with the reviewer that AUD is not only “drinking alcohol” but is also associated with reduction in food intake that obviously influenced the metabolomics data presented in this current study.  We have therefore added some data, which have not been published before, in the results section that refer to key nutrients modified by alcohol intake and we refer to those data and their link with metabolomics in the discussion section:

      Results section page 8, Line 153-155. This sentence has been added:

      “The changes in metabolites belonging to the xanthine family during alcohol withdrawal could be explained by the changes in dietary intake of coffee, tea and chocolate (see Fig S5).”

      Discussion section: Page 11, Line 235-240.

      “Interestingly, the caffeine metabolites belonging to the xanthine family such as paraxanthine, theophylline and theobromine that were decreased at baseline in AUD patients compared to controls, increased significantly during alcohol withdrawal to reach the levels of healthy controls. Changes in dietary intake of coffee, tea and chocolate during alcohol withdrawal could explain these results”.

      In the conclusion, Page 16, Line 354-356, we clearly stated that: “LC-MS metabolomics plasma analysis allowed for the identification of metabolites that were clearly linked to alcohol consumption, and reflected changes in metabolism, alterations of nutritional status, and gut microbial dysbiosis associated with alcohol intake”

      Reference:

      Amadieu C, Leclercq S, Coste V, Thijssen V, Neyrinck AM, Bindels LB, Cani PD, Piessevaux H, Stärkel P, Timary P de, Delzenne NM. 2021. Dietary fiber deficiency as a component of malnutrition associated with psychological alterations in alcohol use disorder. Clinical Nutrition 40:2673–2682. doi:10.1016/j.clnu.2021.03.029

      Leclercq S, Cani PD, Neyrinck AM, Stärkel P, Jamar F, Mikolajczak M, Delzenne NM, de Timary P. 2012. Role of intestinal permeability and inflammation in the biological and behavioral control of alcohol-dependent subjects. Brain Behav Immun 26:911–918. doi:10.1016/j.bbi.2012.04.001

      Leclercq S, De Saeger C, Delzenne N, de Timary P, Stärkel P. 2014a. Role of inflammatory pathways, blood mononuclear cells, and gut-derived bacterial products in alcohol dependence. Biol Psychiatry 76:725–733. doi:10.1016/j.biopsych.2014.02.003

      Leclercq S, Matamoros S, Cani PD, Neyrinck AM, Jamar F, Stärkel P, Windey K, Tremaroli V, Bäckhed F, Verbeke K, de Timary P, Delzenne NM. 2014b. Intestinal permeability, gut-bacterial dysbiosis, and behavioral markers of alcohol-dependence severity. Proc Natl Acad Sci U S A 111:E4485–E4493. doi:10.1073/pnas.1415174111

      Voutilainen T, Kärkkäinen O. 2019. Changes in the Human Metabolome Associated With Alcohol Use: A Review. Alcohol and Alcoholism 54:225–234. doi:10.1093/alcalc/agz030

      Public Reviewer #2:

      The authors carried out the current studies with the justification that the biochemical mechanisms that lead to alcohol addiction are incompletely understood. The topic and question addressed here are impactful and indeed deserve further research. To this end, a metabolomics approach toward investigating the metabolic effects of alcohol use disorder and the effect of alcohol withdrawal in AUD subjects is valuable. However, it is primarily descriptive in nature, and these data alone do not meet the stated goal of investigating biochemical mechanisms of alcohol addiction. The current work's most significant limitation is the cross-sectional study design, though inadequate description and citation of the underlying methodological approaches also hampers interest. Most of the data are cross-sectional in the study design, i.e., alcohol use disorder vs controls. However, it is well established that there is a high degree of interpersonal variation with metabolism, and further, there is somewhat high intra-personal variation in metabolism over time. This means that the relatively small cohort of subjects is unlikely to reflect the broader condition of interest (AUD/withdrawal). The authors report a comparison of a later time-point after alcohol withdrawal (T2) vs. the AUD condition. However, without replicative time points from the control subjects it is difficult to assess how much of these changes are due to withdrawal vs the intra-personal variation described above.

      We agree with the reviewer. Our goal was not to investigate the biochemical mechanisms of AUD but rather to investigate how metabolomics could contribute to the psychological alterations of AUD. The goals of the study are defined at the end of the introduction (Page 4 – Lines 80-91), as follows:

      “The aims of this study are multiple. First, we investigated the impact of severe AUD on the blood metabolome by non-targeted LC-MS metabolomics analysis. Second, we investigated the impact of a short-term alcohol abstinence on the blood metabolome followed by assessing the correlations between the blood metabolome and psychological symptoms developed in AUD patients. Last, we hypothesized that metabolites significantly correlated with depression, anxiety or alcohol craving could potentially have neuroactive properties, and therefore the presence of those neuroactive metabolites was confirmed in the central nervous system using post-mortem analysis of frontal cortex and cerebrospinal fluid of persons with a history of heavy alcohol use. Our data bring new insights on xenobiotics- or microbial-derived neuroactive metabolites, which can represent an interesting strategy to prevent or treat psychiatric disorders such as AUD”.

      Due to the fact that the method section describing the study design is located at the end of the manuscript, we have decided to clarify the methodological approach in the first paragraph of the result section in order to show that in fact, we have performed a longitudinal study (which includes the same group of AUD, tested at two time points – at the beginning and at the end of alcohol withdrawal). This is stated as follows:

      Results section, Page 6, Line 97-99: “All patients were hospitalized for a 3-week detoxification program, and tested at two timepoints: T1 which represents the first day of alcohol withdrawal, and T2 which represents the last day of the detoxification program”.

      We propose to add a figure with a schematic representation of the protocol. We let the editor deciding whether this figure can be added (as supplemental material).

      Author response image 1.

      Schematic representation of the protocol

      We agree with the reviewer that the correlational analysis (between blood metabolites and psychological symptoms) is conducted at one time point (T1) only, which has probably led to the confusion between cross-sectional and longitudinal study. In fact we had a strong motivation to provide correlations at T1, instead of T2. T1, which is at the admission time, is really the moment where we can take into account variability of the psychological scores. Indeed, after 3 weeks of abstinence (T2), the levels of depression, anxiety and alcohol craving decreased significantly ( as shown in other studies from our group (Leclercq et al., 2014b, 2014a, 2012)) and remained pretty low in AUD patients, with a much lower inter-individual variability which makes the correlations less consistent.

      We agree with the reviewer that there is a high intra and inter-personal variability in the metabolomics data, that could be due to the differences in previous meals intakes within and between subjects. While AUD subjects have been tested twice (at the beginning and at the end of a 3-week detoxification program), the control subjects have only been tested once. Consequently, we did not take into account the intra-personal variability in the control group. The metabolomics changes observed in AUD patients between T1 and T2 are therefore due to alcohol withdrawal but also to intra-personal variability. This is a limitation of the study that we have now added in the discussion section, Page 16, Lines 354-357  as follows:

      “The selection of the control group is always challenging in alcohol research. Here, the healthy subjects were matched for sex, age and BMI but not for smoking status or nutritional intake. Alcohol addiction is a major cause of malnutrition in developed countries and tobacco smoking is more prevalent in alcohol users compared to healthy subjects. These two main confounding factors, although being an integral part of the alcoholic pathology, are known to influence the blood metabolome. Furthermore, another limitation is that the control group was tested only once, while the AUD patients were tested twice (T1 and T2). This means that we do not take into consideration the intra-personal variability of the metabolomics data when interpreting the results of alcohol withdrawal effects”.

      The limitation concerning the small sample size is already mentioned in the discussion section, as follows:

      “Large studies are usually required in metabolomics to observe small and medium size changes. Here, we included only 96 AUD patients, but they were all well characterized and received standardized therapies (for instance, vitB supplementation) during alcohol withdrawal”.

      Overall, there is not enough experimental context to interpret these findings into a biological understanding. For example, while several metabolites are linked with AUD and associated with microbiome or host metabolism based on existing literature, it's unclear from the current study what function these changes have concerning AUD, if any. The authors also argue that alcohol withdrawal shifts the AUD plasma metabolic fingerprint towards healthy controls (line 153). However, this is hard to assess based on the plots provided since the change in the direction of the orange data subset is considers AUD T2 vs T1. In contrast, AUD T2 vs Control would represent the claimed shift. To support these claims, the authors would better support their argument by showing this comparison as well as showing all experimental groups (including control subjects) in their multi-dimensional model (e.g., PCA).

      We thank the reviewer for these comments. It is true in this type of discovery-based approach the causality cannot be interpreted nor do we claim so. The aim was to characterize the metabolic alterations in this population, response to withdrawal period and suggest potential candidate metabolites linked to psychological symptoms. Rigorous pre-clinical assays and validation trials in humans are required to prove the causality, if any, of the discussed metabolites.

      The original claim on line 153 was poorly constructed and the Figure 2c is meant to visualize the influence of withdrawal on selected metabolites and also show the effect of chronic alcohol intake on the selected metabolites at baseline. The description of the Figure 2c has been modified in result section from line 156 onwards: “Overall, Fig. 2c demonstrates that a number of identified metabolites altered in sAUD patients relative to control are affected by alcohol withdrawal. Apart from 4-pyridoxic acid, cotinine, and heme metabolites bilirubin and biliverdin, the shifts observed in the selected metabolites are generally in the opposite direction as compared to the baseline.”

      The authors attempt to extend the significance of their findings by assessing post-mortem brain tissues from AUD subjects; however, the finding that many of the metabolites changed in T2/T1 are also present in AUD brain tissues is interesting; however, not strongly supporting of the authors' claims that these metabolites are markers of AUD (line 173). Concerning the plasma cohort itself, it is unclear how the authors assessed for compliance with alcohol withdrawal or whether the subjects' blood-alcohol levels were independently verified.

      We did not claim that the metabolites significantly correlated with the psychological symptoms - and present in central nervous system (frontal cortex or CSF) -  are “markers of AUD”. Line 173 did not refer to this idea, and the terms “markers of AUD” do not appear in the whole manuscript.

      Regarding the compliance with alcohol cessation, we did not assess the ethanol blood level. The patients are hospitalized for a 3-week detoxification program, they are not allowed to drink alcohol and are under strict control of the nurses and medical staff of the unit. Consuming alcoholic beverage within the hospitalization unit is a reason for exclusion. However, we carefully monitored the liver function during alcohol withdrawal. For the reviewers’ information, we have added here below, the evolution of liver enzymes (ALT, AST, gGT) during the 3-week detoxification program as indirect markers of alcohol abstinence.

      Author response image 2.

      Data are described as median ± SEM. AST, Aspartate transaminase; ALT, Alanine transaminase; gGT: gamma glutamyltranspeptidase. ** p<0.01 vs T1, *** p<0.001 vs T1

       

      The second area of concern is the need for more description of the analytical methodology, the lack of metabolite identification validation evidence, and related statistical questions. The authors cite reference #59 regarding the general methodology. However, this reference from their group is a tutorial/review/protocol-focused resource paper, and it is needs to be clarified how specific critical steps were actually applied to the current plasma study samples given the range of descriptions provided in the citations. The authors report a variety of interesting metabolites, including their primary fragment intensities, which are appreciated (Supplementary Table 3), but no MS2 matching scores are provided for level 2 or 3 hits. Further, level 1 hits under their definition are validated by an in-house standard, but no supporting data are provided besides this categorization. Finally, a common risk in such descriptive studies is finding spurious associations, especially considering many factors described in the current work. These include AUD, depression, anxiety, craving, withdrawal, etc. The authors describe the use of BH correction for multiple-hypothesis testing. However, this approach only accounts for the many possible metabolite association tests within each comparison (such as metabolites vs depression). It does not account for the multi-variate comparisons to the many behavior/clinical factors described above. The authors should employ one of several common strategies, such as linear mixed effects models, for these types of multi-variate assessments.

      The methodological details related to the sample processing, data acquisition, data pre-processing and metabolite identification have been provided in the supplementary materials and described below. Supplementary table 3 has been amended with characteristic MS2 fragments for both positive and negative ionization modes if data was available. Additionally, all annotations against the in-house library additions have been rechecked, identification levels corrected and EICs for all level 1 identifications are provided in the supplementary material.

      As described in the statistical analysis methods, BH correction was employed in the group-wise comparisons to shortlist the altered features for identification. Manual curating was then applied for the significant features and annotated metabolites subjected to correlation analysis. In this discovery-based approach the aim was to discover potential candidates linked with psychological symptoms for subsequent work to evaluate causality. Hence, the application of multi-variate analysis assessing biomarker candidates is not in the scope of this study.

      “LC-MS analysis. Plasma sample preparation and LC-MS measurement followed the parameters previously detailed in Klåvus et al (57).  Samples were randomized and thawed on ice before processing. 100 µl of plasma was added to 400 µl of LC-MS grade acetonitrile, mixed by pipetting four time, followed by centrifugation in 700 g for 5 minutes at 4 °C. A quality control sample was prepared by pooling 10 µl of each sample together. Extraction blanks having only cold acetonitrile and devoid of sample were prepared following the same procedure as sample extracts. LC-MS grade acetonitrile, methanol, water, formic acid and ammonium formate (Riedel-de Haën™, Honeywell, Seelze, Germany) were used to prepare mobile phase eluents in reverse phase (Zorbax Eclipse XDBC18, 2.1 × 100 mm, 1.8 μm, Agilent Technologies, Palo Alto, CA, USA) and hydrophilic interaction (Acquity UPLC® BEH Amide 1.7 μm, 2.1 × 100 mm, Waters Corporation, Milford, MA, USA) liquid chromatography separation. In reverse phase separation, the samples were analyzed by Vanquish Flex UHPLC system (Thermo Scientific, Bremen, Germany) coupled to high-resolution mass spectrometry (Q Exactive Focus, Thermo Scientific, Bremen, Germany) in both positive and negative polarity mass range from 120 to 1200, target AGC 1e6 and resolution 70,000 in full scan mode. Data dependent MS/MS data was acquired for both modes with target AGC 8e3 and resolution 17,500, precursor isolation window was 1.5 amu, normalized collision energies were set at 20, 30 and 40 eV and dynamic exclusion at 10.0 seconds. In hydrophobic interaction separation, the samples were analyzed by a 1290 LC system coupled to a 6540 UHD accurate mass Q-ToF spectrometer (Agilent Technologies, Waldbronn, Karlsruhe, Germany) using electrospray ionization (ESI, Jet Stream) in both positive and negative polarity with mass range from 50 to 1600 and scan rate of 1.67 Hz in full scan mode. Source settings were as in the protocol. Data dependent MS/MS data was acquired separately using 10, 20 and 40 eV collision energy in subsequent runs. Scan rate was set at 3.31 Hz, precursor isolation width of 1.3 amu and target counts/spectrum of 20,000, maximum of 4 precursor pre-cycle, precursor exclusion after 2 spectra and release after 15.0 seconds. Detectors were calibrated prior sequence and continuous mass axis calibration was performed throughout runs by monitoring reference ions from infusion solution for operating at high accuracy of < 2 ppm. Quality control samples were injected in the beginning of the analysis to equilibrate the system and after every 12 samples for quality assurance and drift correction in all modes. All data were acquired in centroid mode by either MassHunter Acquisition B.05.01 (Agilent Technologies) or in profile mode by Xcalibur 4.1 (Thermo Fisher Scientific) softwares.

      Metabolomics analysis of TSDS frontal cortex and CSF samples using the same 1290 LC system coupled with a 6540 UHD accurate mass Q-ToF spectrometer has been previously accomplished by Karkkainen et al (10).

      Peak picking and data processing. Raw instrumental data (*raw and *.d files) were converted to ABF format using Reifycs Abf Converter (https://www.reifycs.com/AbfConverter). MS-DIAL (Version 4.70) was employed for automated peak picking and alignment with the parameters according to Klåvus et al., 2020 (57) separately for each analytical mode. For the 6540 Q-ToF mass data minimum peak height was set at 8,000 and for the Q Exactive Focus mass data minimum peak height was set at 850,000. Commonly, m/z values up to 1600 and all retention times were considered, for aligning the peaks across samples retention time tolerance was 0.2 min and MS1 tolerance 0.015 Da and the “gap filling by compulsion” was selected. Alignment results across all modes and sample types as peak areas were exported into Microsoft Excel sheets to be used for further data pre-processing.

      Pre-processing including drift correction and quality assessment was done using the notame package v.0.2.1 R software version 4.0.3 separately for each mode. Features present in less than 80% of the samples within all groups and with detection rate in less than 70% of the QC samples were flagged. All features were subjected to drift correction where the features were log-transformed and a regularized cubic spline regression line was fitted for each feature against the quality control samples. After drift correction, QC samples were removed and missing values in the non-flagged features were imputed using random forest imputation. Finally, the preprocessed data from each analytical mode was merged into a single data matrix.

      Molecular feature characteristics (exact mass, retention time and MS/MS spectra) were compared against in-house standard library, publicly available databases such as METLIN, HMDB and LIPIDMAPS and published literature. Annotation of metabolites and the level of identification was based on the recommendations given by the Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI) (59): 1 = identified based on a reference standard, 2 = putatively annotated based on physicochemical properties or similarity with public spectral libraries, 3 = putatively annotated to a chemical class and 4 = unknown.”

      Reference 59: Sumner LW, Amberg A, Barrett D, Beale MH, Beger R, Daykin CA, et al. Proposed minimum reporting standards for chemical analysis. Metabolomics. 2007;3:211–221.

      Recommendations for the authors:

      Reviewer #1:

      (1) There should be more discussion comparing and contrasting the differences between the 2 cohorts (ALCOHOLBIS versus GUT2BRAIN), instead of stressing the similarities.

      As indicated in the results section, we have verified that the ALCOHOLBIS cohort and GUT2BRAIN cohort are similar in term of age, gender, smoking habits, drinking habits and severity of psychological symptoms. Those similar features are important to allow the combination of the metabolomics data from the two cohorts, which subsequently allows to have a bigger sample size (n = 96) and more statistical power.

      (2) The identification of 97 heavy alcohol users based on hospital codes at autopsy may not be the most rigorous way to define those with AUD. More information is needed on how these 97 were classified as heavy alcohol users.

      The classification of subjects to the group who have a history of heavy alcohol use was not based solely on the autopsy records. The classification was also based on medical history, which in Finland is available from the whole life of the subjects, and including diagnoses and laboratory finding. The subjects needed to have a diagnosis of alcohol-related disease, as stated in the methods section of the manuscript. However, since some of the used diagnoses are related to organ damage related to heavy alcohol use, we do not claim that these subjects would all have alcohol dependence. But history of heavy use of alcohol is needed to get organ damage associated with alcohol use. Therefore, we consider that diagnosis of alcohol-related disease is a clear sign of a history of heavy alcohol use.

      (3) The fact that the control group mainly died of cardiovascular disease confounds the interpretations around alcohol impact metabolite levels. How much of the metabolomics differences are related to hyperlipidemia or other CVD risk factors in the controls?

      There are no healthy controls in post-mortem studies, since all subjects need to die from something to be included to the cohort. The challenge in studying AUD is that they die relatively young. The only other group of individuals who die outside of hospital at the relatively same age as subjects with AUD are those with CVD. Post-mortem autopsies are done in Finland to all who die outside of hospital, and these are the main source of samples for post-mortem sample cohorts. Therefore, there is no other control group to compare AUD subject to in these types of studies.

      As for the altered metabolites in the post-mortem sample, the phospholipids observed could be associated with CVD. However, alterations in phospholipids are also commonly associated with alcohol use and AUD (for a review see (Voutilainen and Kärkkäinen, 2019)) and this effect is also seen in the results from the clinical cohorts in this study (Figure 1). Therefore, it cannot be said that these phospholipids finding would be due to selection of the control group.

      (4) When examining metabolomics alterations, it is extremely important to understand what people are eating (i.e., providing a substrate). A major confounding issue here is that heavy alcohol users typically choose drinking over eating food. How much of the observed alterations in the plasma metabolome is due to the decreased food intake? Some validation in animal models of ethanol exposure compared to pair-fed controls would help strengthen causal relationships between metabolites and alterations in the circulation and CNS.

      Regarding the validation in animal models of ethanol exposure, we were very careful in our discussion to avoid pretending that the study allowed to test causality of the factors. This was certainly not the objective of the present study. The testing of causality would indeed probably necessitate animal models but these models could only test the effects of one single metabolite at a time and could not at the same time capture the complexity of the changes occurring in AUD patients. The testing of metabolites would be a totally different topic. Hence, we do not feel comfortable in conducting rodent experiments for several reasons. First, AUD is a very complex pathology with physiological and psychological/psychiatric alterations that are obviously difficult to reproduce in animal models. Secondly, as mentioned by the reviewer, AUD pathology spontaneously leads to nutritional deficits, including significant reductions in carbohydrates, lipids, proteins and fiber intakes. We have recently published a paper in which we carefully conducted detailed dietary anamneses and described the changes in food habits in AUD patients (Amadieu et al., 2021). As explained below, some blood metabolites that are significantly correlated with depression, anxiety and craving belong to the xanthine family and are namely theobromine, theophylline, and paraxanthine, which derived from metabolism of coffee, tea or chocolate (which are not part of the normal diet of mice or rats).Therefore, conducting an experiment in animal model of ethanol exposure compared to pair-fed controls will omit the important impact of nutrition in blood metabolomics and consequently won’t mimic the human AUD pathology. In addition, if we take into consideration the European Directive 2010/63/EU (on the protection of animals used for scientific purposes) which aims at Reducing (Refining, Replacing) the number of animals used in experiment, it is extremely difficult to justify, at the ethical point of view, the need to reproduce human results in an animal model that won’t be able to mimic the nutritional, physiological and psychological alterations of alcohol use disorder.

      As explained above, we do agree with the reviewer that AUD is not only “drinking alcohol” but is also associated with reduction in food intake that obviously influenced the metabolomics data presented in this current study.  We have therefore added some data, which have not been published in the previous version of the manuscript, in the results section that refer to key nutrients modified by alcohol intake and we refer to those data and their link with metabolomics in the discussion section:

      Results section page 8, Line 153-155. This sentence has been added:

      “The changes in metabolites belonging to the xanthine family during alcohol withdrawal could be explained by the changes in dietary intake of coffee, tea and chocolate (see Fig S5).”

      Discussion section: Page 11, Line 234-238.

      “Interestingly, the caffeine metabolites belonging to the xanthine family such as paraxanthine, theophylline and theobromine that were decreased at baseline in AUD patients compared to controls, increased significantly during alcohol withdrawal to reach the levels of healthy controls. Changes in dietary intake of coffee, tea and chocolate during alcohol withdrawal could explain these results”.

      In the conclusion, Page 16, Line 360-32, we clearly stated that: “LC-MS metabolomics plasma analysis allowed for the identification of metabolites that were clearly linked to alcohol consumption, and reflected changes in metabolism, alterations of nutritional status, and gut microbial dysbiosis associated with alcohol intake”

      Reference:

      Amadieu C, Leclercq S, Coste V, Thijssen V, Neyrinck AM, Bindels LB, Cani PD, Piessevaux H, Stärkel P, Timary P de, Delzenne NM. 2021. Dietary fiber deficiency as a component of malnutrition associated with psychological alterations in alcohol use disorder. Clinical Nutrition 40:2673–2682. doi:10.1016/j.clnu.2021.03.029

      Leclercq S, Cani PD, Neyrinck AM, Stärkel P, Jamar F, Mikolajczak M, Delzenne NM, de Timary P. 2012. Role of intestinal permeability and inflammation in the biological and behavioral control of alcohol-dependent subjects. Brain Behav Immun 26:911–918. doi:10.1016/j.bbi.2012.04.001

      Leclercq S, De Saeger C, Delzenne N, de Timary P, Stärkel P. 2014a. Role of inflammatory pathways, blood mononuclear cells, and gut-derived bacterial products in alcohol dependence. Biol Psychiatry 76:725–733. doi:10.1016/j.biopsych.2014.02.003

      Leclercq S, Matamoros S, Cani PD, Neyrinck AM, Jamar F, Stärkel P, Windey K, Tremaroli V, Bäckhed F, Verbeke K, de Timary P, Delzenne NM. 2014b. Intestinal permeability, gut-bacterial dysbiosis, and behavioral markers of alcohol-dependence severity. Proc Natl Acad Sci U S A 111:E4485–E4493. doi:10.1073/pnas.1415174111

      Voutilainen T, Kärkkäinen O. 2019. Changes in the Human Metabolome Associated With Alcohol Use: A Review. Alcohol and Alcoholism 54:225–234. doi:10.1093/alcalc/agz030

      Reviewer #2:

      (1) More methodological information about the laboratory processing of samples, instrumentation, and data analysis needs to be provided. Reference 59 needs to be more specific and include important methodological details for this project. Please provide an actual methods section for the mass-spectrometry-based metabolomics.

      The reviewer is correct that the methods should be described in detail but due to word limits, the description was moved to a supplementary file. Methodological details are provided in the answer to the final comment in the public reviews section and we kindly refer to that for the methodological details. Reference 57 (Klåvus et al) is a method paper and covers the whole untargeted metabolomics pipeline that is used in our work.

      (2) The VIP figures, e.g., Figure 1b and Figure 2b are not very informative and would be better represented in a supplementary table

      VIP scores for all annotated metabolites are provided in the supplementary table 3 along with peak data and other values derived from statistical tests. Furthermore, we have removed the VIP value in figures 1 and 2 and we have replaced them by an updated Volcano plot to represent also the VIP values in addition to the q and Cohen’s d values.

      (3) The findings on odd-chain lyso-lipids are interesting, and while these have been reported biologically, odd-chain lipids are uncommon and should be validated with authentic standards as available (please provide an XIC of the level 1 peak and standard if possible, e.g., LPC 17:0) or at least a supplementary figure on manual inspection of the negative mode MS2 spectrum showing the putative fatty acid chain fragment. The current assignments are based on positive mode lipid class fragments and accurate mass.

      We thank the reviewer for pointing this out and it is correct that the negative MS2 spectrum is essential for lipid identification. Although the current assignments show only positive fragments for many lipids, the fatty acid chain, if reported, has been confirmed from negative mode MS2 spectrum. The supplementary table 3 with peak information has been augmented with fragment information from both negative and positive ionizations if available. Also, reference and experimental MS2 spectra have been provided as separate supplemental file for level 1 identifications, including the odd-chain lyso-lipids LPC 15:0 and 17:0.

      (4) Please provide some supplementary information (MS1/MS2 if available) on the untargeted features of interest (up and down-regulated) from Figure 1C, especially the 5 encircled features. If any manual annotation of these features was attempted, please include a brief description in the results/discussion.

      All statistically significant features with MS2 data have been subjected to manual annotation and database searches using at least METLIN, HMDB and LipidMaps. Additionally, if the manual inspection failed to provide any identification, in silico fragmentation software MS-FINDER was used to calculate candidate molecular formula. The features were labeled as unknown if all efforts were unsuccessful. The peak characteristics of the key unknowns in Figure 1b have also been included in the supplemental table.

      A note of the manual inspection has been included in the result section line 129: “The top-ranked metabolites in Fig. 1b remained unknown regardless of manual curation.”

      Reviewer #3:

      I think this is an interesting paper with a very solid methodology and an abundance of results. I am not an expert on metabolomics, and I have some very interesting hours here, trying (but sometimes failing) to grasp this paper's content. This paper also needs to be closely read by a reviewer who knows the metabolomics field and can give feedback on the meaning of the results. I have focused purely on the AUD clinical side as this is where I may contribute. My main concern is conceptualizing the aims and what authors want to investigate. As far as I understand, this is a study of the relationship between alcohol use and the metabolome, and in this respect, I think there are some issues.

      Just take the abstract that talks about (in the first sentence) alcohol use disorder ("AUD") - a term that generally sometimes refers to harmful use of alcohol and alcohol addiction and sometimes to all F10-diagnosis (and thus an inaccurate term), then the following sentence talks about what leads to alcohol addiction (not dependence) - and this in a mechanistic direction and in the last part of the second sentence talks about metabolomics being able to decipher metabolic events related to AUD. So, even in the first two sentences, it is confusing - is this about correlates, mechanisms, prevention, or treatment? The inaccuracy of terms continues in sentence 4. We have "chronic alcohol abuse" (?) and "severe alcohol use disorder (AUD)" (abbreviated for the second time). Later, only "alcohol abuse" is used and the abstract ends with something about these findings being interesting in "the management of [...] AUD". All this illustrates that there is a large mixture of concepts - what aspect of alcohol use or abuse are you looking at? Moreover, of intention: is it to find correlates, explanations, or targets for interventions? Without clarity in this respect, one can get lost in what all these interesting measures mean - how we should interpret them. This comment is made only for the abstract. However, but it is equally valid and important for the introduction and discussion parts of the ms, where additional terms and formulations are introduced: "heavy alcohol use" (lines 86-7) and "prevent or treat psychiatric disorders such as AUD" (lines 90-1). This is then reflected in the discussion where the authors claim that what they have found is related to "chronic alcohol abuse" (line 188), "heavy alcohol drinkers" (line 191), and "AUD patients" (lines 199 and 202 and further on).  

      We thank the reviewer for this useful comment and we apologize for the confusion. We agree that it is important to use the correct terms and definitions. All patients included in this study were diagnosed as severe AUD (for more information on the diagnosis, see answer to the comments related to DSM-IV and DSM5). This manuscript is consequently related to severe AUD and other terms like “alcohol abuse, “alcohol addiction” are therefore not appropriate. In the revised version of the manuscript, we have used severe AUD or the abbreviation sAUD. The figure and legends have been changed accordingly.

      In the first paragraph of the results section, ALCOHOLBIS and GUT2BRAIN are compared. It says they are similar on many measures, including craving, but different on some measures, again including craving. It is difficult to grasp this even if the authors try to explain (lines 101-2). This sentence also introduces some discussion in the results section by saying something normative about their finding and relating this to other research (references 12, 13, and 14).

      We would like to apologize for the confusion related to first paragraph of the results section. We have indeed indicated that, while the ALCOHOLBIS cohort and the GUT2BRAIN cohort are highly similar in term of biological and psychological features, a significant difference does exist in the compulsive component of the craving score. Indeed, the mean score of compulsion is 11 ± 3 in the ALCOHOLBIS cohort and 14  ± 3 in the GUT2BRAIN cohort. In healthy controls, the mean score of compulsion is 1.5 ± 1.5. Despite the statistically significant difference in craving between both cohorts, we do not think that this difference is relevant in our context since both scores (11 and 14) are considered high compared to the control group. In order to simplify the message, we have revised the first paragraph as follows:

      “Both groups of patients were similar in terms of age, gender, smoking and drinking habits and presented with high scores of depression, anxiety and alcohol craving at T1 (Table 1). These biological and psychological similarities allow us to combine both cohorts (and consequently increase sample size) and compare them to a group of heathy controls for metabolomics analysis”.

      In line 104 the abbreviation PCA is introduced but needs to be explained. Such objections could be made for many of the abbreviations used (sPLS-DA VIP, LPC, CSF, CNS, LPE, etc.), but of course, they may be made more difficult by the unusual way of stacking the different sections.

      We thank the reviewer for pointing these out. Most abbreviations are written out in the figure legends or method section but indeed the organization of the different sections makes it less evident. The abbreviations pointed out have been opened in the results section when they are first used.

      Furthermore, they say that the severity of AUD was "evaluated by a psychiatrist using the Diagnostic and Statistical Manual of Mental Disorders (DSM) criteria, fourth edition (DSM-IV) (ALCOHOLBIS cohort) or fifth edition (DSM-5)" (GUT2BRAIN cohort): This makes sense for DSM-5 but needs to be explained more for DSM-IV. They also need to say what levels were included.

      We thank the reviewer for this very appropriate remark that deserves some explanations.

      While the patients of the GUT2BRAIN cohort were enrolled in 2018-2019 where the DSM5 was applicable, the patients from the ALCOHOLBIS cohort were recruited many years before. The protocol related to the ALCOHOLBIS cohort was written before 2013, and approved by ethical committee, where the DSM-IV was the last version of the DSM used at that moment. 

      We therefore totally agree with the reviewer that our sentence “the severity of AUD was "evaluated by a psychiatrist using the Diagnostic and Statistical Manual of Mental Disorders (DSM) criteria, fourth edition (DSM-IV) (ALCOHOLBIS cohort) or fifth edition (DSM-5)" (GUT2BRAIN cohort)” is not correct. Indeed, DSM-IV (before 2013) described two distinct disorders, alcohol abuse and alcohol dependence, while the DSM-5 integrates the two DSM-IV disorders into a single disorder called alcohol use disorder with mild (2 or 3 symptoms), moderate (4 or 5 symptoms) and severe (6 or more symptoms) sub-classifications.

      In this present study, we have enrolled patients that received the diagnosis of alcohol dependence (DSM-IV criteria) or severe alcohol use disorder (DSM5 criteria).

      We have changed the paragraph related to this issue into this new one:

      “The severity of AUD was evaluated by a psychiatrist using the Diagnostic and Statistical Manual of Mental Disorders (DSM) criteria, fourth edition (DSM-IV) (Alcoholbis cohort) or fifth edition (DSM-5) (GUT2BRAIN cohort). Patients evaluated with the DSM-IV received the diagnosis of “alcohol dependence”, while the patients evaluated with the DSM-5 received the diagnosis of “severe alcohol use disorder” (6 or more criteria). To simplify, we used the term “sAUD” (for severe alcohol use disorder) that includes both diagnosis (sAUD and alcohol dependence)”.

      I am unsure about the shared first co-authorship and the shared last co-authorship request, but I leave this up to the editors and the journal policies. Also, the order of the different parts may be correct (the M+M placed last) but is unusual for many journals. This is also up to the journal to decide.

      As mentioned in the guidelines to authors, the method section should be included at the end of the manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      These experiments are some of the first to assess the role of dopamine release and the activity of D1 and D2 MSNs in pair bond formation in Mandarin voles. This is a novel and comprehensive study that presents exciting data about how the dopamine system is involved in pair bonding. The authors provide very detailed methods and clearly presented results. Here they show dopamine release in the NAc shell is enhanced when male voles encounter their pair bonded partner 7 days after cohabitation. In addition, D2 MSN activity decreases whereas D1 MSN activity increases when sniffing the pair-bonded partner.

      The authors do not provide justification for why they only use males in the current study, without discussing sex as a biological variable these data can only inform readers about one sex (which in pair-bonded animals by definition have 2 sexes). In addition, the authors do not use an isosbestic control wavelength in photometry experiments, although they do use EGFP control mice which show no effects of these interventions, a within-subject control such as an isosbestic excitation wavelength could give more confidence in these data and rule out motion artefacts within subjects.

      We agree with your suggestion that mechanism underlying pair bonding in females should also be investigated. In general, natal philopatry among mammals is female biased in the wild(Greenwood, 1983; Brody and Armitage, 1985; Ims, 1990; Solomon and Jacquot, 2002); social mammals are rarely characterized by exclusively male natal philopatry (Solomon and Jacquot, 2002). Males often disperse from natal area to a new place. Thus, males rodents may play a dominant role in the formation and maintenance of mating relationships. This is a reason we investigate pair bonding in male firstly. Certainly, female mate selection, and sexual receptivity or refusal through olfactory cues from males, thereby affect the formation and maintenance of pair bonding (Hoglen and Manoli, 2022). This is also the reason why we should focus on the mechanisms underlying pair bonding formation in females in the future research. This has been added in the limitation in the discussion.

      In photometry experiments, rAAV-D1/D2-GCaMP6m, a D1/D2 genetically encoded fluorescent calcium sensor, was injected into the NAc shell. The changes in fluorescence signals during these social interactions were collected and digitalized. To assess the specific response to social stimulus in fluorescence signals, changes in fluorescence signals during non-social behavioral bouts (such as freezing, exploration of the environment, grooming, rearing, etc…) were also recorded and analyzed. The result showed that dopamine release or D1/D2 MSNs activity displayed no significant changes after cohabitation of 3 or 7 days upon occurring of no-social behavior such as freezing, exploring, grooming and rearing. In addition, GCaMP6m is a genetically encoded calcium indicator. Changes in its fluorescence signal reflect changes in intracellular calcium ion concentration. Using EGFP virus as a control, it can be determined whether the fluorescence signal observed in the experiment is generated by the specific response of GCaMP6m to calcium or if there are other non-specific factors leading to fluorescence changes. If there is no similar fluorescence change in the EGFP control group, it can more strongly prove that the signal detected by GCaMP6m is a calcium-related specific signal. In some research article, they also use EGFP control group in photometry experiments (Yamaguchi et al., 2020; Qu et al., 2024; Zhan et al., 2024). Therefore, changes in fluorescence signals observed in the present study reflect neuron activities upon specific social behaviors, but were not affected by motion artefacts.

      There is an existing literature (cited in this manuscript) from Aragona et al., (particularly Aragona et al., 2006) which has highlighted key differences in the roles of rostral versus caudal NAc shell dopamine in pair bond formation and maintenance. Specifically, they report that dopamine transmission promoting pair bonding only occurs in the rostral shell and not the caudal shell or core regions. Given that the authors have targeted more caudally a discussion of how these results fit with previous work and why there may be differences in these areas is warranted.

      Thanks for your professional consideration. The brain coordinates of Bilateral 26-gauge guide cannulae were NAc (1.6 mm rostral, ± 1 mm bilateral, 4.5 mm ventral (for shell), 3.5 mm ventral (for core) from bregma) in report from Aragona et al (2006). In the present study, the brain coordinates of virus injection were (AP: +1.5, ML: ±0.99, DV: −4.2 (for NAc shell)). Thus, the virus injection sites were close to rostral shell in our study. However, as the diffusive expression of the virus, part of neurons in the rostrocaudal border and caudal shell also be infected by the virus, so we did not distinguish different subregions of NAc shell. In the future, we will use AAV13, a viral strategy could target / manipulate precise local neural populations, to address this issue. NAc is a complex brain structure with distinct regions that have different functions. Previous study suggested that GABAergic substrates of positive and negative types of motivated behavior in the nucleus accumbens shell are segregated along a rostrocaudal gradient (Reynolds and Berridge, 2001). However, a study found that food intake is significantly enhanced by administering μ-selective opioid agonists into the NAc, especially its shell region (Znamensky et al., 2001). Also, μ-opioid stimulation increases the motivation to eat (“wanting”) both in the NAc shell and throughout the entire NAc, as well as in several limbic or striatal structures beyond. For DAMGO stimulation of eating, the “wanting” substrates anatomically extend additionally beyond the rostrodorsal shell and throughout the entire shell (the caudal shell). Furthermore, DAMGO stimulates eating at NAc shell and core, as well as the neostriatum, amygdala…(Gosnell et al., 1986; Gosnell and Majchrzak, 1989; Peciña and Berridge, 2000; Zhang and Kelley, 2000; Echo et al., 2002; Peciña and Berridge, 2005, 2013; Castro and Berridge, 2014). In pair bond formation and maintenance, the rostral shell is the specific subregion of the NAc important for DA regulation of partner preference (Aragona et al., 2006). In conclusion, it appears that the changes in real time dopamine release and activities and electrophysiological properties of D1R, D2R MSNs in the NAc shell after pair bond formation may have primarily targeted to the rostral shell in our study, which is consistent with the report from Aragona et al.

      The authors could discuss the differences between pair bond formation and pair bond maintenance more deeply.

      Thanks for your suggestion. I have discussed the differences between pair bond formation and pair bond maintenance more deeply.

      The dopamine and different types of dopamine receptors in the NAc may play different roles in regulation of pair bond formation and maintenance. The chemogenetic manipulation revealed that VP-projecting D2 MSNs are necessary and more important in pair bond formation compared to VPprojecting D1 MSNs. It is consistent with previous pharmacological experiments that blocking of D2R with its specific antagonist, while D1R was not blocked, can prevent the formation of a pair bond in prairie voles (Gingrich et al., 2000). This indicates that D2R is crucial for the initial formation of the pair bond. D2R is involved in the reward aspects related to mating. In female prairie voles, D2R in the NAc is important for partner preference formation. The activation of D2R may help to condition the brain to assign a positive valence to the partner's cues during mating, facilitating the development of a preference for a particular mate. In addition, the cohabitation caused the DA release, the high affinity Gi-coupled D2R was activated first, which inhibited D2 MSNs activity and promoted the pair bond formation. And then, after 7 days of cohabitation, the pair bonding was already established, the significantly increased release of dopamine significantly activated Gs-coupled D1R with the low affinity to dopamine, which increased D1 MSNs activity and maintained the formation of partner preference. While D1R is also present and involved in the overall process, its role in the initial formation of the pair bond is not as dominant as D2R (Aragona et al., 2006). However, it still participates in the neurobiological processes related to pair bond formation. For example, in male mandarin voles, after 7 days of cohabitation with females, D1R activity in the NAc shell was affected during pair bond formation. The extracellular DA concentration was higher when sniffing their partner compared to a stranger, and this increase in DA release led to an increase in D1R activity in the NAc shell. In prairie voles, dopamine D1 receptors seem to be essential for pair bond maintenance. Neonatal treatment with D1 agonists can impair partner preference formation later in life, suggesting an organizational role for D1 in maintaining the bond (Aragona et al., 2006). In pair-bonded male prairie voles, D1R is involved in inducing aggressive behavior toward strangers, which helps to maintain the pair bond by protecting it from potential rivals. In the NAc shell, D1 agonist decreases the latency to attack same-sex conspecifics, while D1 antagonism increases it (Aragona et al., 2006). In summary, D2R is more crucial for pair bond formation, being involved in reward association and necessary for the initial development of the pair bond. D1R, on the other hand, is more important for pair bond maintenance, being involved in aggression and mate guarding behaviors and having an organizational role in maintaining the pair bond over time. We therefore suggest that D2 MSNs are more predominantly involved in the formation of a pair bond compared with D1 MSNs.

      The authors have successfully characterised the involvement of dopamine release, changes in D1 and D2 MSNs, and projections to the VP in pair bonding voles. Their conclusions are supported by their data and they make a number of very reasonable discussion points acknowledging various limitations

      Reviewer #2 (Public review):

      Summary:

      Using in vivo fiber-photometry the authors first establish that DA release when contacting their partner mouse increases with days of cohabitation while this increase is not observed when contacting a stranger mouse. Similar effects are found in D1-MSNs and D2-MSNs with the D1MSN responses increasing and D2-MSN responses decreasing with days of cohabitation. They then use slice physiology to identify underlying plasticity/adaptation mechanisms that could contribute to the changes in D1/D2-MSN responses. Last, to address causality the authors use chemogenetic tools to selectively inhibit or activate NAc shell D1 or D2 neurons that project to the ventral pallidum. They found that D2 inhibition facilitates bond formation while D2 excitation inhibits bond formation. In contrast, both D1-MSN activation and inhibition inhibit bond formation.

      Strengths:

      The strength of the manuscript lies in combining in vivo physiology to demonstrate circuit engagement and chemogenetic manipulation studies to address circuit involvement in pair bond formation in a monogamous vole.

      Weaknesses:

      Comment: Weaknesses include that a large set of experiments within the manuscript are dependent on using short promoters for D1 and D2 receptors in viral vectors. As the authors acknowledge this approach can lead to ectopic expression and the presented immunohistochemistry supports this notion. It seems to me that the presented quantification underestimates the degree of ectopic expression that is observed by eye when looking at the presented immunohistochemistry. However, given that Cre transgenic animals are not available for Microtus mandarinus and given the distinct physiological and behavioral outcomes when imaging and manipulating both viral-targeted populations this concern is minor.

      Thanks for your professional comment. The virus used in the present study were purchased from brainVTA company. D1/D2 receptor promoter genes were predicted and amplified for validation by the company. The promoter gene was constructed and packaged by aav virus vector (taking rAAV-D2-mCherry-WPRE-bGH_polyA virus as an example, Author response image 1A). The D1/D2 promoter sequence is shown in the Author response image 1B-C. In addition, the D1 receptor gene promoter and D2 receptor gene promoter viruses used in this paper have been used in several published papers with high specificity (Zhao et al., 2019; Ying et al., 2022). In our paper, a high proportion of virus and mRNA co-localization was found through FISH verification and also showed high specificity of virus (Figure S15, S16).

      Author response image 1.

      (A)   Gene carrier of rAAV-D2-mCherry-WPRE-bGH_polyA. (B-C) Gene sequence of D1 promoter and D2 promoter.

      The slice physiology experiments provide some interesting outcomes but it is unclear how they can be linked to the in vivo physiological outcomes and some of the outcomes don't match intuitively (e.g. cohabitation enhances excitatory/inhibitory balance in D2-MSNs but the degree of contact-induced inhibition is enhanced in D2-MSN).

      Thanks for your comment. The present study found that the frequencies of sEPSC and sIPSC were significantly enhanced after the formation of a pair bond in NAc shell D2 MSNs. The excitatory/inhibitory balance of D2 MSNs was enhanced after cohabitation.These results are not consistent with the findings from fiber photometry of calcium signals. One study showed that NAc D2 MSNs was linked to both ‘liking’ (food consumption) and ‘wanting’ (food approach) but with opposing actions; high D2 MSNs activity signaled ‘wanting’, and low D2 MSNs activity enhanced ‘liking’. D2 MSNs are faced with a tradeoff between increasing ‘wanting’ by being more active or allowing ‘liking’ by remaining silent (Guillaumin et al., 2023). Therefore, the increase in frequencies of sEPSC and sIPSC in D2 MSNs may reflect two processes, liking and wanting, respectively. We thought that hedonia and motivation might influence D2 MSNs activity differently during cohabitation and contribute to the processing of pair bond formation in a more dynamic and complex way than previously expected.

      Moreover, the frequencies of sEPSC and sIPSC were significantly reduced in the NAc shell D1 MSNs after pair bonding, whereas the intrinsic excitability increased after cohabitation with females.

      The bidirectional modifications (reduced synaptic inputs vs. increased excitability) observed in D1 MSNs might result from homeostatic regulation. The overall synaptic transmission may produce no net changes, given that reductions in both excitatory and inhibitory synaptic transmission of D1 MSNs were observed. Also, increases in the intrinsic excitability of D1 MSNs would result in an overall excitation gain on D1 MSNs.

      One interesting finding is that the relationship between D2-MSN and pair bond formation is quite clear (inhibition facilitates while excitation inhibits pair bond formation). In contrast, the role of D1-MSNs is more complicated since both excitation and inhibition disrupt pair bond formation. This is not convincingly discussed.

      Considering the reviewer’s suggestion, the discussion has been added in the revised manuscript.

      In the present study, DREADDs approaches were used to inhibit or excite NAc MSNs to VP projection and it was found that D1 and D2 NAc MSNs projecting to VP play different roles in the formation of a pair bond. Chemogenetic inhibition of VP-projecting D2 MSNs promoted partner preference formation, while activation of VP-projecting D2 MSNs inhibited it (Figure 6). Chemogenetic activation of D2 MSNs produced the opposite effect of DA on the D2 MSNs on partner preference, while inhibition of these neurons produced the same effects of DA on D2 MSNs. DA binding with D2R is coupled with Gi and produces an inhibitory effect (Lobo and Nestler, 2011). It is generally assumed that activation of D2R produces aversive and negative reinforcement. These results were consistent with the reduced D2 MSNs activity upon sniffing their partner in the fiber photometry test and the increased frequency and amplitude of sIPSC in the present study. Our results also agree with other previous studies that chemogenetic inhibition of NAc D2 MSNs is sufficient to enhance reward-oriented motivation in a motivational task (Carvalho Poyraz et al., 2016; Gallo et al., 2018). Inhibition of D2 MSNs during self-administration enhanced response and motivation to obtain cocaine (Bock et al., 2013). This also suggests that the mechanism underlying attachment to a partner and drug addiction is similar.

      Besides, in the present study, the formation of partner preference was inhibited after activation or inhibition of VP-projecting D1 MSNs, which is not consistent with conventional understanding of prairie vole behavior. Alternatively, DA binding with D1R is coupled with Gs and produces an excitatory effect (Lobo and Nestler, 2011), while activation of D1R produces reward and positive reinforcement (Hikida et al., 2010; Tai et al., 2012; Kwak and Jung, 2019). For example, activation of D1 MSNs enhances the cocaine-induced conditioned place preference (Lobo et al., 2010). In addition, D1R activation by DA promotes D1 MSNs activation, which promotes reinforcement. However, a recent study found that NAc-ventral mesencephalon D1 MSNs promote reward and positive reinforcement learning; in contrast, NAc-VP D1 MSNs led to aversion and negative reinforcement learning (Liu et al., 2022). It is consistent with our results that activation of NAc-VP D1 MSNs pathway reduced time spent side-by-side and impaired partner preference after 7 days of cohabitation. In contrast to inhibition of D2 MSNs, we found that inhibition of the D1 MSNs did not elicit corresponding increases in partner preference. One possible explanation is that almost all D1 MSNs projecting to the VTA/ substantia nigra (SN) send collaterals to the VP (Pardo-Garcia et al., 2019). For example, optogenetically stimulating VP axons may inadvertently cause effects in the VTA/SN through the antidromic activation of axon collaterals (Yizhar et al., 2011). Therefore, chemogenetic inhibition of D1 MSNs may also inhibit DA neurons in VTA, subsequently inhibiting the formation of a pair bond.

      The dopamine and different types of dopamine receptors in the NAc may play different roles in regulation of pair bond formation and maintenance. The chemogenetic manipulation revealed that VP-projecting D2 MSNs are necessary and more important in pair bond formation compared to VPprojecting D1 MSNs. It is consistent with previous pharmacological experiments that blocking of D2R with its specific antagonist, while D1R was not blocked, can prevent the formation of a pair bond in prairie voles (Gingrich et al., 2000). This indicates that D2R is crucial for the initial formation of the pair bond. D2R is involved in the reward aspects related to mating. In female prairie voles, D2R in the NAc is important for partner preference formation. The activation of D2R may help to condition the brain to assign a positive valence to the partner's cues during mating, facilitating the development of a preference for a particular mate. In addition, the cohabitation caused the DA release, the high affinity Gi-coupled D2R was activated first, which inhibited D2 MSNs activity and promoted the pair bond formation. And then, after 7 days of cohabitation, the pair bonding was already established, the significantly increased release of dopamine significantly activated Gs-coupled D1R with the low affinity to dopamine, which increased D1 MSNs activity and maintained the formation of partner preference. While D1R is also present and involved in the overall process, its role in the initial formation of the pair bond is not as dominant as D2R (Aragona et al., 2006). However, it still participates in the neurobiological processes related to pair bond formation. For example, in male mandarin voles, after 7 days of cohabitation with females, D1R activity in the NAc shell was affected during pair bond formation. The extracellular DA concentration was higher when sniffing their partner compared to a stranger, and this increase in DA release led to an increase in D1R activity in the NAc shell. In prairie voles, dopamine D1 receptors seem to be essential for pair bond maintenance. Neonatal treatment with D1 agonists can impair partner preference formation later in life, suggesting an organizational role for D1 in maintaining the bond (Aragona et al., 2006). In pair-bonded male prairie voles, D1R is involved in inducing aggressive behavior toward strangers, which helps to maintain the pair bond by protecting it from potential rivals. In the NAc shell, D1 agonist decreases the latency to attack same-sex conspecifics, while D1 antagonism increases it (Aragona et al., 2006). In summary, D2R is more crucial for pair bond formation, being involved in reward association and necessary for the initial development of the bond. D1R, on the other hand, is more important for pair bond maintenance, being involved in aggression and mate guarding behaviors and having an organizational role in maintaining the bond over time. We therefore suggest that D2 MSNs are more predominantly involved in the formation of a pair bond compared with D1 MSNs.

      It seemed a missed opportunity that physiological readout is limited to males. I understand though that adding females may be beyond the scope of this manuscript.

      We gratefully appreciate for your valuable comment. The reviewer 1 also concerned this issue. We made a following response.

      In general, natal philopatry among mammals is female biased in the wild(Greenwood, 1983; Brody and Armitage, 1985; Ims, 1990; Solomon and Jacquot, 2002); social mammals are rarely characterized by exclusively male natal philopatry (Solomon and Jacquot, 2002). Males often disperse from natal area to a new place. Thus, male rodents may play a dominant role in the formation and maintenance of mating relationships. This is a reason we investigate pair bonding in male firstly. Certainly, female mate selection, and sexual receptivity or refusal through olfactory cues from males, thereby affect the formation and maintenance of pair bonding (Hoglen and Manoli, 2022). This is also the reason why we should focus on the mechanisms underlying pair bonding formation in females in the future research. This has been added in the limitation in the discussion.

      Reviewer #3 (Public review):

      Summary:

      The manuscript is evaluating changes in dopamine signaling in the nucleus accumbens following pair bonding and exposure to various stimuli in mandarin voles. In addition, the authors present chemogenetic data that demonstrate excitation and inhibition of D1 and D2 MSN affect pair bond formation.

      Strengths:

      The experimental designs are strong. The approaches are innovative and use cutting-edge methods.

      The manuscript is well written.

      Weaknesses:

      The statistical results are not presented, and not all statistical analyses are appropriate.

      Additionally, some details of methods are absent.

      As you suggested, we added the detailed information in the revised manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Remove references to 'extreme significance' - p is set as a threshold and the test is either significant or not.

      Thanks for your suggestion. We have removed 'extreme significance' in the revised manuscript.

      (2) The second half of the abstract is a little confusing the use of activation/inhibition makes it difficult to read and follow, this could be re-worded for clarity.

      Sorry for the confusing. We reorganized the sentence as following.

      In addition, chemogenetic inhibition of ventral pallidum-projecting D2 MSNs in the NAc shell enhanced pair bond formation, while chemogenetic activation of VP-projecting D2 MSNs in the NAc shell inhibited pair bond formation.

      Reviewer #2 (Recommendations for the authors):

      (1) In many instances repeated measures are presented from the same mice (e.g. Figures 1F, I; S1BC). Repeated measures for each mouse should be connected with a line in the figures. This will allow the reader to visually compare the repeated measures for each animal.

      Thanks for your careful consideration. As reviewer suggested, the figures have been changed.

      (2) It is unclear to me how the time point 0 for sniffing was determined. How is the time point 0 for side-by-side contact determined?

      Sniffing is a behavior for olfactory investigation and defined as animals uses nose to inspect any portion of the stimulus mouse’s body, including the tail. The time point 0 for sniffing was the beginning of sniffing behavior occurs. The side-by-side behavior is defined as significant physical contact with a social object and huddle in a quiescent state. The time point 0 for side-byside behavior was the beginning of side-by-side behavior occurs.

      (3) Figure 1-3: For the fiber photometry data 7 events (sniffs) are shown in the heat maps. Are these the first 7 sniffs? What went into the quantification? It seems that DA and D1/D2 responses are habituating. This could be analyzed and would need to be discussed.

      In the heat maps (Figure 1-3), we showed the mean fluorescence signal changes of every subject (n = 7 voles) upon sniffing partner, stranger or an object in the experiment, but not the fluorescence signal changes of sniffing events in one vole. The quantification of changes in mean fluorescence signal of all subjects was showed in Figure 1F, 1I, Figure 2F, 2I, Figure 3F and 3I.

      (4) Generally, it is very difficult to obtain cell type selectivity using short promoters in viruses (the authors acknowledge this). Which D1 and D2 promoter sequences were used for obtaining specificity? The degree of ectopic expression looks much higher than the quantification (e.g. in Fig. 3b, 6C, 7C, S14A, C). Is this due to thresholding?

      The virus used in the present study were purchased from brainVTA company. D1/D2 receptor promoter genes were predicted and amplified for validation. The promoter gene was constructed and packaged by aav virus vector (taking rAAV-D2-mCherry-WPRE-bGH_polyA virus as an example, Author response image 1A). The D1/D2 promoter sequence is shown in the Author response image 1B-C. In addition, the D1 receptor gene promoter and D2 receptor gene promoter viruses used in this paper have been used in several published papers with high specificity (Zhao et al., 2019; Ying et al., 2022). In the Figure 6C, the first image is the merged fluorescence images that were taken under different fluorescence channels with the 20X objective. The second and the third images were taken under 40X objective from field of white box in the first image. The second and the third images were merged into fourth one. Due to the different exposure time and intensity, the fluorescence photo taken at 40X are clearer compared to image taken at the 20X. For example, in the Figure 6C, the labeled-cells were presented as following (Author response image 2). In our paper,virus infection and mRNA through FISH verification were co-localized in a high proportion displaying high specificity of virus (Figure S15, S16).Certainly, the number of positive neurons may be dependent on visuality (thresholding). Only visible cells were counted. The cell counting results at Author response image 2B and 2C are similar to the quantification in the Figure 6C.

      Author response image 2.

      (A) Immunohistological image showing co-localization of hM3Dq- mCherry-anti expression (green), D2R-mRNA (red), and DAPI (blue) in the NAc shell. Scale bar: 100 μm. (B) The cell counts and the determination of colocalization of the 20× immunohistochemistry images. The marked neurons were counted with white dots. (C) The cell counts and the determination of colocalization of the 40× immunohistochemistry images. The marked neurons were counted with white dots.

      (5) Figure 6D/7D: the time scale seems to be off for both traces (40 seconds). For the hM3D Gq experiment, only one trace is shown. It would be more convincing to provide an input-output curve from several mice and to statistically compare the curves.

      Response: Thanks for your careful consideration. As reviewer suggested, the figure of resting membrane potentials before and after drug CNO exposure from several voles was added in the revised manuscript.

      (6) The presence of GIRK channels in MSNs has been a long debate and hM4D Gi activation may mostly act at the level of terminals by inhibiting neurotransmitter release. For demonstrating hyperpolarization of the soma showing the resting membrane potential before and after drug CNO exposure would be more convincing.

      Thanks for your careful consideration. As reviewer suggested, the figure of resting membrane potential before and after drug CNO exposure was added in the revised manuscript.

      (7) It is unclear to me how far the slice physiology informs the in vivo physiology (e.g. cohabitation enhances excitatory/inhibitory balance in D2-MSNs but the degree of contact-induced inhibition is enhanced in D2-MSN; D2-MSNs become less responsive to DA in the slice yet but at the time of enhanced DA release D2-MSN activity is also strongly reduced).

      The present study found that the frequencies of sEPSC and sIPSC were significantly enhanced after the formation of a pair bond in NAc shell D2 MSNs. The excitatory/inhibitory balance of D2 MSNs was enhanced after cohabitation. These results are not consistent with the findings from fiber photometry of calcium signals. One study showed that NAc D2 MSNs was linked to both ‘liking’ (food consumption) and ‘wanting’ (food approach) but with opposing actions; high D2 MSNs activity signaled ‘wanting’, and low D2 MSNs activity enhanced ‘liking’. D2 MSNs are faced with a tradeoff between increasing ‘wanting’ by being more active or allowing ‘liking’ by remaining silent (Guillaumin et al., 2023). Therefore, the increase in frequencies of sEPSC and sIPSC in D2 MSNs may reflect two processes, liking and wanting, respectively. We thought that hedonia and motivation might different influence D2 MSNs activity during cohabitation and contribute to the processing of pair bond formation in a more dynamic and complex way than previously expected.

      Moreover, the frequencies of sEPSC and sIPSC were significantly reduced in the NAc shell D1

      MSNs after pair bonding, whereas the intrinsic excitability increased after cohabitation with females.

      The bidirectional modifications (reduced synaptic inputs vs. increased excitability) observed in D1 MSNs might result from homeostatic regulation. The overall synaptic transmission may produce no net changes, given that reductions in both excitatory and inhibitory synaptic transmission of D1 MSNs were observed. Also, increases in the intrinsic excitability of D1 MSNs would result in an overall excitation gain on D1 MSNs.

      (8) One interesting finding is that the relationship between D2-MSN and pair bond formation is quite clear (inhibition facilitates while excitation inhibits pair bond formation). In contrast, the role of D1-MSNs is more complicated since both excitation and inhibition disrupt pair bond formation.

      The discussion of this would benefit from another attempt.

      As reviewer suggested, the discussion was added in the revised manuscript.

      In the present study, DREADDs approaches were used to inhibit or excite NAc MSNs to VP projection and it was found that D1 and D2 NAc MSNs projecting to VP play different roles in the formation of a pair bond. Chemogenetic inhibition of VP-projecting D2 MSNs promoted partner preference formation, while activation of VP-projecting D2 MSNs inhibited it (Figure 6). Chemogenetic activation of D2 MSNs produced the opposite effect of DA on the D2 MSNs on partner preference, while inhibition of these neurons produced the same effects of DA on D2 MSNs. DA binding with D2R is coupled with Gi and produces an inhibitory effect (Lobo and Nestler, 2011). It is generally assumed that activation of D2R produces aversive and negative reinforcement. These results were consistent with the reduced D2 MSNs activity upon sniffing their partner in the fiber photometry test and the increased frequency and amplitude of sIPSC in the present study. Our results also agree with other previous studies, which showed that chemogenetic inhibition of NAc D2 MSNs is sufficient to enhance reward-oriented motivation in a motivational task (Carvalho Poyraz et al., 2016; Gallo et al., 2018). Inhibition of D2 MSNs during self-administration enhanced response and motivation to obtain cocaine (Bock et al., 2013). This also suggests that the mechanism underlying attachment to a partner and drug addiction is similar.

      Besides, in the present study, the formation of partner preference was inhibited after activation or inhibition of VP-projecting D1 MSNs, which is not consistent with conventional understanding of prairie vole behavior. Alternatively, DA binding with D1R is coupled with Gs and produces an excitatory effect (Lobo and Nestler, 2011), while activation of D1R produces reward and positive reinforcement (Hikida et al., 2010; Tai et al., 2012; Kwak and Jung, 2019). For example, activation of D1 MSNs enhances the cocaine-induced conditioned place preference (Lobo et al., 2010). In addition, D1R activation by DA promotes D1 MSNs activation, which promotes reinforcement. However, a recent study found that NAc-ventral mesencephalon D1 MSNs promote reward and positive reinforcement learning; in contrast, NAc-VP D1 MSNs led to aversion and negative reinforcement learning (Liu et al., 2022). It is consistent with our results that activation of NAc-VP D1 MSNs pathway reduced time spent side-by-side and impaired partner preference after 7 days of cohabitation. In contrast to inhibition of D2 MSNs, we found that inhibition of the D1 MSNs did not elicit corresponding increases in partner preference. One possible explanation is that almost all D1 MSNs projecting to the VTA/ substantia nigra (SN) send collaterals to the VP (Pardo-Garcia et al., 2019). For example, optogenetically stimulating VP axons may inadvertently cause effects in the VTA/SN through the antidromic activation of axon collaterals (Yizhar et al., 2011). Therefore, chemogenetic inhibition of D1 MSNs may also inhibit DA neurons in VTA, subsequently inhibiting the formation of a pair bond.

      The dopamine and different types of dopamine receptors in the NAc may play different roles in regulation of pair bond formation and maintenance. The chemogenetic manipulation revealed that VP-projecting D2 MSNs are necessary and more important in pair bond formation compared to VPprojecting D1 MSNs. It is consistent with previous pharmacological experiments that blocking of D2R with its specific antagonist, while D1R was not blocked, can prevent the formation of a pair bond in prairie voles (Gingrich et al., 2000). This indicates that D2R is crucial for the initial formation of the pair bond. D2R is involved in the reward aspects related to mating. In female prairie voles, D2R in the NAc is important for partner preference formation. The activation of D2R may help to condition the brain to assign a positive valence to the partner's cues during mating, facilitating the development of a preference for a particular mate. In addition, the cohabitation caused the DA release, the high affinity Gi-coupled D2R was activated first, which inhibited D2 MSNs activity and promoted the pair bond formation. And then, after 7 days of cohabitation, the pair bonding was already established, the significantly increased release of dopamine significantly activated Gs-coupled D1R with the low affinity to dopamine, which increased D1 MSNs activity and maintained the formation of partner preference. While D1R is also present and involved in the overall process, its role in the initial formation of the pair bond is not as dominant as D2R (Aragona et al., 2006). However, it still participates in the neurobiological processes related to pair bond formation. For example, in male mandarin voles, after 7 days of cohabitation with females, D1R activity in the NAc shell was affected during pair bond formation. The extracellular DA concentration was higher when sniffing their partner compared to a stranger, and this increase in DA release led to an increase in D1R activity in the NAc shell. In prairie voles, dopamine D1 receptors seem to be essential for pair bond maintenance. Neonatal treatment with D1 agonists can impair partner preference formation later in life, suggesting an organizational role for D1 in maintaining the bond (Aragona et al., 2006). In pair-bonded male prairie voles, D1R is involved in inducing aggressive behavior toward strangers, which helps to maintain the pair bond by protecting it from potential rivals. In the NAc shell, D1 agonist decreases the latency to attack same-sex conspecifics, while D1 antagonism increases it (Aragona et al., 2006). In summary, D2R is more crucial for pair bond formation, being involved in reward association and necessary for the initial development of the bond. D1R, on the other hand, is more important for pair bond maintenance, being involved in aggression and mate guarding behaviors and having an organizational role in maintaining the bond over time. We therefore suggest that D2 MSNs are more predominantly involved in the formation of a pair bond compared with D1 MSNs.

      (9) For the chemogenetic inhibition/excitation experiment please specify the temporal relationship between CNO injection and the behavioral testing. Are the DREADDs activated during the preference testing or are we only looking at the consequences of DREADD activation during cohabitation? This would impact the interpretation of the results.

      Considering the reviewer’s suggestion, we have clarified the time of CNO injection and the behavioral testing. In chemogenetic experiments, male voles were injected with CNO (1 mg/kg, i.p. injection) or saline once per day during 7-days cohabitation period. On day 3 and day 7 of cohabitation, the partner preference tests (3 h) were conducted after 3h of injection. Anton Pekcec (Jendryka et al., 2019) found that, in mice, after 60 min of CNO injection (i.p.), free CNO levels had dropped surprisingly sharply in CSF and cortex tissue, CNO could not be detected after 60 min. However, associated biological effects are reported to endure 6 - 24 h after CNO treatment (Farzi et al., 2018; Desloovere et al., 2019; Paretkar and Dimitrov, 2019). For example, René He et al. (Anacker et al., 2018) showed that chemogenetic inhibition of adult-born neurons in the vDG promotes susceptibility to social defeat stress by using of DREADDs for 10 days, whereas increasing neurogenesis confers resilience to chronic stress. Moreover, Ming-Ming Zhang et al. (Zhang et al., 2022) revealed that the selective activation or inhibition of the IC-BLA projection pathway strengthens or weakens the intensity of observational pain while the CNO (1 mg/kg) was i.p. injected into the infected mice on days 1, 3, 5, and 7 after virus expression. Furthermore, in study of James P Herman et al. (Nawreen et al., 2020) chronic inhibition of IL PV INs reduces passive and increases active coping behavior in FST. Therefore, we believe that 7-day CNO injections can produce chronic effects on MSNs and alters the formation of partner preferences.

      (10) Discussion: "The observed increase in DA release resulted in suppression of D2 neurons in the NAc shell". "In contrast, the rise in DA release increases D1 activity selectively in response to their partner after extended cohabitation." These statements would need to be weakened as causality is not shown here.

      Thanks for your rigorous consideration. We have reorganized the discussion in the revised manuscript.

      “The observed increase in DA release resulted in alterations in activities of D2 and D1 neurons in the NAc shell selectively in response to their partner after extended cohabitation.”

      (11) It would help if the order of supplementary figures would match their order of figures appearance in the result section.

      Thanks for your suggestion. We reorganized the order of appearance in the revised manuscript.

      (12) This may be beyond the focus of the study but it would be very interesting to know whether the physiological responses to partner contact are similarly observed in females.

      Thanks for your concern. It is regretful that we did not observe physiological responses of female to partner contact. We predict the females may show the similar response patterns to their partner. In the future, we will supplement the research on the mechanism of partner preferences in female voles.

      Reviewer #3 (Recommendations for the authors):

      The manuscript is evaluating changes in dopamine signaling in the nucleus accumbens following pair bonding and exposure to various stimuli in mandarin voles. The manuscript is generally wellwritten. The experiment designs seem strong, although there are missing details to fully evaluate them. The statistics are not completed correctly, and the statistical values are not reported making them even harder to evaluate. There are a lot of potential strengths in this research. However, my review is limited because I am limited in how to evaluate data interpretation when statistical analyses are not clear. I provide details below.

      Major

      (1) Statistics should be provided in the Results section. It is not clear how to evaluate the authors' interpretations without presenting the statistical data. What stats are being reported about viral expression in cells on lines 192-194? What posthocs? There is only one condition, so I assume the statistic was a one-sample t-test. The authors should report the t-value, df, and p-value. No post-hoc is needed. There are many issues like this, which makes reviewing this manuscript very difficult. If the statistics were not conducted properly and reported clearly, I do not have confidence that I can evaluate the author's interpretation of the results.

      Thanks for your suggestion. We report the t-value, df, and p-value in the Results section.

      (2) Statistical tests should be labeled correctly. ANOVAs (found in figure caption) for Figure 1 data are not repeated measures. Rather, they are one-way ANOVA (with stimulus as a within-subject variable).

      We used one-way ANOVA to analyze the changes in fluorescence signals in figure1-3. In the experiment, the changes in fluorescence signals of every subject were collected upon sniffing the partner, an unknown female, and an object. So, we used One-Way Repeated Measures ANOVA to analyze the data.

      (3) The protocol for behavioral assessment and stimulus presentation during fiber photometry recording is not clear. For example, the authors mention on line 662 that voles ate carrots during some of the recording sessions, but nothing else is described about the recording session. What was the order of stimulus presentation? What was the object provided? Why is eating carrots analyzed separately from object, partner, and stranger exposure?

      Response: Sorry for the confusing. The detailed description has been added. After 3 and 7 days of cohabitation, males were exposed to their partner or an unfamiliar female (each exposure lasted for 30 min) in random order in a clean social interaction cage. The changes in fluorescence signals during these social interactions with their partner, an unfamiliar vole of the opposite sex, or an object (Rubik's Cube) were collected and digitalized by CamFiberPhotometry software (ThinkerTech). To rule out that the difference in fluorescence signals was caused by the difference in virus expression at different time points, we used the same experimental strategy in new male mandarin voles and measured the fluorescence signal changes upon eating carrot after 3 and 7 days of cohabitation (The male mandarin voles were fasted for four hours before the test.). Since sniffing (object, partner, and stranger) and eating carrot were not tested in the same males, we analyzed sniffing and eating carrot separately.

      (4) Supplement figures would be better as figures instead of tables. Many effects are hard to interpret.

      As you suggested, we added the information of Supplement table1 in results.

      (5) Citations should be included to note when pair bonding occurs in mandarin voles.

      As you suggested, we added the citation in the revised manuscript.

      Minor

      (1) Add a citation for the statement that married people live longer than unmarried people (Lines 51-52).

      As you suggested, we added the citation in the revised manuscript.

      (2) There is a table labeling viral vectors, but the table is not titled properly or referenced in the methods section.

      Thanks for our careful checking. We reorganized the table title and the table was also cited in the revised manuscript.

      (3) Sentences on lines 608-610 and 610-612 seem redundant.

      This sentence was corrected.

      (4) This is a rather subjective statement "Carrots are voles' favorite food."

      We reorganized the sentence in the revised manuscript.

      "Carrots are voles' daily food."

      Anacker C, Luna VM, Stevens GS, Millette A, Shores R, Jimenez JC, Chen B, Hen R (2018) Hippocampal neurogenesis confers stress resilience by inhibiting the ventral dentate gyrus. Nature 559:98-102.

      Aragona BJ, Liu Y, Yu YJ, Curtis JT, Detwiler JM, Insel TR, Wang Z (2006) Nucleus accumbens dopamine differentially mediates the formation and maintenance of monogamous pair bonds. Nature neuroscience 9:133-139.

      Bock R, Shin JH, Kaplan AR, Dobi A, Markey E, Kramer PF, Gremel CM, Christensen CH, Adrover MF, Alvarez VA (2013) Strengthening the accumbal indirect pathway promotes resilience to compulsive cocaine use. Nature neuroscience 16:632-638.

      Brody AK, Armitage KB (1985) The effects of adult removal on dispersal of yearling yellow-bellied marmots. Canadian Journal of Zoology 63:2560-2564.

      Carvalho Poyraz F, Holzner E, Bailey MR, Meszaros J, Kenney L, Kheirbek MA, Balsam PD, Kellendonk C (2016) Decreasing Striatopallidal Pathway Function Enhances Motivation by Energizing the Initiation of Goal-Directed Action. The Journal of neuroscience : the official journal of the Society for Neuroscience 36:5988-6001.

      Castro DC, Berridge KC (2014) Opioid hedonic hotspot in nucleus accumbens shell: mu, delta, and kappa maps for enhancement of sweetness "liking" and "wanting". The Journal of neuroscience : the official journal of the Society for Neuroscience 34:4239-4250.

      Desloovere J, Boon P, Larsen LE, Merckx C, Goossens MG, Van den Haute C, Baekelandt V, De Bundel D, Carrette E, Delbeke J, Meurs A, Vonck K, Wadman W, Raedt R (2019) Longterm chemogenetic suppression of spontaneous seizures in a mouse model for temporal lobe epilepsy. Epilepsia 60:2314-2324.

      Echo JA, Lamonte N, Ackerman TF, Bodnar RJ (2002) Alterations in food intake elicited by GABA and opioid agonists and antagonists administered into the ventral tegmental area region of rats. Physiology & behavior 76:107-116.

      Farzi A, Lau J, Ip CK, Qi Y, Shi YC, Zhang L, Tasan R, Sperk G, Herzog H (2018) Arcuate nucleus and lateral hypothalamic CART neurons in the mouse brain exert opposing effects on energy expenditure. eLife 7.

      Gallo EF, Meszaros J, Sherman JD, Chohan MO, Teboul E, Choi CS, Moore H, Javitch JA, Kellendonk C (2018) Accumbens dopamine D2 receptors increase motivation by decreasing inhibitory transmission to the ventral pallidum. Nature communications 9:1086.

      Gingrich B, Liu Y, Cascio C, Wang Z, Insel TR (2000) Dopamine D2 receptors in the nucleus accumbens are important for social attachment in female prairie voles (Microtus ochrogaster). Behavioral neuroscience 114:173-183.

      Gosnell BA, Majchrzak MJ (1989) Centrally administered opioid peptides stimulate saccharin intake in nondeprived rats. Pharmacology, biochemistry, and behavior 33:805-810.

      Gosnell BA, Levine AS, Morley JE (1986) The stimulation of food intake by selective agonists of mu, kappa and delta opioid receptors. Life sciences 38:1081-1088.

      Greenwood PJ (1983) Mating systems and the evolutionary consequences of dispersal. The ecology of animal movement:116-131.

      Guillaumin MCC, Viskaitis P, Bracey E, Burdakov D, Peleg-Raibstein D (2023) Disentangling the role of NAc D1 and D2 cells in hedonic eating. Molecular psychiatry 28:3531-3547.

      Hikida T, Kimura K, Wada N, Funabiki K, Nakanishi S (2010) Distinct roles of synaptic transmission in direct and indirect striatal pathways to reward and aversive behavior. Neuron 66:896907.

      Hoglen NEG, Manoli DS (2022) Cupid's quiver: Integrating sensory cues in rodent mating systems. Frontiers in neural circuits 16:944895.

      Ims RA (1990) Determinants of natal dispersal and space use in grey-sided voles, Clethrionomys rufocanus : a combined field and laboratory experiment. Oikos 57:106-113.

      Jendryka M, Palchaudhuri M, Ursu D, van der Veen B, Liss B, Kätzel D, Nissen W, Pekcec A (2019) Pharmacokinetic and pharmacodynamic actions of clozapine-N-oxide, clozapine, and compound 21 in DREADD-based chemogenetics in mice. Scientific reports 9:4522.

      Kwak S, Jung MW (2019) Distinct roles of striatal direct and indirect pathways in value-based decision making. eLife 8.

      Liu Z, Le Q, Lv Y, Chen X, Cui J, Zhou Y, Cheng D, Ma C, Su X, Xiao L, Yang R, Zhang J, Ma L, Liu X (2022) A distinct D1-MSN subpopulation down-regulates dopamine to promote negative emotional state. Cell Res 32:139-156.

      Lobo MK, Nestler EJ (2011) The striatal balancing act in drug addiction: distinct roles of direct and indirect pathway medium spiny neurons. Front Neuroanat 5:41.

      Lobo MK, Covington HE, 3rd, Chaudhury D, Friedman AK, Sun H, Damez-Werno D, Dietz DM, Zaman S, Koo JW, Kennedy PJ, Mouzon E, Mogri M, Neve RL, Deisseroth K, Han MH, Nestler EJ (2010) Cell type-specific loss of BDNF signaling mimics optogenetic control of cocaine reward. Science (New York, NY) 330:385-390.

      Nawreen N, Cotella EM, Morano R, Mahbod P, Dalal KS, Fitzgerald M, Martelle S, Packard BA, Franco-Villanueva A, Moloney RD, Herman JP (2020) Chemogenetic Inhibition of Infralimbic Prefrontal Cortex GABAergic Parvalbumin Interneurons Attenuates the Impact of Chronic Stress in Male Mice. eNeuro 7.

      Pardo-Garcia TR, Garcia-Keller C, Penaloza T, Richie CT, Pickel J, Hope BT, Harvey BK, Kalivas PW, Heinsbroek JA (2019) Ventral Pallidum Is the Primary Target for Accumbens D1 Projections Driving Cocaine Seeking. The Journal of neuroscience : the official journal of the Society for Neuroscience 39:2041-2051.

      Paretkar T, Dimitrov E (2019) Activation of enkephalinergic (Enk) interneurons in the central amygdala (CeA) buffers the behavioral effects of persistent pain. Neurobiology of disease 124:364-372.

      Peciña S, Berridge KC (2000) Opioid site in nucleus accumbens shell mediates eating and hedonic 'liking' for food: map based on microinjection Fos plumes. Brain research 863:71-86.

      Peciña S, Berridge KC (2005) Hedonic hot spot in nucleus accumbens shell: where do mu-opioids cause increased hedonic impact of sweetness? The Journal of neuroscience : the official journal of the Society for Neuroscience 25:11777-11786.

      Peciña S, Berridge KC (2013) Dopamine or opioid stimulation of nucleus accumbens similarly amplify cue-triggered 'wanting' for reward: entire core and medial shell mapped as substrates for PIT enhancement. The European journal of neuroscience 37:1529-1540.

      Qu Y, Zhang L, Hou W, Liu L, Liu J, Li L, Guo X, Li Y, Huang C, He Z, Tai F (2024) Distinct medial amygdala oxytocin receptor neurons projections respectively control consolation or aggression in male mandarin voles. Nature communications 15:8139.

      Reynolds SM, Berridge KC (2001) Fear and feeding in the nucleus accumbens shell: rostrocaudal segregation of GABA-elicited defensive behavior versus eating behavior. The Journal of neuroscience : the official journal of the Society for Neuroscience 21:3261-3270.

      Solomon NG, Jacquot JJ (2002) Characteristics of resident and wandering prairie voles, Microtus ochrogaster. Canadian Journal of Zoology 80:951-955.

      Tai LH, Lee AM, Benavidez N, Bonci A, Wilbrecht L (2012) Transient stimulation of distinct subpopulations of striatal neurons mimics changes in action value. Nature neuroscience 15:1281-1289.

      Yamaguchi T, Wei D, Song SC, Lim B, Tritsch NX, Lin D (2020) Posterior amygdala regulates sexual and aggressive behaviors in male mice. Nature neuroscience 23:1111-1124.

      Ying L, Zhao J, Ye Y, Liu Y, Xiao B, Xue T, Zhu H, Wu Y, He J, Qin S, Jiang Y, Guo F, Zhang L, Liu N, Zhang L (2022) Regulation of Cdc42 signaling by the dopamine D2 receptor in a mouse model of Parkinson's disease. Aging cell 21:e13588.

      Yizhar O, Fenno LE, Davidson TJ, Mogri M, Deisseroth K (2011) Optogenetics in neural systems. Neuron 71:9-34.

      Zhan S, Qi Z, Cai F, Gao Z, Xie J, Hu J (2024) Oxytocin neurons mediate stress-induced social memory impairment. Current biology : CB 34:36-45.e34.

      Zhang M, Kelley AE (2000) Enhanced intake of high-fat food following striatal mu-opioid stimulation: microinjection mapping and fos expression. Neuroscience 99:267-277.

      Zhang MM et al. (2022) Glutamatergic synapses from the insular cortex to the basolateral amygdala encode observational pain. Neuron 110:1993-2008.e1996.

      Zhao J, Ying L, Liu Y, Liu N, Tu G, Zhu M, Wu Y, Xiao B, Ye L, Li J, Guo F, Zhang L, Wang H, Zhang L (2019) Different roles of Rac1 in the acquisition and extinction of methamphetamineassociated contextual memory in the nucleus accumbens. Theranostics 9:7051-7071.

      Znamensky V, Echo JA, Lamonte N, Christian G, Ragnauth A, Bodnar RJ (2001) gammaAminobutyric acid receptor subtype antagonists differentially alter opioid-induced feeding in the shell region of the nucleus accumbens in rats. Brain research 906:84-91.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The drug Ivermectin is used to effectively treat a variety of worm parasites in the world, however resistance to Ivermectin poses a rising challenge for this treatment strategy. In this study, the authors found that loss of the E3 ubiquitin ligase UBR-1 in the worm C. elegans results in resistance to Ivermectin. In particular, the authors found that ubr-1 mutants are resistant to the effects of Ivermectin on worm viability, body size, pharyngeal pumping, and locomotion. The authors previously showed that loss of UBR-1 disrupts homeostasis of the amino acid and neurotransmitter glutamate resulting in increased levels of glutamate in C. elegans. Here, the authors found that the sensitivity of ubr-1 mutants to Ivermectin can be restored if glutamate levels are reduced using a variety of different methods. Conversely, treating worms with exogenous glutamate to increase glutamate levels also results in resistance to Ivermectin supporting the idea that increased glutamate promotes resistance to Ivermectin. The authors found that the primary known targets of Ivermectin, glutamate-gated chloride channels (GluCls), are downregulated in ubr-1 mutants providing a plausible mechanism for why ubr-1 mutants are resistant to Ivermectin. Although it is clear that loss of GluCls can lead to resistance to Ivermectin, this study suggests that one potential mechanism to decrease GluCl expression is via disruption of glutamate homeostasis that leads to increased glutamate. This study suggests that if parasitic worms become resistant to Ivermectin due to increased glutamate, their sensitivity to Ivermectin could be restored by reducing glutamate levels using drugs such as Ceftriaxone in a combination drug treatment strategy.

      Strengths:

      (1) The use of multiple independent assays (i.e., viability, body size, pharyngeal pumping, locomotion, and serotonin-stimulated pharyngeal muscle activity) to monitor the effects of Ivermectin

      (2) The use of multiple independent approaches (got-1, eat-4, ceftriaxone drug, exogenous glutamate treatment) to alter glutamate levels to support the conclusion that increased glutamate in ubr-1 mutants contributes to Ivermectin resistance.

      Weaknesses:

      (1) The primary target of Ivermectin is GluCls so it is not surprising that alteration of GluCl expression or function would lead to Ivermectin resistance.

      (2) It remains to be seen what percent of Ivermectin-resistant parasites in the wild have disrupted glutamate homeostasis as opposed to mutations that more directly decrease GluCl expression or function.

      Thank you for your thoughtful and constructive comments. We completely agree with your observation that alterations in GluCl expression or function can lead to Ivermectin resistance. However, we would like to emphasize that our study highlights an additional mechanism: disruptions in glutamate homeostasis can also lead to decreased GluCl expression, thereby contributing to Ivermectin resistance. This mechanism, which has not been fully explored previously, offers new insights into the complexity of drug resistance and could have important implications for understanding the development of Ivermectin resistance in parasitic nematodes.

      As you pointed out, the role of disrupted glutamate homeostasis in wild parasitic populations and the proportion of resistant parasites with this mechanism remain unknown. We believe this uncertainty underlines the significance of our findings, as they suggest a novel avenue for studying Ivermectin resistance and for developing potential strategies to counteract it.

      We have incorporated this discussion into the revised manuscript to further enrich the context of our findings.

      Reviewer #2 (Public review):

      Summary:

      The authors provide a very thorough investigation of the role of UBR-1 in anthelmintic resistance using the non-parasitic nematode, C. elegans. Anthelmintic resistance to macrocyclic lactones is a major problem in veterinary medicine and likely just a matter of time until resistance emerges in human parasites too. Therefore, this study providing novel insight into the mechanisms of ivermectin resistance is particularly important and significant.

      Strengths:

      The authors use very diverse technologies (behavior, genetics, pharmacology, genetically encoded reporters) to dissect the role of UBR-1 in ivermectin resistance. Deploying such a comprehensive suite of tools and approaches provides exceptional insight into the mechanism of how UBR-1 functions in terms of ivermectin resistance.

      Weaknesses:

      I do not see any major weaknesses in this study. My only concern is whether the observations made by the authors would translate to any of the important parasitic helminthes in which resistance has naturally emerged in the field. This is always a concern when leveraging a non-parasitic nematode to shed light on a potential mechanism of resistance of parasitic nematodes, and I understand that it is likely beyond the scope of this paper to test some of their results in parasitic nematodes.

      Thank you for your kind words and positive feedback on our work. We greatly appreciate your acknowledgment of the diverse technologies and comprehensive approaches we utilized to uncover the role of UBR-1 in ivermectin resistance.

      Your concern about whether our findings in C. elegans translate to parasitic helminthes in which ivermectin resistance has naturally emerged is both valid and critical. This is indeed a key question we expect to figure out in future studies. Collaborating with parasitologists to investigate whether naturally occurring mutations in ubr-1 exist in parasitic and non-parasitic nematodes is a priority for us. We hope that these efforts will lead to meaningful discoveries that have a significant impact on both livestock management and medicine.

      Reviewer #3 (Public review):

      Summary:

      Li et al propose to better understand the mechanisms of drug resistance in nematode parasites by studying mutants of the model roundworm C. elegans that are resistant to the deworming drug ivermectin. They provide compelling evidence that loss-of-function mutations in the E3 ubiquitin ligase encoded by the UBR-1 gene make worms resistant to the effects of ivermectin (and related compounds) on viability, body size, pharyngeal pumping rate, and locomotion and that these mutant phenotypes are rescued by a UBR-1 transgene. They propose that the mechanism is resistance is indirect, via the effects of UBR-1 on glutamate production. They show mutations (vesicular glutamate transporter eat-4, glutamate synthase got-1) and drugs (glutamate, glutamate uptake enhancer ceftriaxone) affecting glutamate metabolism/transport modulate sensitivity to ivermectin in wild-type and ubr-1 mutants. The data are generally consistent with greater glutamate tone equating to ivermectin resistance. Finally, they show that manipulations that are expected to increase glutamate tone appear to reduce expression of the targets of ivermectin, the glutamate-gated chloride channels, which is known to increase resistance.

      There is a need for genetic markers of ivermectin resistance in livestock parasites that can be used to better track resistance and to tailor drug treatment. The discovery of UBR-1 as a resistance gene in C. elegans will provide a candidate marker that can be followed up in parasites. The data suggest Ceftriaxone would be a candidate compound to reverse resistance.

      Strengths:

      The strength of the study is the thoroughness of the analysis and the quality of the data. There can be little doubt that ubr-1 mutations do indeed confer ivermectin resistance. The use of both rescue constructs and RNAi to validate mutant phenotypes is notable. Further, the variety of manipulations they use to affect glutamate metabolism/transport makes a compelling argument for some kind of role for glutamate in resistance.

      Weaknesses:

      The proposed mechanism of ubr-1 resistance i.e.: UBR-1 E3 ligase regulates glutamate tone which regulates ivermectin receptor expression, is broadly consistent with the data but somewhat difficult to reconcile with the specific functions of the genes regulating glutamatergic tone. Ceftriaxone and eat-4 mutants reduce extracellular/synaptic glutamate concentrations by sequestering available glutamate in neurons, suggesting that it is extracellular glutamate that is important. But then why does rescuing ubr-1 specifically in the pharyngeal muscle have such a strong effect on ivermectin sensitivity? Is glutamate leaking out of the pharyngeal muscle into the extracellular space/synapse? Is it possible that UBR-1 acts directly on the avr-15 subunit, both of which are expressed in the muscle, perhaps as part of a glutamate sensing/homeostasis mechanism?

      Thank you for your insightful feedback and thought-provoking questions. These are excellent points that have prompted us to critically reconsider our findings and the proposed mechanism.

      Several potential explanations could be considered, although we currently lack direct evidence to support this hypothesis: (1) The pharynx likely plays a dominant role in ivermectin resistance, as previously reported (Dent et al., 1997; Dent et al., 2000), and overexpression of UBR-1 in the pharyngeal muscle may exhibit a strong effect on ivermectin sensitivity. (2) It is also possible that pharyngeal muscle cells have the capacity to release glutamate into the extracellular space, which could contribute to the observed effect. (3) Alternatively, UBR-1 expression in the pharyngeal muscle may regulate other indirect pathways affecting extracellular or synaptic glutamate concentrations.

      We also appreciate your suggestion that UBR-1 may act directly on AVR-15 in the pharynx. While this is an interesting possibility, UBR-1 is an E3 ubiquitin ligase, and if AVR-15 were a direct target, we would expect UBR-1 to ubiquitinate AVR-15 and promote its degradation. In this case, loss of UBR-1 should inhibit AVR-15 ubiquitination, reduce its degradation, and lead to increased AVR-15 protein levels in the pharynx. However, our experimental data show a reduction, rather than an increase, in AVR-15::GFP levels in ubr-1 mutants (Figure 4A). This observation suggests that AVR-15 is less likely to be a direct target of UBR-1. To definitively address this hypothesis, a direct assessment of AVR-15 ubiquitination levels in wild-type and ubr-1 mutant backgrounds would be needed. We agree that this is an important avenue for future investigation.

      The use of single ivermectin dose assays can be misleading. A response change at a single dose shows that the dose-response curve has shifted, but the response is not linear with dose, so the degree of that shift may be difficult to discern and may result from a change in slope but not EC50. Similarly, in Figure 3C, the reader is meant to understand that eat-4 mutant is epistatic to ubr-1 because the double mutant has a wild-type response to ivermectin. But eat-4 alone is more sensitive, so (eyeballing it and interpolating) the shift in EC50 caused by the ubr-1 mutant in a wild type background appears to be the same as in an eat-4 background, so arguably you are seeing an additive effect, not epistasis. For the above reasons, it would be desirable to have results for rescuing constructs in a wild-type background, in addition to the mutant background.

      Thank you for your detailed feedback and observations.

      The potential additive effect you noted in Figure 3C appears to be specific to the body length analysis. In our other three ivermectin resistance assays (viability, pumping rate, and locomotion velocity), this additive effect was not observed. A possible explanation for this is that eat-4 and got-1 single mutants inherently exhibit reduced body length compared to wild-type worms (Mörck and Pilon 2006; Greer et al. 2008; Chitturi et al. 2018), which may give the appearance of an additive effect in this particular assay.

      Regarding the use of rescuing constructs, we performed these experiments in the ubr-1;got-1 and ubr-1;eat-4 double mutant backgrounds. This was designed to test whether the suppression of ubr-1-mediated ivermectin resistance by got-1 or eat-4 mutations is indeed due to the functional activity of GOT-1 and EAT-4, respectively. The choice of this setup was to ensure that the double mutant phenotype was fully addressed. In contrast, rescuing constructs of GOT-1 and/or EAT-4 in a wild-type background might not sufficiently reveal the relationship between GOT-1, EAT-4, and UBR-1. However, we are open to further testing your suggestion in the future.

      To aid in the interpretation and clarify the apparent effects, we have revised Figure 3 annotation to clearly represent the data and the comparisons being made. We hope this adjustment makes the results more straightforward and easier for readers to understand.

      The added value of the pumping data in Figure 5 (using calcium imaging) over the pump counts (from video) in Figure 1G, Figure 2E, F, K, & Figure 3D, H is not clearly explained. It may have to do with the use of "dissected" pharynxes, the nature/advantage of which is not sufficiently documented in the Methods/Results.

      Thank you for pointing this out. The behavioral pumping data in Figure 1G, Figure 2E, F, K, & Figure 3D and calcium imaging data in Figure 5 were obtained under different experimental conditions. Specifically, the behavioral assays (pumping rate) were conducted on standard culture plates with freely moving worms, whereas the calcium imaging experiments were performed in a liquid environment with immobilized worms. In the calcium imaging setup, the dissection refers to gently puncturing the epidermis behind head of the worm with a glass electrode to relieve internal pressure, which aids in stabilizing the calcium imaging process and ensures better visualization of pharyngeal muscle activity.

      We compared the pharyngeal muscle activity of worms that were not subjected to puncturing the epidermis and found no significant difference when activated by 20 mM serotonin. Therefore, we speculate that there is no direct interaction between the bath solution and the pharynx or head neurons. To avoid confusion, we have removed the term "dissected" from the manuscript and added additional experimental details in the Methods section.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The authors propose that ubr-1 mutants are resistant to ivermectin due to persistent elevation of glutamate that leads to a compensatory reduction in GluCl levels and thus resistance to Ivermectin. This model would be strengthened by experiments more directly connecting glutamate, GluCls and Ivermectin sensitivity. For example, does overexpression of a relevant GluCl such as AVR-15 restore Ivermectin sensitivity to ubr-1 mutants? Does Ceftriaxone treatment affect the Ivermectin resistance of worms lacking the relevant GluCls (i.e., avr-15, avr-14 and glc-1)? - The model suggests that Ceftriaxone treatment would have no effect in the latter case.

      Thank you for your valuable suggestion. Based on your recommendation, we have performed two additional experiments to strengthen our model. First, we conducted an overexpression experiment of AVR-15 and found that it significantly, though partially, restored ivermectin sensitivity in ubr-1 mutants (p < 0.01, Supplemental Figure S5D). Second, we tested the effect of Ceftriaxone treatment on the IVM resistance of avr-15; avr-14; glc-1 triple mutants, which encode the most critical glutamate receptors involved in IVM sensitivity. As expected, we found that Ceftriaxone treatment did not alter the IVM resistance in these triple mutants (Supplemental Figure S5E), supporting the idea that these specific GluCls are key to the observed resistance.

      These two experiments provide further support for our proposed model. We have integrated the results into the manuscript, updating the Results section and Supplemental Figure S5D, E, as well as the corresponding Figure Legends.

      (2) Line 211 - Ceftriaxone is known to upregulate EAAT2 expression in mammals. Do the authors know if the drug also increases EAAT expression in C. elegans?

      Thank you for raising this point. To our knowledge, this is the first study to demonstrate the antagonistic effect of ceftriaxone on ivermectin resistance in C. elegans, particularly in the context of ubr-1-mediated resistance. Ceftriaxone enhances glutamate uptake by increasing the expression of excitatory amino acid transporter-2 (EAAT2) in mammals (Rothstein et al., 2005, Lee et al., 2008). C. elegans has six glutamate transporters encoded by glt-1 and glt-3–7 (Mano et al. 2007).

      Compared to testing whether ceftriaxone increases the expression of these EAATs in C. elegans, identifying which specific glt gene targeted by ceftriaxone may better reveal its mechanism of action. To investigate this, we performed a genetic analysis. In the ubr-1 mutant, we individually deleted the six glt genes and found that ceftriaxone’s ability to restore ivermectin sensitivity was specifically suppressed in the ubr-1; glt-1 and ubr-1; glt-5 double mutants (Author response image 1A). This suggests that glt-1 and glt-5 may be the targets of ceftriaxone in C. elegans. In contrast,  ivermectin sensitivity was unaffected in the individual glt mutants (Author response image 1B), indicating that a single glt deletion may not be sufficient to alter glutamate level or induce GluRs downregulation. Further studies are needed to determine whether ceftriaxone directly increases GLT-1 and GLT-5 expression in C. elegans and to explore the underlying mechanisms.

      Author response image 1.

      Glutamate transporter removal inhibits ceftriaxone-mediated restoration of ivermectin sensitivity in ubr-1. (A) Compared to the ubr-1 mutants, the ubr-1; glt-1 and ubr-1; glt-5 double mutants show enhanced ivermectin resistance under ceftriaxone treatment. (B) The glt mutants do not show resistance to ivermectin. ****p < 0.0001; one-way ANOVA test.

      (3) Line 64 - as part of the rationale for the study, the authors state that "...increasing reports of unknown causes of IVM resistance continue to emerge...suggesting that additional unknown mechanisms are awaiting investigation." While this may be true, the ultimate conclusion from this study is that decreasing expression of Ivermectin-targeted GluCls causes Ivermectin resistance, which is a known mechanism. The field already knows that Ivermectin targets GluCls and thus decreasing GluCl expression or function would lead to Ivermectin resistance. The authors may want to edit the sentence mentioned above for clarity.

      Thanks for the suggestion. We have revised the sentence for clarity: “…, suggesting that previously unrecognized or additional mechanisms regulating GluCls expression may await further investigation.” This revision better reflects the focus on GluCl regulation and clarifies the potential for additional mechanisms to be explored.

      (4) The introduction to the serotonin-stimulated pharyngeal Calcium imaging section is a little confusing. The role of the various GluCls in pharyngeal pumping should be defined/clarified in the introduction to the last section (lines 337-341).

      Thanks. We have revised and clarified the introduction as follows: “GluCls downregulation was functionally validated by the diminished IVM-mediated inhibition of serotonin-activated pharyngeal Ca2+ activity observed in ubr-1 mutants. ”

      Additionally, the role of the various GluCls in pharyngeal pumping has been clarified:

      “Using translational reporters, we found that IVM resistance in ubr-1 mutants is caused by the functional downregulation of IVM-targeted GluCls, including AVR-15, AVR-14, and GLC-1. These receptors are activated by glutamate to facilitate chloride ion influx into pharyngeal muscle cells, resulting in the inhibition of muscle contractions and the suppression of food intake in C. elegans. ”

      We hope these revisions address the concerns raised and improve the clarity of this section.

      (5) The color code key on the right-hand side of the Raster Plots in Figure 1H should be made larger for clarity.

      Revised.

      (6) In Figure S3, a legend should be included to define the black and blue box plots.

      Thank you for your comment. We have added the following clarification to the figure legend: “Black plots: wild-type, blue plots: ubr-1 mutants.” This should now make the distinction between the two groups clear.

      (7) Figure S4, the brackets above the graphs are misleading. It is not clear which comparisons are being made.

      Thank you for your feedback. We have clarified the figure by updating the legend to include the statement: “All statistical analyses were performed against the ubr-1 mutant.” This clarification is now also included in Figure 3F-I to ensure consistency and avoid any confusion regarding the comparisons being made.

      Reviewer #2 (Recommendations for the authors):

      (1) In Figure 1A: the "trails" table needs more clarification to orient the reader.

      To improve clarity and better orient the reader, we have updated Figure 1A by explicitly adding the number of trials and including a statistical analysis of the viability of wild-type and ubr-1 mutants under different ML conditions. In Figure 1A legend, we have added “we used shades of red to represent worm viability on each experimental plate (n = 50 animals per plate), with darker shades indicating lower survival rates. The viability test was repeated at least 5 times (5 trials).”. These modifications aim to provide a clearer understanding of the data presentation and its significance.

      (2) In Figure S2: it would benefit the reader to include the major human parasitic nematodes in the phylogeny and include a discussion of the conservation.

      Thank you for your insightful comment. In Figure S2A, we have included the human parasitic nematodes Onchocerca volvulus, Brugia malayi, and Toxocara canis. Unfortunately, other major human parasitic nematodes, such as Ascaris lumbricoides (roundworm), Ancylostoma duodenale (hookworm), and Trichuris trichiura (whipworm), currently lack reported homologs of the ubr-1 gene.

      To provide some context, Onchocerca volvulus is a leading cause of infectious blindness globally, affecting millions of people, while Brugia malayi causes lymphatic filariasis, a significant tropical disease. Toxocara canis is a zoonotic parasite responsible for serious human syndromes such as visceral and ocular larval migration. Ivermectin remains a primary treatment for these parasitic infections.

      Interestingly, while we have identified relevant sequences in Onchocerca volvulus, Brugia malayi, and Toxocara canis, potential mutations in ubr-1-like genes in these parasitic nematodes may lead to ivermectin resistance. Sequence comparison analysis could shed light on the risks of such mutations and their relevance to ivermectin treatment failure, warranting further attention. We have added a discussion of this potential risk in the manuscript.

      Reviewer #3 (Recommendations for the authors):

      Minor corrections/suggestions:

      (1) The level of resistance in ubr-1 is similar to dyf genes. Should double-check ubr-1 mutant is not dyf.

      Thank you for your insightful suggestion. We are also interested in this point and designed the following experiments. We first directly tested the Dyf phenotype of ubr-1 using standard DIO dye staining (Author response image 2A) and found that ubr-1 clearly show a "dye filling defective" phenotype (Author response image 2B). This raises an interesting question: Could the IVM resistance observed in ubr-1 be due to its Dyf defect? To address this, we further performed experiment by using Ceftriaxone to test ubr-1’s Dyf phenotype. Ceftriaxone can fully rescue the sensitivity of ubr-1 to IVM (Figure 2). If IVM resistance observed in ubr-1 is due to its Dyf defect, we should observe same rescued Dyf defect. After treating ubr-1 mutants with Ceftriaxone (50 μg/mL) until L4 stage, we again performed DIO dye staining and found that while Ceftriaxone fully rescued IVM resistance in ubr-1, it did not rescue the Dyf defect (Author response image 2C). These results suggest that while ubr-1 has a Dyf defect, it is unlikely the primary cause of the IVM resistance in ubr-1 mutant.

      Author response image 2.

      ubr-1 mutant is not dyf. (A) Depiction of the DIO dye-staining assays. Diagram is adapted from (Power et al. 2020). (B) ubr-1 mutant exhibits obvious Dyf phenotype. (C) Cef treatment (50 μg/mL) does not alter the ubr-1 Dyf defect phenotype. Scale bar, 20 µm.

      (2) 367 "in IVM" superscript.

      (3) 429 ubr-1 italics.

      Thanks, revised.

      (4) Methods: Need more info on dissection: if there is direct interaction of bath with pharynx, as suggested by bath solution, then 5HT concentrations are too high. Direct exposure to 20mM 5HT will kill a pharynx. 20uM 5HT?

      Thank you for your comment. We have reviewed our experimental records and confirmed that the concentration mentioned in the manuscript is correct. In our experiment, the dissection refers to gently puncturing the epidermis behind head of the worm with a glass electrode to relieve internal pressure, which helps stabilize the calcium imaging process. In fact, there is no direct interaction between the bath solution and the pharynx or head neurons. We have revised the Methods section to clarify this point.

      (5) Figure 2: Meaning of "Trials" arrow on grid y-axis is not immediately obvious to me. Would prefer you just label/number individual trials.

      Sure, we have labeled the trails accordingly in revised Figure 1, 2, and Figure S1.

      (6) Figure 3: Legend should include [IVM]. Meaning of +EAT-4, +GOT-1 should be described in the legend.

      Thank you for your suggestion. We have updated the figure legend to include the IVM concentration (5 ng/mL). Additionally, we have clarified the meaning of +EAT-4 and +GOT-1 in the legend with the description: “…whereas the re-expression of GOT-1 (+GOT-1) and EAT-4 (+EAT-4) partially reinstated IVM resistance in the respective double mutants.” This ensures the figure is more informative and accessible to the reader.

      (7) 784 signalling pathway should just be pathway.

      Thanks, revised.

      (8) Line 811 " Both types of motor neurons are innervated by serotonin (5 -HT)." Innervated by serotonergic "neurons"? However, even that is misleading because serotonin is not necessarily synaptic.

      Thank you for your comment. We have revised the sentence to: “Both types of motor neurons could be activated by serotonin (5-HT).” This clarification better reflects the role of serotonin in modulating motor neuron activity.

      (9) Line 814 puffing or perfusion. Perfusion seems more accurate. Make the figure consistent.

      Thanks, revised.

      (10) Figure S1 requires an x axis label with better explanation.

      Thank you for your feedback. We have revised Figure S1 and added "x-axis" to clarify that it represents the trail number. Additionally, we have updated the figure legend to include the experimental conditions: “The shades of red represent worm viability, with darker shades indicating lower survival rates, based on 100 animals per plate and at least 5 trials.”

      (11) Figure S2 C-F needs ivermectin concentration.

      (12) Line 865 plants -> plates?

      Thanks, revised.

      (13) Figure S4. 875 "Rescue of IVM sensitivity of the ubr-1 mutant by the UBR-1 genomic fragment." Wrong title? Describes GFP expression and RNAi experiments.

      Thank you for pointing out the mistake in the title. We have revised the title to: “Knockdown of UBR-1 induces IVM resistance phenotypes.” Additionally, we have updated the figure description to include details about GFP expression and RNAi experiments. The GFP expression is now described as: “Expression of functional UBR-1::GFP, driven by its endogenous promoter, was observed predominantly in the pharynx, head neurons, and body wall muscles with weaker expression detected in vulval muscles and the intestine.” The RNAi experiments are described as: “Double-stranded RNA (dsRNA) interference was employed to suppress gene expression in specific tissues (Methods).”